VPA Quick OOM not working as expected #7867

Open
ads79 opened this issue Feb 25, 2025 · 4 comments
Labels
area/vertical-pod-autoscaler, kind/bug

Comments

@ads79

ads79 commented Feb 25, 2025

Which component are you using?: vertical-pod-autoscaler

/area vertical-pod-autoscaler

What version of the component are you using?:
Component version: 1.3.0 and 1.1.2

What k8s version are you using (kubectl version)?:

kubectl version output:

$ kubectl version
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.29.13-eks-8cce635

What environment is this in?:

AWS EKS

What did you expect to happen?:

A quick OOM to be detected and the requests bumped by the configured values.

What happened instead?:
The VPA decided to do nothing:

I0225 11:25:12.856726       1 update_priority_calculator.go:115] "Quick OOM detected in pod" pod="test/test-pod1-79dcf8655-r7sl9" containerName="test-pod1"
I0225 11:25:12.856735       1 update_priority_calculator.go:141] "Not updating pod because resource would not change" pod="test/test-pod1-79dcf8655-r7sl9"

How to reproduce it (as minimally and precisely as possible):

Trace what happens when a pod gets OOMKilled.
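
For example, one way to trigger a quick OOM for tracing (an illustrative sketch; the 64MiB step and the pod's memory limit, say 256Mi, are arbitrary assumptions) is a throwaway pod running a tiny allocator like this:

// memhog.go: hypothetical helper, not part of VPA. Run it in a container whose
// memory limit is well below what it allocates; the kernel OOM killer then
// terminates it shortly after start, which the vpa-updater flags as a quick OOM.
package main

import "time"

func main() {
	var hog [][]byte
	for {
		chunk := make([]byte, 64<<20) // allocate 64MiB per iteration
		for i := range chunk {
			chunk[i] = 1 // touch the pages so they become resident
		}
		hog = append(hog, chunk)
		time.Sleep(100 * time.Millisecond)
	}
}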

Anything else we need to know?:

The vpa-recommender is configured to take 1 week of usage into account:

      - args:
        - --address=:8942
        - --kube-api-burst=30
        - --kube-api-qps=15
        - --memory-aggregation-interval=168h0m0s
        - --memory-histogram-decay-half-life=168h0m0s
        - --oom-bump-up-ratio=2
        - --oom-min-bump-up-bytes=5.24288e+08
        - --v=4

Initial pod sizes are small (25m CPU and 300MB of RAM), hence the 2x or 500MiB bump in RAM on OOM.
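
For reference, a rough sketch (not VPA code) of the memory sample a single quick OOM should feed into the recommender with these flags, mirroring the max(used + minBump, used * ratio) formula from ContainerState.RecordOOM quoted further down in this thread:

package main

import (
	"fmt"
	"math"
)

func main() {
	// Values taken from the flags and pod sizes above; illustrative only.
	const (
		memoryUsed     = 300e6     // ~300MB of RAM currently requested/used
		oomMinBumpUp   = 5.24288e8 // --oom-min-bump-up-bytes (500MiB)
		oomBumpUpRatio = 2.0       // --oom-bump-up-ratio
	)
	// Mirrors max(used + minBump, used * ratio) from the recommender's OOM handling.
	memoryNeeded := math.Max(memoryUsed+oomMinBumpUp, memoryUsed*oomBumpUpRatio)
	fmt.Printf("expected OOM sample: %.0f bytes (~%.0fMB)\n", memoryNeeded, memoryNeeded/1e6)
	// Output: expected OOM sample: 824288000 bytes (~824MB)
}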

ads79 added the kind/bug label on Feb 25, 2025
@adrianmoisey
Member

Interesting, I happen to be testing this in a cluster today. I'll leave my findings here
/assign

@ads79
Author

ads79 commented Feb 25, 2025

@adrianmoisey, please let me know what you find... interestingly, we should only get that error message if ResourceDiff == 0:

	// If the pod has quick OOMed then evict only if the resources will change
	if quickOOM && updatePriority.ResourceDiff == 0 {
		klog.V(4).InfoS("Not updating pod because resource would not change", "pod", klog.KObj(pod))
		return
	}


But if we've told it to bump on OOM, shouldn't the ResourceDiff then be > 0?

@adrianmoisey
Member

Turns out the issue I was having is unrelated.

Going to unassign myself from this for now.

/unassign

@voelzmo
Contributor

voelzmo commented Feb 26, 2025

Hey @ads79, thanks for the detailed logs and parameters!
You're looking at the logs of the vpa-updater, and its behavior might actually make sense:

This first line indicates that the updater found a Pod that was OOMKilled recently:

I0225 11:25:12.856726       1 update_priority_calculator.go:115] "Quick OOM detected in pod" pod="test/test-pod1-79dcf8655-r7sl9" containerName="test-pod1"

Afterwards, it compares the current recommendation against the currently set requests for the Pod and checks whether there is a difference. In other words: has the vpa-recommender already taken the OOMKill into account and bumped the memory according to your settings?
The vpa-updater finds that this hasn't happened (yet) and therefore doesn't evict the Pod. Evicting would be pointless, because the replacement Pod would come up with the same requests it has right now.
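
Put differently, a simplified sketch of that check (not the actual updater code; the function name here is made up):

// shouldEvictAfterQuickOOM illustrates the check described above: after a
// quick OOM the updater only evicts once the recommendation actually differs
// from what the Pod currently requests.
func shouldEvictAfterQuickOOM(currentRequests, currentRecommendation int64) bool {
	// Right after the OOMKill the recommender typically hasn't folded the OOM
	// sample into its recommendation yet, so both values are still equal,
	// ResourceDiff is 0, and the eviction is skipped.
	return currentRecommendation != currentRequests
}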

So the interesting part is the logs of the vpa-recommender, where something like this should happen: an OOMKill event is observed and parsed:

func (o *observer) OnEvent(event *apiv1.Event) {
	klog.V(1).InfoS("OOM Observer processing event", "event", event)
	for _, oomInfo := range parseEvictionEvent(event) {
		o.observedOomsChannel <- oomInfo
	}
}

and then added as a new sample to the memory histogram:

case oomInfo := <-feeder.oomChan:
	klog.V(3).InfoS("OOM detected", "oomInfo", oomInfo)
	if err = feeder.clusterState.RecordOOM(oomInfo.ContainerID, oomInfo.Timestamp, oomInfo.Memory); err != nil {
		klog.V(0).InfoS("Failed to record OOM", "oomInfo", oomInfo, "error", err)
	}

and

func (cluster *ClusterState) RecordOOM(containerID ContainerID, timestamp time.Time, requestedMemory ResourceAmount) error {
	pod, podExists := cluster.Pods[containerID.PodID]
	if !podExists {
		return NewKeyError(containerID.PodID)
	}
	containerState, containerExists := pod.Containers[containerID.ContainerName]
	if !containerExists {
		return NewKeyError(containerID.ContainerName)
	}
	err := containerState.RecordOOM(timestamp, requestedMemory)
	if err != nil {
		return fmt.Errorf("error while recording OOM for %v, Reason: %v", containerID, err)
	}
	return nil
}

At the container aggregation level you can see your configured bump-up ratio and min bump values being applied:

func (container *ContainerState) RecordOOM(timestamp time.Time, requestedMemory ResourceAmount) error {
	// Discard old OOM
	if timestamp.Before(container.WindowEnd.Add(-1 * GetAggregationsConfig().MemoryAggregationInterval)) {
		return fmt.Errorf("OOM event will be discarded - it is too old (%v)", timestamp)
	}
	// Get max of the request and the recent usage-based memory peak.
	// Omitting oomPeak here to protect against recommendation running too high on subsequent OOMs.
	memoryUsed := ResourceAmountMax(requestedMemory, container.memoryPeak)
	memoryNeeded := ResourceAmountMax(memoryUsed+MemoryAmountFromBytes(GetAggregationsConfig().OOMMinBumpUp),
		ScaleResource(memoryUsed, GetAggregationsConfig().OOMBumpUpRatio))
	oomMemorySample := ContainerUsageSample{
		MeasureStart: timestamp,
		Usage:        memoryNeeded,
		Resource:     ResourceMemory,
	}
	if !container.addMemorySample(&oomMemorySample, true) {
		return fmt.Errorf("adding OOM sample failed")
	}
	return nil
}

Are you seeing any errors in the vpa-recommender logs indicating that the OOMKill event isn't added correctly?
Also interesting: is the OOMKill done by the kubelet or the operating system?
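
If it helps, here is one way to check which of the two happened (a sketch using client-go, assuming kubeconfig access; the pod name and namespace are taken from the logs above): a kernel/cgroup OOM kill shows up as an OOMKilled termination reason on the container, while a kubelet memory-pressure eviction shows up as an Evicted event on the Pod.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical pod coordinates taken from the updater log above.
	const namespace, podName = "test", "test-pod1-79dcf8655-r7sl9"

	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pod, err := clientset.CoreV1().Pods(namespace).Get(context.TODO(), podName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Kernel/cgroup OOM kill: the container's last termination state reports OOMKilled.
	for _, cs := range pod.Status.ContainerStatuses {
		if t := cs.LastTerminationState.Terminated; t != nil && t.Reason == "OOMKilled" {
			fmt.Printf("container %s was OOMKilled by the OS at %s\n", cs.Name, t.FinishedAt)
		}
	}

	// Kubelet memory-pressure eviction instead shows up as an Evicted event on the pod.
	events, err := clientset.CoreV1().Events(namespace).List(context.TODO(), metav1.ListOptions{
		FieldSelector: fmt.Sprintf("involvedObject.name=%s,reason=Evicted", podName),
	})
	if err != nil {
		panic(err)
	}
	for _, e := range events.Items {
		fmt.Printf("eviction event: %s\n", e.Message)
	}
}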
