VPA Quick OOM not working as expected #7867

Open
ads79 opened this issue Feb 25, 2025 · 4 comments
Labels
area/vertical-pod-autoscaler, kind/bug

Comments

@ads79

ads79 commented Feb 25, 2025

Which component are you using?: vertical-pod-autoscaler

/area vertical-pod-autoscaler

What version of the component are you using?:
Component version: 1.3.0 and 1.1.2

What k8s version are you using (kubectl version)?:

kubectl version output:

$ kubectl version
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.29.13-eks-8cce635

What environment is this in?:

AWS EKS

What did you expect to happen?:

A quick OOM to be detected and the requests bumped by the configured values.

What happened instead?:
The VPA decided to do nothing:

I0225 11:25:12.856726       1 update_priority_calculator.go:115] "Quick OOM detected in pod" pod="test/test-pod1-79dcf8655-r7sl9" containerName="test-pod1"
I0225 11:25:12.856735       1 update_priority_calculator.go:141] "Not updating pod because resource would not change" pod="test/test-pod1-79dcf8655-r7sl9"

How to reproduce it (as minimally and precisely as possible):

Trace what happens when a pod gets OOMKilled.
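
For example, one way to trigger a quick OOM for tracing (an illustrative sketch; the 64MiB step and the pod's memory limit, say 256Mi, are arbitrary assumptions) is a throwaway pod running a tiny allocator like this:

// memhog.go: hypothetical helper, not part of VPA. Run it in a container whose
// memory limit is well below what it allocates; the kernel OOM killer then
// terminates it shortly after start, which the vpa-updater flags as a quick OOM.
package main

import "time"

func main() {
	var hog [][]byte
	for {
		chunk := make([]byte, 64<<20) // allocate 64MiB per iteration
		for i := range chunk {
			chunk[i] = 1 // touch the pages so they become resident
		}
		hog = append(hog, chunk)
		time.Sleep(100 * time.Millisecond)
	}
}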

Anything else we need to know?:

The vpa-recommender is configured to take 1 week of usage into account:

      - args:
        - --address=:8942
        - --kube-api-burst=30
        - --kube-api-qps=15
        - --memory-aggregation-interval=168h0m0s
        - --memory-histogram-decay-half-life=168h0m0s
        - --oom-bump-up-ratio=2
        - --oom-min-bump-up-bytes=5.24288e+08
        - --v=4

Initial pod sizes are small (25m CPU and 300MB of RAM), hence the 2x or 500MiB bump in RAM on OOM.
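
For reference, a rough sketch (not VPA code) of the memory sample a single quick OOM should feed into the recommender with these flags, mirroring the max(used + minBump, used * ratio) formula from ContainerState.RecordOOM quoted further down in this thread:

package main

import (
	"fmt"
	"math"
)

func main() {
	// Values taken from the flags and pod sizes above; illustrative only.
	const (
		memoryUsed     = 300e6     // ~300MB of RAM currently requested/used
		oomMinBumpUp   = 5.24288e8 // --oom-min-bump-up-bytes (500MiB)
		oomBumpUpRatio = 2.0       // --oom-bump-up-ratio
	)
	// Mirrors max(used + minBump, used * ratio) from the recommender's OOM handling.
	memoryNeeded := math.Max(memoryUsed+oomMinBumpUp, memoryUsed*oomBumpUpRatio)
	fmt.Printf("expected OOM sample: %.0f bytes (~%.0fMB)\n", memoryNeeded, memoryNeeded/1e6)
	// Output: expected OOM sample: 824288000 bytes (~824MB)
}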

ads79 added the kind/bug label on Feb 25, 2025
@adrianmoisey
Member

Interesting, I happen to be testing this in a cluster today. I'll leave my findings here
/assign

@ads79
Author

ads79 commented Feb 25, 2025

@adrianmoisey, please let me know what you find... interestingly, we should only get that error message if ResourceDiff == 0:

	// If the pod has quick OOMed then evict only if the resources will change
	if quickOOM && updatePriority.ResourceDiff == 0 {
		klog.V(4).InfoS("Not updating pod because resource would not change", "pod", klog.KObj(pod))
		return
	}


But if we've told it to bump on OOM, shouldn't the ResourceDiff then be > 0?

@adrianmoisey
Member

Turns out the issue I was having is unrelated.

Going to unassign myself from this for now.

/unassign

@voelzmo
Contributor

voelzmo commented Feb 26, 2025

Hey @ads79, thanks for the detailed logs and parameters!
You're looking at the logs of the vpa-updater, and its behavior might actually make sense:

This first line indicates that the updater found a Pod that was OOMKilled recently:

I0225 11:25:12.856726       1 update_priority_calculator.go:115] "Quick OOM detected in pod" pod="test/test-pod1-79dcf8655-r7sl9" containerName="test-pod1"

Afterwards, it compares the current recommendation against the currently set requests for the Pod and checks whether there is a difference. In other words: has the vpa-recommender already taken the OOMKill into account and bumped the memory according to your settings?
The vpa-updater finds that this hasn't happened (yet) and therefore doesn't evict the Pod. Evicting would be pointless, because the replacement Pod would come up with the same requests it has right now.
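
Put differently, a simplified sketch of that check (not the actual updater code; the function name here is made up):

// shouldEvictAfterQuickOOM illustrates the check described above: after a
// quick OOM the updater only evicts once the recommendation actually differs
// from what the Pod currently requests.
func shouldEvictAfterQuickOOM(currentRequests, currentRecommendation int64) bool {
	// Right after the OOMKill the recommender typically hasn't folded the OOM
	// sample into its recommendation yet, so both values are still equal,
	// ResourceDiff is 0, and the eviction is skipped.
	return currentRecommendation != currentRequests
}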

So the interesting part is the logs of the vpa-recommender, where something like this should happen: an OOMKill event is observed and parsed:

func (o *observer) OnEvent(event *apiv1.Event) {
	klog.V(1).InfoS("OOM Observer processing event", "event", event)
	for _, oomInfo := range parseEvictionEvent(event) {
		o.observedOomsChannel <- oomInfo
	}
}

and then added as a new sample to the memory histogram:

case oomInfo := <-feeder.oomChan:
	klog.V(3).InfoS("OOM detected", "oomInfo", oomInfo)
	if err = feeder.clusterState.RecordOOM(oomInfo.ContainerID, oomInfo.Timestamp, oomInfo.Memory); err != nil {
		klog.V(0).InfoS("Failed to record OOM", "oomInfo", oomInfo, "error", err)
	}

and

func (cluster *ClusterState) RecordOOM(containerID ContainerID, timestamp time.Time, requestedMemory ResourceAmount) error {
	pod, podExists := cluster.Pods[containerID.PodID]
	if !podExists {
		return NewKeyError(containerID.PodID)
	}
	containerState, containerExists := pod.Containers[containerID.ContainerName]
	if !containerExists {
		return NewKeyError(containerID.ContainerName)
	}
	err := containerState.RecordOOM(timestamp, requestedMemory)
	if err != nil {
		return fmt.Errorf("error while recording OOM for %v, Reason: %v", containerID, err)
	}
	return nil
}

At the container aggregation level you can see your configured bump-up ratio and min bump values being applied:

func (container *ContainerState) RecordOOM(timestamp time.Time, requestedMemory ResourceAmount) error {
	// Discard old OOM
	if timestamp.Before(container.WindowEnd.Add(-1 * GetAggregationsConfig().MemoryAggregationInterval)) {
		return fmt.Errorf("OOM event will be discarded - it is too old (%v)", timestamp)
	}
	// Get max of the request and the recent usage-based memory peak.
	// Omitting oomPeak here to protect against recommendation running too high on subsequent OOMs.
	memoryUsed := ResourceAmountMax(requestedMemory, container.memoryPeak)
	memoryNeeded := ResourceAmountMax(memoryUsed+MemoryAmountFromBytes(GetAggregationsConfig().OOMMinBumpUp),
		ScaleResource(memoryUsed, GetAggregationsConfig().OOMBumpUpRatio))
	oomMemorySample := ContainerUsageSample{
		MeasureStart: timestamp,
		Usage:        memoryNeeded,
		Resource:     ResourceMemory,
	}
	if !container.addMemorySample(&oomMemorySample, true) {
		return fmt.Errorf("adding OOM sample failed")
	}
	return nil
}

Are you seeing any errors in the vpa-recommender logs indicating that the OOMKill event isn't added correctly?
Also interesting: is the OOMKill done by the kubelet or the operating system?
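
If it helps, here is one way to check which of the two happened (a sketch using client-go, assuming kubeconfig access; the pod name and namespace are taken from the logs above): a kernel/cgroup OOM kill shows up as an OOMKilled termination reason on the container, while a kubelet memory-pressure eviction shows up as an Evicted event on the Pod.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical pod coordinates taken from the updater log above.
	const namespace, podName = "test", "test-pod1-79dcf8655-r7sl9"

	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pod, err := clientset.CoreV1().Pods(namespace).Get(context.TODO(), podName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Kernel/cgroup OOM kill: the container's last termination state reports OOMKilled.
	for _, cs := range pod.Status.ContainerStatuses {
		if t := cs.LastTerminationState.Terminated; t != nil && t.Reason == "OOMKilled" {
			fmt.Printf("container %s was OOMKilled by the OS at %s\n", cs.Name, t.FinishedAt)
		}
	}

	// Kubelet memory-pressure eviction instead shows up as an Evicted event on the pod.
	events, err := clientset.CoreV1().Events(namespace).List(context.TODO(), metav1.ListOptions{
		FieldSelector: fmt.Sprintf("involvedObject.name=%s,reason=Evicted", podName),
	})
	if err != nil {
		panic(err)
	}
	for _, e := range events.Items {
		fmt.Printf("eviction event: %s\n", e.Message)
	}
}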
