Alerting: Do not store series values from past evaluations in state manager for no reason #87525

alexweav · 2024-05-08T18:21:03Z

What is this feature?

States in the state manager store the results of several prior evaluations. This includes values, which is the measured dataframe results for that series:

grafana/pkg/services/ngalert/state/manager.go

Lines 361 to 368 in 6e4d35e

    
           currentState.Results = append(currentState.Results, Evaluation{ 
        
           	EvaluationTime:  result.EvaluatedAt, 
        
           	EvaluationState: result.State, 
        
           	Values:          NewEvaluationValues(result.Values), 
        
           	Condition:       alertRule.Condition, 
        
           }) 
        
           currentState.LastEvaluationString = result.EvaluationString 
        
           currentState.TrimResults(alertRule)

But, the only code that ever reads this field, only looks at the most recent 1 evaluation. Any values in that slice beyond the last one are stored for no reason:

grafana/pkg/services/ngalert/state/state.go

Lines 467 to 471 in 6e4d35e

    
           if len(a.Results) <= 0 { 
        
           	return nil 
        
           } 
        
           lastResult := a.Results[len(a.Results)-1]

The state manager is storing every measured value for every series in every query, including intermediate dataframes. Basically, all queried dataframes for all rules, for multiplpe evaluations in the past, needlessly.

The length of the history is based on For - and can be raised arbitrarily high with a long for. Even for rules with a for of 0, the length is hardcoded to 10 evaluations. For a typical rule with 3 query nodes, this results in 30 frames worth of series kept in memory (spread across states) when we only use 3 of them.

By creating a long For with many dimensions, you can write a rule that eventually consumes arbitrary memory and OOMs grafana provided the process does not restart.

This is a truly incredible amount of data in some cases on instances even with simpler alerting usage it can quickly dominate the entire memory space of unified alerting.

Why do we need this feature?

Likely, most memory consumption of unified alerting across the board is occupied by this throwaway data.

This PR simply stores the most recent single evaluation, rather than a running slice of them. The result is transparent to users, but we stop hanging on to data that we won't use.

Which issue(s) does this PR fix?:
n/a

Special notes for your reviewer:

Please check that:

It works as expected from a user's perspective.
If this is a pre-GA feature, it is behind a feature toggle.
The docs are updated, and if this is a notable improvement, it's added to our What's New doc.

alexweav · 2024-05-08T19:34:44Z

/deploy-to-hg

ephemeral-instances-bot · 2024-05-08T19:36:13Z

Preparing your instance. A comment containing your instance's url will be added to this PR when the instance is ready.
Your instance will be ready in ~10 minutes. Follow the workflow progress
Slack channel: #proj-ephemeral-hg-instances
Building instance with alexweav/do-not-store-old-results oss branch and main enterprise branch. How to choose a branch

ephemeral-instances-bot · 2024-05-08T19:49:52Z

Your instance can be accessed at: https://ephemeral1511182187525alexwea.grafana-dev.net
The instance is not using the CDN assets.
How to access / How to update instance config / How to build a specific branch

yuri-tceretian

Nice improvement. It's long overdue! LGTM

JacobsonMT

LGTM 🚀

…anager for no reason (#87525) Do not store previous execution results on states (cherry picked from commit a6a9ab4)

…in state manager for no reason (#87845) Alerting: Do not store series values from past evaluations in state manager for no reason (#87525) Do not store previous execution results on states (cherry picked from commit a6a9ab4) Co-authored-by: Alexander Weaver <[email protected]>

alexweav added type/bug area/alerting Grafana Alerting type/performance area/backend labels May 8, 2024

alexweav added this to the 11.1.x milestone May 8, 2024

alexweav force-pushed the alexweav/do-not-store-old-results branch from c47b011 to ad83b8c Compare May 8, 2024 18:30

alexweav changed the title ~~Alerting: Do not store stale intermediate results in state manager~~ Alerting: Do not store stale values from past evaluations in state manager May 8, 2024

Do not store previous execution results on states

142ac24

alexweav force-pushed the alexweav/do-not-store-old-results branch from ad83b8c to 142ac24 Compare May 8, 2024 19:07

grafana deleted a comment from ephemeral-instances-bot bot May 8, 2024

grafana deleted a comment from PoorlyDefinedBehaviour May 8, 2024

grafana deleted a comment from ephemeral-instances-bot bot May 8, 2024

grafana deleted a comment from PoorlyDefinedBehaviour May 8, 2024

grafana deleted a comment from ephemeral-instances-bot bot May 8, 2024

alexweav changed the title ~~Alerting: Do not store stale values from past evaluations in state manager~~ Alerting: Do not store series values from past evaluations in state manager for no reason May 8, 2024

alexweav added the add to changelog label May 9, 2024

alexweav marked this pull request as ready for review May 9, 2024 13:47

alexweav requested review from a team and rwwiv and removed request for a team May 9, 2024 13:47

alexweav requested review from JacobsonMT, yuri-tceretian and grobinson-grafana May 9, 2024 13:47

yuri-tceretian approved these changes May 9, 2024

View reviewed changes

JacobsonMT approved these changes May 9, 2024

View reviewed changes

alexweav merged commit a6a9ab4 into main May 9, 2024
19 checks passed

alexweav deleted the alexweav/do-not-store-old-results branch May 9, 2024 20:51

alexweav added the backport v11.0.x Mark PR for automatic backport to v11.0.x label May 14, 2024

grafana-delivery-bot bot pushed a commit that referenced this pull request May 14, 2024

Alerting: Do not store series values from past evaluations in state m…

8d0e357

…anager for no reason (#87525) Do not store previous execution results on states (cherry picked from commit a6a9ab4)

grafana-delivery-bot bot mentioned this pull request May 14, 2024

[v11.0.x] Alerting: Do not store series values from past evaluations in state manager for no reason #87845

Merged

3 tasks

kevinwcyu modified the milestones: 11.1.x, 11.1.0 Jun 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alerting: Do not store series values from past evaluations in state manager for no reason #87525

Alerting: Do not store series values from past evaluations in state manager for no reason #87525

alexweav commented May 8, 2024 •

edited

Loading

alexweav commented May 8, 2024

ephemeral-instances-bot bot commented May 8, 2024

ephemeral-instances-bot bot commented May 8, 2024

yuri-tceretian left a comment

JacobsonMT left a comment

	currentState.Results = append(currentState.Results, Evaluation{
	EvaluationTime: result.EvaluatedAt,
	EvaluationState: result.State,
	Values: NewEvaluationValues(result.Values),
	Condition: alertRule.Condition,
	})
	currentState.LastEvaluationString = result.EvaluationString
	currentState.TrimResults(alertRule)

	if len(a.Results) <= 0 {
	return nil
	}

	lastResult := a.Results[len(a.Results)-1]

Alerting: Do not store series values from past evaluations in state manager for no reason #87525

Alerting: Do not store series values from past evaluations in state manager for no reason #87525

Conversation

alexweav commented May 8, 2024 • edited Loading

alexweav commented May 8, 2024

ephemeral-instances-bot bot commented May 8, 2024

ephemeral-instances-bot bot commented May 8, 2024

yuri-tceretian left a comment

Choose a reason for hiding this comment

JacobsonMT left a comment

Choose a reason for hiding this comment

alexweav commented May 8, 2024 •

edited

Loading