ipam; introduce garbage collection logic #532
Conversation
I don't really like relying on the src and dst prefixes, as nothing in the API says anything about their usage.
I guess if a node or a trench had a name ending in -src or -dst, problems could happen.
pkg/ipam/storage/sqlite/sqlite.go (outdated)

```go
}

// Log the fetched entries
for _, entry := range deleteEntries {
```
Why log in this loop and not in the next one?
```go
// Fetch and delete -src/-dst records whose updatedAt timestamp is considered
// expired according to the threshold.
func (sqlis *SQLiteIPAMStorage) garbageCollector(ctx context.Context, threshold time.Duration) error {
```
Both oneTimeGarbageCollector and garbageCollector might be the same function except for the SQL query.
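A minimal sketch of the refactor this suggests, assuming the storage wraps a gorm DB in a `db` field and a hypothetical `Prefix` model (neither confirmed by the diff): both collectors delegate to one helper and differ only in the WHERE clause.

```go
// Hypothetical sketch, not the actual code: a shared helper parameterized
// by the condition, with both collectors reduced to thin wrappers.
func (sqlis *SQLiteIPAMStorage) collectGarbage(ctx context.Context, query string, args ...interface{}) error {
	return sqlis.db.WithContext(ctx).Where(query, args...).Delete(&Prefix{}).Error
}

func (sqlis *SQLiteIPAMStorage) oneTimeGarbageCollector(ctx context.Context) error {
	// Old-model records identified by their name suffix.
	return sqlis.collectGarbage(ctx, "name LIKE '%-src' OR name LIKE '%-dst'")
}

func (sqlis *SQLiteIPAMStorage) garbageCollector(ctx context.Context, threshold time.Duration) error {
	// Records whose updatedAt timestamp is considered expired.
	return sqlis.collectGarbage(ctx, "updated_at < ?", time.Now().Add(-threshold))
}
```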
I do agree that the current approach is somewhat ugly and can be problematic. Btw, in the case of trench names there fortunately can be no such accidental match, since the ipam server already modifies their names based on the IP family type (adding either -0 or -1 to the trench name). Otherwise, for workers and conduits it would definitely be a problem if they ended with either -src or -dst, as you pointed out. Some ideas on how to move forward:
- Change IP allocation for other "IP resource types" such as trenches/conduits etc. to be recurring, so that updates could be utilized. (Suffix matching could then be completely ignored, and GC could be more generic.)
- Add a new table that would indicate whether an entry is "expirable", and have GC only consider those. In that case GC would not be able to clean up leaked entries inserted according to the old model.
- Trenches have no parents. Therefore, use this knowledge to identify trenches, then fetch entries with trench parents, thus getting the conduits. Then find the conduit workers, as those have conduits as parents. After that we should have a list of trenches, conduits and conduit-workers, which GC could use to ignore records that happen to match anything in our list.
- Rely on network prefix length. Ugly, because the storage doesn't know anything about that, but it would be an easy check for the GC to only remove entries related to /128 or /32 prefixes.
Updated the db model with UpdatedAt and Expirable fields. The former is used to determine whether a record associated with an NSM connection is considered leftover and thus subject to garbage collection based clean-up, while Expirable limits the scope of the GC mechanism. In order to limit API changes, the value of the Expirable field is signalled via the context. Also, to avoid upgrade issues, the IPAM gRPC proto hasn't been modified. A custom migrate function makes sure to update any old connection records created according to the old model, so that the same GC logic can apply to them.
Description
Introduced garbage collection logic in IPAM to clean up IP records that have not been updated for a long time and are hence considered no longer in use by any NSM connection.
For this, the database model has been extended with UpdatedAt (timestamp) and Expirable (*bool) fields. UpdatedAt is used by the newly introduced garbage collector logic, along with a threshold time, to determine whether a record can be removed, while Expirable limits the scope of database records the garbage collection mechanism applies to. The check runs periodically. Both the threshold and the periodicity interval are configurable through environment variables (IPAM_GARBAGE_COLLECTION_THRESHOLD and IPAM_GARBAGE_COLLECTION_INTERVAL).
As part of the changes, node allocations now update existing database records to keep UpdatedAt values up to date.
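As a rough illustration, the extended gorm model could look like the sketch below. The `Prefix` type name and the surrounding fields are assumptions based on this description, not the actual schema; only UpdatedAt and Expirable come from the PR. Note that gorm conventionally auto-refreshes a field named UpdatedAt on create and save, which would match the re-allocation behaviour described above.

```go
// Hypothetical model sketch; only the two new fields are from the PR text.
type Prefix struct {
	Id        string `gorm:"primaryKey"` // assumed key layout
	Name      string
	ParentID  string
	UpdatedAt time.Time // auto-refreshed by gorm on save; compared against the GC threshold
	Expirable *bool     `gorm:"default:false"` // nil/false entries are outside the GC's scope
}
```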
The logic takes only records that belong to an NSM connection into consideration, by filtering for records whose Expirable field is true, denoting NSM connection related entries. (For other allocations the IPAM is not consulted periodically.)
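A minimal sketch of that filter, again assuming a gorm-backed store and the hypothetical `Prefix` model above:

```go
// Hypothetical gorm filter limiting GC to NSM connection related records;
// cutoff derives from the configured threshold.
cutoff := time.Now().Add(-threshold)
err := db.WithContext(ctx).
	Where("expirable = ? AND updated_at < ?", true, cutoff).
	Delete(&Prefix{}).Error
```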
The model update re-uses the old database table, extending it with UpdatedAt and Expirable columns. In case of rollback, the excess columns are simply ignored by the "framework" without causing any harm.
When moving from a version using the old model to the new model, automigrate won't set UpdatedAt values for existing records. However, a default value of false is configured for Expirable, which is set for all records after automigrate. A custom migrate function takes care of updating existing connection records, setting their Expirable and UpdatedAt values and thereby allowing them to be subject to the garbage collection mechanism.
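A hedged sketch of such a custom migrate step, assuming old connection records can still be recognized by their -src/-dst name suffixes (as discussed in the review thread); the function and column names are illustrative:

```go
// Hypothetical post-AutoMigrate step: old-model connection records get
// Expirable set and a fresh UpdatedAt so the regular GC logic applies to
// them too. Updates via a map deliberately sets updated_at explicitly.
func migrateOldConnectionRecords(db *gorm.DB) error {
	return db.Model(&Prefix{}).
		Where("name LIKE '%-src' OR name LIKE '%-dst'").
		Updates(map[string]interface{}{
			"expirable":  true,
			"updated_at": time.Now(),
		}).Error
}
```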
In order to limit API changes, the value of the Expirable field is signalled via the context. Also, to avoid upgrade issues, the IPAM gRPC proto hasn't been extended (which would have let IPAM users choose whether their allocation can expire).
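A minimal sketch of that context-based signalling; the key type and helper names are hypothetical, not taken from the PR:

```go
// expirableKey is a private context key: the serving layer would set it
// for NSM connection allocations, and the storage layer would read it
// when creating or updating a record.
type expirableKey struct{}

func WithExpirable(ctx context.Context, expirable bool) context.Context {
	return context.WithValue(ctx, expirableKey{}, expirable)
}

func IsExpirable(ctx context.Context) bool {
	v, ok := ctx.Value(expirableKey{}).(bool)
	return ok && v
}
```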
Garbage collection (enabled by default) can be disabled through the environment variable IPAM_GARBAGE_COLLECTION_ENABLED.
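Putting the configuration together, the wiring could look roughly like the sketch below. Only the three IPAM_GARBAGE_COLLECTION_* variable names come from this PR; the defaults, the accepted "false" value, and the helper names are assumptions.

```go
// Hypothetical startup wiring for the periodic GC run.
func startGarbageCollector(ctx context.Context, store *SQLiteIPAMStorage) {
	// Exact accepted values for disabling are an assumption.
	if os.Getenv("IPAM_GARBAGE_COLLECTION_ENABLED") == "false" {
		return
	}
	interval := durationFromEnv("IPAM_GARBAGE_COLLECTION_INTERVAL", 2*time.Hour)    // hypothetical default
	threshold := durationFromEnv("IPAM_GARBAGE_COLLECTION_THRESHOLD", 40*time.Hour) // hypothetical default

	go func() {
		// The first run only happens after a full interval has elapsed,
		// matching the note below about first invocation.
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				_ = store.garbageCollector(ctx, threshold)
			}
		}
	}()
}

func durationFromEnv(key string, fallback time.Duration) time.Duration {
	if v, ok := os.LookupEnv(key); ok {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return fallback
}
```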
Notes:
- IPAM_GARBAGE_COLLECTION_INTERVAL also determines when the garbage collector is first invoked. Setting it to a low value must therefore be avoided, so that existing NSM connections can get their IP allocations refreshed as part of the recurring NSM connection refresh, which is tied to NSM's MaxTokenLifetime (e.g. during IPAM upgrade/failover).
- IPAM_GARBAGE_COLLECTION_THRESHOLD must be higher than NSM's MaxTokenLifetime.
Issue link
#527
Checklist