Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ipam; introduce garbage collection logic #532

Merged
merged 2 commits into from
Aug 9, 2024
Merged

ipam; introduce garbage collection logic #532

merged 2 commits into from
Aug 9, 2024

Conversation

zolug
Copy link
Collaborator

@zolug zolug commented Jun 26, 2024

Description

Introduced garbage collection logic in IPAM to clean up IP records not updated for a long time and hence considered not in use anymore by any NSM connections.

For this the database model has been extended with UpdatedAt (timestamp) and Expirable (*bool) fields. UpdatedAt is used by newly introduced garbage collector logic along with a threshold time to determine if a record can be removed. While Expirable limits the scope of database records the garbage collector mechanism applies to. The check is run periodically. Both the threshold and periodicity interval are configurable through environment variables (IPAM_GARBAGE_COLLECTION_THRESHOLD and IPAM_GARBAGE_COLLECTION_INTERVAL).

As part of the changes node allocations do update existing database records to keep UpdatedAt values up to date.

The logic takes only records that belong to an NSM connection into consideration. This is achieved by means of applying a filter matching records whose Expirable field is true denoting NSM connection related entries. (For other allocations the IPAM is not consulted periodically.)

The model update re-uses the old database table extending it with UpdatedAt and Expirable columns. In case of rollback the excess columns are simply ignored by the "framework" without causing any harm.
When moving from a version using the old model to the new model, then automigrate won't set UpdatedAt values for existing records. However, a default false value is configured for Expirable which will be set for all records after automigrate. A custom migrate function takes care of the update of existing connection records setting their Expirable and UpdatedAt values, allowing them to be subject to the garbage collection mechanism.

In order to limit API changes the value of Expirable field is signalled via the context. Also, to avoid upgrade issues IPAM
gRPC proto haven't been extended (to allow IPAM users to chose if their allocation could expire).

The garbage collection (by default enabled) can be disabled through environment variable IPAM_GARBAGE_COLLECTION_ENABLED.

Notes:
IPAM_GARBAGE_COLLECTION_INTERVAL also determines when the garbage collector will be first invoked. Thus, setting it to a low value must be avoided to allow existing NSM connections to get their IP allocations refreshed as part of the recurring NSM connection refresh which is tied to NSM's MaxTokenLifetime (e.g. during IPAM upgrade/failover).
IPAM_GARBAGE_COLLECTION_THRESHOLD must be higher than NSM's MaxTokenLifetime.

Issue link

#527

Checklist

  • Purpose
    • Bug fix
    • New functionality
    • Documentation
    • Refactoring
    • CI
  • Test
    • Unit test
    • E2E Test
    • Tested manually
  • Introduce a breaking change
    • Yes (description required)
    • No

@zolug zolug added kind/enhancement New feature or request component/IPAM labels Jun 26, 2024
@zolug zolug self-assigned this Jun 26, 2024
@zolug zolug requested a review from LionelJouin June 26, 2024 11:39
@zolug zolug added the size/M label Jun 26, 2024
Copy link
Member

@LionelJouin LionelJouin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like relying on the src and dst prefixes as nothing in the API is saying about the usage of those.
I guess if a node or a trench have their name finishing by -src or -dst, problem could happen.

}

// Log the fetched entries
for _, entry := range deleteEntries {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why logging in this loop and not in the next one?


// Fetch and delete -src/-dst records whose updatedAt timestamp is considered
// expired according to the threshold.
func (sqlis *SQLiteIPAMStorage) garbageCollector(ctx context.Context, threshold time.Duration) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both oneTimeGarbageCollector and garbageCollector might be the same function except the sql query

@zolug
Copy link
Collaborator Author

zolug commented Jul 19, 2024

I don't really like relying on the src and dst prefixes as nothing in the API is saying about the usage of those. I guess if a node or a trench have their name finishing by -src or -dst, problem could happen.

I do agree that the current approach is somewhat ugly and can be problematic.
My goal was to come up with a match criteria that does not require altering any of the existing fields, and could work also for entries inserted by older models. But probably the risk is way too high of removing other records.

Btw in case of trenches names, fortunately there could be no such problem as accidentally matching them, since ipam server already modifies their name based on the IP family type (adding either -0 or -1 to the trench name). Otherwise, for workers and conduits it would be definitely a problem if they would end with either -src or -dst as you pointed out.

some ideas how to move forward:

-Change IP allocation for other "IP resource types" such as trenches/conduits etc. to be recurring and hence updates could be utilized. (Therefore postfix matching could be completely ignored, and GC could be more generic.)
This might introduce a lot of changes and might be tricky to implement (haven't really looked into it). I guess it would be the best, if these updates could be done as part of ipamServer Allocate() (existence of all these are checked anyways here, but an idle system without NSM connections could be a problem).

-Add a new table that would indicate if an entry is "expirable", and GC would only consider those. In which case GC would not be able to clean-up leaked entries inserted according to old model.
Passing the information about an entry if its expirable could involve a lot of changes (inlcluding the API). Unless for example that info would be stored in the context (which normally is not a recommended usage of context yet happens often e.g. in NSM code).

-Trenches have no parents. Thefore use this knowledge to identify threnches, then fetch entries with trench parents thus getting the conduits. Then, find the conduit workers as those have conduits as parents. After that, we should have a list of trenches,conduits,conduit-workers, which could be used by GC to ignore records happaned to match anything in our list.
This obsiously exploits the hierarchy of entries, but it could be used to remove old leaked records created based on the old model.
Combining this with the "expirable" field would mean, that only the one-time GC should do all this "lookups" once.

-Rely on network prefix length. Ugly, because the storage doesn't know anything about such. But it would be an easy check for the GC to only remove entries related to /128 or /32 prefixes.
(Proxy bridges also have /32 and /128 but their name is hardcoded to bridge...)

zolug added 2 commits August 6, 2024 15:26
Updated db model with UpdatedAt and Expirable fields. Former one
is used to determine if a record associated with a NSM connection
is considered leftover, and thus is subject to garbage collection
based clean up. While Expirable limits the scope of GC mechanism.

In order to limit API changes the value of Expirable field is
signalled via the context. Also, to avoid upgrade issues IPAM
gRPC proto haven't been modified.

A custom migrate function makes sure to update any old connection
created according to the old model. So that the same GC logic can
apply to them.
@zolug zolug merged commit f192489 into master Aug 9, 2024
13 checks passed
@zolug zolug deleted the ipam-gc branch August 13, 2024 13:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants