KEP-3314: CSI Changed Block Tracking

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes a new CSI API that can be used to identify the list of changed blocks between pairs of CSI volume snapshots. CSI drivers can implement this API to expose their changed block tracking (CBT) services to enable efficient and reliable differential backup of data stored in CSI volumes.

Kubernetes backup applications directly use this API to stream changed block information, bypassing and posing no additional load on the Kubernetes API server. The mechanism that enables this direct access utilizes a proxy service sidecar to shield the CSI drivers from managing the individual Kubernetes clients.

Motivation

Changed block tracking (CBT) techniques have been used by commercial backup systems to efficiently back up large amounts of data in block volumes. They identify block-level changes between an arbitrary pair of snapshots of the same block volume, and selectively back up what has changed between the two checkpoints. This type of differential backup is far more efficient than backing up the entire volume.

This KEP proposes a design to extend the Kubernetes CSI framework to utilize these CBT features to bring efficient, cloud-native data protection to Kubernetes users.

Goals

  • Provide a secure, idiomatic CSI API to efficiently identify the allocated blocks of a CSI volume snapshot, and the changed blocks between an arbitrary pair of CSI volume snapshots of the same block volume.
  • Relay large amounts of snapshot metadata from the storage provider without overloading the Kubernetes API server.
  • This API is an optional component of the CSI framework.

Non-Goals

  • Specify how data is written to the block volume in the first place.

    The volume could be attached to a pod with either Block or Filesystem volume modes.

  • Provide an API to retrieve the data blocks of a snapshot.

    It is assumed that a snapshot's data blocks can be retrieved by creating a PersistentVolume for the snapshot, launching a pod with this volume attached in Block volume mode, and then reading the individual blocks from the raw block device.

  • Support of file changed list tracking for network file shares is not addressed by this proposal.

Proposal

The proposal extends the CSI specification with a new, optional, CSI SnapshotMetadata gRPC service, that is used by Kubernetes to retrieve metadata on the allocated blocks of a single snapshot, or the changed blocks between a pair of snapshots of the same block volume.

A Kubernetes SnapshotMetadata gRPC service is an API that is used by a Kubernetes backup application client to retrieve snapshot metadata. This API is implemented by the community-provided external-snapshot-metadata sidecar, which must be deployed by a CSI driver. A Kubernetes backup application will retrieve snapshot metadata through a TLS gRPC connection to such a service. This direct connection results in a minimal load on the Kubernetes API server, unrelated to the amount of metadata transferred or the sizes of the volumes and snapshots involved.

The external-snapshot-metadata sidecar communicates over a private UNIX domain socket with the CSI driver's implementation of the CSI SnapshotMetadata gRPC service. The CSI driver service only handles the retrieval of the metadata requested; the sidecar is responsible for validating the Kubernetes authentication token, authorizing the backup application, validating the parameters of the RPC calls and fetching the provisioner secrets needed to complete the request. The sidecar forwards the RPC call to the CSI driver service over the UNIX domain socket, after translating Kubernetes object names into SP object names, and re-streams the results back to its client.

A CSI driver advertises the existence of the community sidecar's Kubernetes SnapshotMetadata gRPC service to Kubernetes backup applications by creating a SnapshotMetadataService CR that contains the service's TCP endpoint address, CA certificate and an audience string needed for token authentication. The CSI driver name is specified in a metadata label in this CR, so that a backup application can efficiently search for the Kubernetes SnapshotMetadata gRPC service of the provisioner of the VolumeSnapshots to be backed up.

Before accessing a Kubernetes SnapshotMetadata gRPC service a backup application must first obtain an authentication token using the Kubernetes TokenRequest API with the service's audience string. It should establish trust with the specified CA for use in gRPC calls and then directly make TLS gRPC calls to the Kubernetes SnapshotMetadata gRPC service's TCP endpoint. The audience-scoped authentication token must be passed in the security_token field of each RPC request message; it will be used to authorize the backup application's use of the service. Every RPC returns a gRPC stream through which the metadata can be recovered.

The process of accessing snapshot metadata via the sidecar is illustrated in the figure below. Additional information is available in the Design Details section.

[Figure: metadata retrieval flow]

User Stories

Full snapshot backup

A backup application needs to perform a full backup on volumes of a specific Kubernetes application.

For each volume in the application:

  1. The backup application creates a VolumeSnapshot of a PVC that needs to be backed up.
  2. The backup application queries the changed block tracking (CBT) service to identify all the allocated data blocks in the snapshot. The CBT service returns the list of allocated blocks.
  3. Using the VolumeSnapshot as the source, the backup application creates a new PVC and mounts it with Block VolumeMode in a pod.
  4. The backup application uses the CBT metadata to identify the data that needs to be backed up and reads these blocks from the mounted PVC in the pod.

Incremental snapshot backup

A backup application needs to perform an incremental backup on volumes of a specific Kubernetes application. The backup application knows the identifiers of the VolumeSnapshots it had backed up previously.

For each volume in the application:

  1. The backup application creates a VolumeSnapshot of a PVC that needs to be backed up incrementally.
  2. The backup application queries the changed block tracking (CBT) service to identify the changes between the latest snapshot and the one it had previously backed up. The CBT service returns the list of blocks changed between the snapshots.
  3. Using the latest VolumeSnapshot as the source, the backup application creates a new PVC and mounts it with Block VolumeMode in a pod.
  4. The backup application uses the CBT metadata to identify only the changed data to back up and reads these blocks from the mounted PVC in the pod.
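The final step of both user stories can be sketched as a simple read loop driven by the block metadata. This is a minimal illustration, not part of the proposal: the `read_blocks` helper is a hypothetical name, and an in-memory buffer stands in for the raw block device of the PVC created from the snapshot.

```python
import io
from typing import BinaryIO, Iterable, Iterator, Tuple

# (byte_offset, size_bytes) pairs, mirroring the BlockMetadata message fields.
Block = Tuple[int, int]


def read_blocks(device: BinaryIO, blocks: Iterable[Block]) -> Iterator[Tuple[int, bytes]]:
    """Yield (byte_offset, data) for each block reported by the metadata service."""
    for byte_offset, size_bytes in blocks:
        device.seek(byte_offset)
        data = device.read(size_bytes)
        if len(data) != size_bytes:
            raise IOError(f"short read at offset {byte_offset}")
        yield byte_offset, data


# Stand-in for the raw block device exposed to the backup pod (64 bytes).
device = io.BytesIO(bytes(range(16)) * 4)
# Hypothetical metadata, as it might be returned for a snapshot.
changed = [(0, 4), (32, 8)]
backup = dict(read_blocks(device, changed))
```

In a real backup pod the `device` object would be the block device file opened in binary read mode, and each yielded block would be written to the backup target instead of collected in a dictionary.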

Notes/Constraints/Caveats

  • This proposal requires a backup application to directly connect to a Kubernetes SnapshotMetadata gRPC service offered by the community-provided external-snapshot-metadata sidecar deployed by a CSI driver. This is necessary to avoid placing a load on the Kubernetes API server that would be proportional to the number of allocated blocks in a volume snapshot.

  • Each Kubernetes SnapshotMetadata gRPC service only operates on volume snapshots provisioned by the CSI driver that deploys the related sidecar. A backup application is responsible for locating the service for a CSI driver by searching for its SnapshotMetadataService CR. This search can fail as this service is optional.

  • A backup application must obtain a Kubernetes audience-scoped authentication token in order to use a Kubernetes SnapshotMetadata gRPC service. This requires that the backup application be authorized to use the Kubernetes TokenRequest API. The token can be obtained directly by a call to this API, or indirectly via a projected volume in the Pod used to access the Kubernetes SnapshotMetadata API.

  • The Kubernetes audience-scoped authentication token must be provided as the security_token field in each gRPC request message made by a backup application.

  • The CSI SnapshotMetadata gRPC service RPC calls allow an application to restart an interrupted stream from where it previously failed by reissuing the RPC call with a starting byte offset. The same functionality is available through the Kubernetes SnapshotMetadata gRPC service.

  • The CSI SnapshotMetadata gRPC service permits metadata to be returned in either an extent-based or a block-based format, at the discretion of the CSI driver. A portable backup application is expected to handle both formats. This also applies to the Kubernetes SnapshotMetadata gRPC service.

  • The CSI SnapshotMetadata gRPC service must be capable of serving metadata on a VolumeSnapshot concurrently with the backup application's use of a PersistentVolume created on that same VolumeSnapshot. This is because a backup application would likely mount the PersistentVolume with Block VolumeMode in a Pod in order to read and archive the raw snapshot data blocks, and this read/archive loop will be driven by the stream of snapshot block metadata.

  • The proposal does not specify how its security model is to be implemented. It is expected that the RBAC policies used by backup applications and the existing CSI drivers will be extended for this purpose.
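The note above on restarting an interrupted stream can be illustrated with a small client-side sketch. It assumes the metadata returned for a given snapshot is deterministic across calls; `fetch_all`, `StreamInterrupted` and `fake_call` are hypothetical names, and the fake service stands in for a real gRPC stream.

```python
from typing import Callable, Iterator, List, Tuple

Block = Tuple[int, int]  # (byte_offset, size_bytes)


class StreamInterrupted(Exception):
    """Stand-in for a transport error raised while consuming the gRPC stream."""


def fetch_all(call: Callable[[int], Iterator[List[Block]]], max_attempts: int = 3) -> List[Block]:
    """Collect all blocks, reissuing the call from the next unseen byte on failure."""
    blocks: List[Block] = []
    starting_offset = 0
    for _ in range(max_attempts):
        try:
            for batch in call(starting_offset):
                for byte_offset, size_bytes in batch:
                    if byte_offset < starting_offset:
                        continue  # re-sent because the service rounded the offset down
                    blocks.append((byte_offset, size_bytes))
                    starting_offset = byte_offset + size_bytes
            return blocks
        except StreamInterrupted:
            continue  # retry from the last fully received block
    raise StreamInterrupted("giving up after repeated interruptions")


# Fake service: fails once while producing the second block.
ALL_BLOCKS = [(0, 4096), (8192, 4096), (16384, 4096)]
failures = {"left": 1}


def fake_call(starting_offset: int) -> Iterator[List[Block]]:
    for block in ALL_BLOCKS:
        if block[0] < starting_offset:
            continue
        if block[0] == 8192 and failures["left"]:
            failures["left"] -= 1
            raise StreamInterrupted()
        yield [block]


result = fetch_all(fake_call)
```

The resume point is the end of the last block received in full, so no block is recorded twice even if the service starts slightly before the requested offset.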

Risks and Mitigations

A review by SIG-Auth (July 19, 2023) recommended the use of the TokenRequest and TokenReview APIs to make authentication and authorization checks possible between authorized Kubernetes principals.

The following risks are identified:

The risks are mitigated as follows:

  • The possible exposure of snapshot metadata by use of a network API is addressed by encrypting the direct gRPC call made by the backup application client and authenticating both parties. The gRPC client is required to first establish trust with the service's CA. Although the direct TLS gRPC call itself does not perform mutual TLS authentication, an audience-scoped authentication token must be passed as a parameter in each RPC call, which provides the mechanism for the service to both authenticate and authorize the client.

    The audience-scoped authentication token is obtained from the Kubernetes TokenRequest API and is validated with the Kubernetes TokenReview API. Its scope is narrowed to just the target service (the "audience") by specifying, during token creation, an audience string defined by and unique to the service. An authentication token has an expiry time, so it does not remain valid indefinitely; additionally, it can be bound to the Pod used by the backup application to access the service, to further constrain its effective lifetime.

  • Access to a Kubernetes SnapshotMetadata gRPC service, to the VolumeSnapshots referenced through the service, and the ability to use the TokenRequest, TokenReview and SubjectAccessReview APIs are controlled by Kubernetes security policy.
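As an illustration of the audience scoping and Pod binding described above, the following sketch builds the body of a TokenRequest object as a plain dictionary. The helper name, the audience value, the expiry of 600 seconds and the Pod name and UID are assumptions chosen for the example; submitting the object to the ServiceAccount token subresource is left to whichever Kubernetes client the backup application uses.

```python
import json


def token_request_body(audience: str, expiration_seconds: int = 600,
                       pod_name: str = "", pod_uid: str = "") -> dict:
    """Build a TokenRequest body scoped to one audience, optionally bound to a Pod."""
    spec = {"audiences": [audience], "expirationSeconds": expiration_seconds}
    if pod_name:
        # Binding the token to a Pod invalidates it when that Pod is deleted.
        spec["boundObjectRef"] = {
            "apiVersion": "v1",
            "kind": "Pod",
            "name": pod_name,
            "uid": pod_uid,
        }
    return {
        "apiVersion": "authentication.k8s.io/v1",
        "kind": "TokenRequest",
        "spec": spec,
    }


# Hypothetical audience string, as would be read from a SnapshotMetadataService CR.
body = token_request_body("snapshot-metadata.example.com",
                          pod_name="backup-worker-0", pod_uid="1234-abcd")
print(json.dumps(body, indent=2))
```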

The proposal requires the existence of security policy to establish the access rights described below, and illustrated in the following figure:

[Figure: Permissions needed]

The proposal requires that Kubernetes security policy authorize access to:

  • The SnapshotMetadataService CR objects that advertise the existence of Kubernetes SnapshotMetadata gRPC services available in the cluster. These objects do not contain secret information so limiting access just controls the principals who obtain the service contact information. At the least, backup applications should be permitted to read these objects.

  • Backup applications must be granted permission to use the Kubernetes TokenRequest API in order to obtain the audience-scoped authentication tokens that are passed in each Kubernetes SnapshotMetadata gRPC service RPC call.

  • Backup applications must be granted access to view VolumeSnapshot objects in the target namespaces. Presumably they already have such permission if they were the ones initiating the creation of the VolumeSnapshot objects.

  • The CSI driver service account must be granted permission to use the Kubernetes TokenReview API in order for its Kubernetes SnapshotMetadata gRPC service to validate the authentication token.

  • The CSI driver must be granted permission to use the Kubernetes SubjectAccessReview API in order for its Kubernetes SnapshotMetadata gRPC service to validate that a security token authorizes access to VolumeSnapshot objects in a namespace.

  • The CSI driver presumably already has access rights to the VolumeSnapshot and VolumeSnapshotContent objects as they are within its purview. This is needed for its Kubernetes SnapshotMetadata gRPC service.

The proposal does not specify how such a security policy is to be configured.

Design Details

The CSI SnapshotMetadata Service API

In this section we use the terminology of the gRPC specification where the word service is assumed to be a gRPC service and not a Kubernetes service, while the word plugin is used to refer to the software component that implements the gRPC service.

The CSI specification will be extended with the addition of the following new, optional SnapshotMetadata gRPC service. The SP Snapshot Metadata plugin implements this service.

The service is defined as follows, and will be described in the sub-sections below. Refer to CSI PR 551 for the official specification.

service SnapshotMetadata {
  rpc GetMetadataAllocated(GetMetadataAllocatedRequest)
    returns (stream GetMetadataAllocatedResponse) {}
  rpc GetMetadataDelta(GetMetadataDeltaRequest)
    returns (stream GetMetadataDeltaResponse) {}
}

enum BlockMetadataType {
  UNKNOWN=0;
  FIXED_LENGTH=1;
  VARIABLE_LENGTH=2;
}

message BlockMetadata {
  int64 byte_offset = 1;
  int64 size_bytes = 2;
}

message GetMetadataAllocatedRequest {
  string snapshot_id = 1;
  int64 starting_offset = 2;
  int32 max_results = 3;
  map<string, string> secrets = 4;
}


message GetMetadataAllocatedResponse {
  BlockMetadataType block_metadata_type = 1;
  int64 volume_capacity_bytes = 2;
  repeated BlockMetadata block_metadata = 3;
}

message GetMetadataDeltaRequest {
  string base_snapshot_id = 1;
  string target_snapshot_id = 2;
  int64 starting_offset = 3;
  int32 max_results = 4;
  map<string, string> secrets = 5;
}

message GetMetadataDeltaResponse {
  BlockMetadataType block_metadata_type = 1;
  int64 volume_capacity_bytes = 2;
  repeated BlockMetadata block_metadata = 3;
}

Metadata Format

Block volume data ranges are specified by a sequence of (ByteOffset, Length) tuples, with the tuples in ascending order of ByteOffset and no overlap between adjacent tuples. There are two prevalent styles, extent-based and block-based, which differ in whether the Length field may vary between tuples in a sequence or is fixed across all of them. The SnapshotMetadata service permits either style at the discretion of the plugin, and a client of this service is required to handle both styles.

The BlockMetadataType enumeration specifies the style used: FIXED_LENGTH or VARIABLE_LENGTH. When the block-based style (FIXED_LENGTH) is used it is up to the SP plugin to define the block size.

An individual tuple is identified by the BlockMetadata message, and the sequence is defined collectively across the tuple lists returned in the RPC message stream. Note that the plugin must ensure that the style does not change mid-stream in any given RPC invocation.
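The ordering rules above can be checked on the client side. The sketch below is one possible validator, assuming the tuples of a stream have already been collected into a list; the function name is hypothetical and the numeric constants mirror the BlockMetadataType enumeration.

```python
from typing import List, Tuple

FIXED_LENGTH, VARIABLE_LENGTH = 1, 2  # values of the BlockMetadataType enum


def validate_sequence(blocks: List[Tuple[int, int]], block_metadata_type: int) -> None:
    """Raise ValueError if the (byte_offset, size_bytes) tuples break the format rules."""
    prev_end = 0
    for byte_offset, size_bytes in blocks:
        if size_bytes <= 0:
            raise ValueError(f"non-positive length at offset {byte_offset}")
        if byte_offset < prev_end:
            raise ValueError(f"tuple at offset {byte_offset} is out of order or overlaps")
        if block_metadata_type == FIXED_LENGTH and size_bytes != blocks[0][1]:
            raise ValueError("FIXED_LENGTH sequence contains differing lengths")
        prev_end = byte_offset + size_bytes


# Extent-based: lengths may differ.
validate_sequence([(0, 8192), (16384, 4096)], VARIABLE_LENGTH)
# Block-based: every tuple has the same length.
validate_sequence([(0, 4096), (4096, 4096), (16384, 4096)], FIXED_LENGTH)
```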

GetMetadataAllocated RPC

The GetMetadataAllocated RPC returns metadata on the allocated blocks of a snapshot - i.e. this identifies the data ranges that have valid data as they were the target of some previous write operation. Backup applications typically make an initial full backup of a volume followed by a series of incremental backups, and the size of the initial full backup can be reduced considerably if only the allocated blocks are saved.

The RPC's input arguments are specified by the GetMetadataAllocatedRequest message, and it returns a stream of GetMetadataAllocatedResponse messages. The fields of the GetMetadataAllocatedRequest message are defined as follows:

  • snapshot_id
    The identifier of a snapshot of the specified volume, in the nomenclature of the plugin.

  • starting_offset
    This specifies the 0-based starting byte position in the volume snapshot from which the result should be computed. It is intended to be used to continue a previously interrupted call. The plugin may round down this offset to the nearest alignment boundary based on the BlockMetadataType it will use.

  • max_results
    This is an optional field. If non-zero, it specifies the maximum length of the block_metadata list that the client wants to process in a given GetMetadataAllocatedResponse message. The plugin will determine an appropriate value if it is 0, and is always free to send fewer entries than the requested maximum.

  • secrets
    This is an optional field. It should contain the provisioner secrets associated with the volume snapshot, if any. In Kubernetes such data is specified with the keys csi.storage.k8s.io/snapshotter-secret-[name|namespace] in the VolumeSnapshotClass.Parameters field.
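The rounding of starting_offset mentioned above amounts to simple modular arithmetic. The following sketch assumes a block-based plugin with a block size of 4096 bytes, which is an illustrative value only; the helper name is hypothetical.

```python
def align_down(starting_offset: int, block_size: int) -> int:
    """Round starting_offset down to the nearest multiple of block_size."""
    return starting_offset - (starting_offset % block_size)


# A client asks to resume at byte 10000; a plugin using 4096-byte blocks
# may begin its results at the enclosing block boundary instead.
effective_offset = align_down(10000, 4096)
print(effective_offset)
```

A client should therefore tolerate a first tuple whose byte_offset is lower than the starting_offset it requested.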

The fields of the GetMetadataAllocatedResponse message are defined as follows:

  • block_metadata_type
    This specifies the metadata format as described in the Metadata Format section above.

  • volume_capacity_bytes
    The size of the underlying volume, in bytes.

  • block_metadata
    This is a list of BlockMetadata tuples as described in the Metadata Format section above. The caller may request a maximum length of this list in the max_results field of the GetMetadataAllocatedRequest message; otherwise the length is determined by the plugin.

Note that while the block_metadata_type and volume_capacity_bytes fields are repeated in each GetMetadataAllocatedResponse message by the nature of the syntax of the specification language, their values in a given RPC invocation must be constant. That is, a plugin is not free to modify these values mid-stream.
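A client can enforce the constancy requirement while consuming the stream. The sketch below uses a hypothetical `Response` dataclass in place of the generated gRPC message type and wraps any iterable of responses.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Response:
    """Stand-in for a GetMetadataAllocatedResponse message."""
    block_metadata_type: int
    volume_capacity_bytes: int
    block_metadata: List[Tuple[int, int]] = field(default_factory=list)


def checked(stream: Iterable[Response]) -> Iterator[Response]:
    """Pass responses through, failing if the per-stream constants change."""
    first = None
    for resp in stream:
        if first is None:
            first = resp
        elif (resp.block_metadata_type, resp.volume_capacity_bytes) != (
            first.block_metadata_type, first.volume_capacity_bytes
        ):
            raise ValueError("block_metadata_type or volume_capacity_bytes changed mid-stream")
        yield resp


good = [Response(1, 1 << 30, [(0, 4096)]), Response(1, 1 << 30, [(8192, 4096)])]
total_blocks = sum(len(r.block_metadata) for r in checked(good))
```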

GetMetadataAllocated Errors

If the plugin is unable to complete the GetMetadataAllocated call successfully it must return a non-OK gRPC code in the gRPC status.

The following conditions are well defined:

| Condition | gRPC Code | Description | Recovery Behavior |
|---|---|---|---|
| Missing or otherwise invalid argument | 3 INVALID_ARGUMENT | Indicates that a required argument field was not specified or an argument value is invalid. | The caller should correct the error and resubmit the call. |
| Invalid snapshot | 5 NOT_FOUND | Indicates that the snapshot specified was not found. | The caller should re-check that this object exists. |
| Invalid starting_offset | 11 OUT_OF_RANGE | The starting offset exceeds the volume size. | The caller should specify a starting_offset less than the volume's size. |
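A client might map the well-defined conditions to recovery actions along these lines. The numeric values are the standard gRPC status codes listed in the table; the function name and the fallback text are assumptions for illustration.

```python
# Numeric gRPC status codes for the well-defined conditions.
INVALID_ARGUMENT, NOT_FOUND, OUT_OF_RANGE = 3, 5, 11

RECOVERY = {
    INVALID_ARGUMENT: "correct the request arguments and resubmit the call",
    NOT_FOUND: "re-check that the snapshot exists",
    OUT_OF_RANGE: "use a starting_offset smaller than the volume size",
}


def recovery_hint(code: int) -> str:
    """Return the documented recovery behavior for a gRPC status code."""
    return RECOVERY.get(code, "condition not documented for this RPC; treat the call as failed")


print(recovery_hint(NOT_FOUND))
```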

GetMetadataDelta RPC

The GetMetadataDelta RPC returns the metadata on the blocks that have changed between a pair of snapshots from the same volume.

The RPC's input arguments are specified by the GetMetadataDeltaRequest message, and it returns a stream of GetMetadataDeltaResponse messages. The fields of the GetMetadataDeltaRequest message are defined as follows:

  • base_snapshot_id
    The identifier of a snapshot of the specified volume, in the nomenclature of the plugin.

  • target_snapshot_id
    The identifier of a second snapshot of the specified volume, in the nomenclature of the plugin. This snapshot should have been created after the base snapshot, and the RPC will return the changes made since the base snapshot was created.

  • starting_offset
    This specifies the 0-based starting byte position in the target snapshot from which the result should be computed. It is intended to be used to continue a previously interrupted call. The plugin may round down this offset to the nearest alignment boundary based on the BlockMetadataType it will use.

  • max_results
    This is an optional field. If non-zero, it specifies the maximum length of the block_metadata list that the client wants to process in a given GetMetadataDeltaResponse message. The plugin will determine an appropriate value if it is 0, and is always free to send fewer entries than the requested maximum.

  • secrets
    This is an optional field. It should contain the provisioner secrets associated with the volume snapshot, if any. In Kubernetes such data is specified with the keys csi.storage.k8s.io/snapshotter-secret-[name|namespace] in the VolumeSnapshotClass.Parameters field. This field should not be set by a Kubernetes backup application client.

The fields of the GetMetadataDeltaResponse message are defined as follows:

  • block_metadata_type
    This specifies the metadata format as described in the Metadata Format section above.

  • volume_capacity_bytes
    The size of the underlying volume, in bytes.

  • block_metadata
    This is a list of BlockMetadata tuples as described in the Metadata Format section above. The caller may request a maximum length of this list in the max_results field of the GetMetadataDeltaRequest message; otherwise the length is determined by the plugin.

Note that while the block_metadata_type and volume_capacity_bytes fields are repeated in each GetMetadataDeltaResponse message by the nature of the syntax of the specification language, their values in a given RPC invocation must be constant. That is, a plugin is not free to modify these values mid-stream.

GetMetadataDelta Errors

If the plugin is unable to complete the GetMetadataDelta call successfully it must return a non-OK gRPC code in the gRPC status.

The following conditions are well defined:

| Condition | gRPC Code | Description | Recovery Behavior |
|---|---|---|---|
| Missing or otherwise invalid argument | 3 INVALID_ARGUMENT | Indicates that a required argument field was not specified or an argument value is invalid. | The caller should correct the error and resubmit the call. |
| Invalid base_snapshot or target_snapshot | 5 NOT_FOUND | Indicates that the snapshots specified were not found. | The caller should re-check that these objects exist. |
| Invalid starting_offset | 11 OUT_OF_RANGE | The starting offset exceeds the volume size. | The caller should specify a starting_offset less than the volume's size. |

Kubernetes Components

The following Kubernetes resources and components are involved at runtime:

The Kubernetes SnapshotMetadata Service API

The proposal minimizes the use of the Kubernetes API server when retrieving snapshot metadata. Instead, it calls for a Kubernetes client to directly make a gRPC connection with a community provided external-snapshot-metadata sidecar which will proxy the calls to an SP provided service that implements the CSI SnapshotMetadata Service API.

The proposal could have called for the client to use the CSI SnapshotMetadata Service API itself when communicating with the sidecar, but there are a number of reasons to specify a different API, the Kubernetes SnapshotMetadata API. These reasons and any related modifications to the API are described below:

  • Introduction of a level of indirection between the CSI specification and the Kubernetes implementation to decouple the life-cycle of the external-snapshot-metadata sidecar from the life-cycle of the CSI specification.

  • The need to pass a Kubernetes audience-scoped authentication token to the sidecar. This is handled by introducing a security_token field in every request message. All Kubernetes SnapshotMetadata API calls will return the UNAUTHENTICATED gRPC error code if this token is incorrect or adequate authority has not been granted to the invoker.

  • Kubernetes Snapshots must be identified by namespace and name, while the CSI specification uses a single identifier. This difference is handled by the addition of a namespace parameter in every Kubernetes SnapshotMetadata API message that identifies a Snapshot, and the replacement of snapshot "id" parameters with "name" parameters.

  • The CSI specification requires provisioner secrets be passed to the SP service if configured for the CSI driver. These secrets are out of the purview of a backup application and are fetched and inserted by the sidecar. As such, all secrets parameters are removed from the Kubernetes SnapshotMetadata API messages.

The proposed Kubernetes SnapshotMetadata API is very similar to the CSI SnapshotMetadata Service API. The only structural differences between the two specifications are in some message properties. The messages modified in the Kubernetes SnapshotMetadata API are shown below:

message GetMetadataAllocatedRequest {
  string security_token = 1;
  string namespace = 2;
  string snapshot_name = 3;
  int64 starting_offset = 4;
  int32 max_results = 5;
}

message GetMetadataDeltaRequest {
  string security_token = 1;
  string namespace = 2;
  string base_snapshot_name = 3;
  string target_snapshot_name = 4;
  int64 starting_offset = 5;
  int32 max_results = 6;
}

The full specification of the Kubernetes SnapshotMetadata API will be published in the source code repository of the external-snapshot-metadata sidecar.
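The relationship between the two request messages can be sketched as a translation step of the kind the sidecar would perform. The dataclasses below are hand-written stand-ins for the generated message types, and `resolve_snapshot_id` is a hypothetical lookup that maps a namespaced VolumeSnapshot name to the SP snapshot identifier.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class K8sAllocatedRequest:
    """Stand-in for the Kubernetes GetMetadataAllocatedRequest message."""
    security_token: str
    namespace: str
    snapshot_name: str
    starting_offset: int = 0
    max_results: int = 0


@dataclass
class CSIAllocatedRequest:
    """Stand-in for the CSI GetMetadataAllocatedRequest message."""
    snapshot_id: str
    starting_offset: int = 0
    max_results: int = 0
    secrets: Dict[str, str] = field(default_factory=dict)


def translate(req: K8sAllocatedRequest,
              resolve_snapshot_id: Callable[[str, str], str],
              secrets: Dict[str, str]) -> CSIAllocatedRequest:
    """Drop the token, resolve the snapshot name, and attach provisioner secrets."""
    return CSIAllocatedRequest(
        snapshot_id=resolve_snapshot_id(req.namespace, req.snapshot_name),
        starting_offset=req.starting_offset,
        max_results=req.max_results,
        secrets=dict(secrets),
    )


# Hypothetical lookup table in place of reading VolumeSnapshotContent objects.
handles = {("apps", "snap-1"): "sp-snap-0001"}
csi_req = translate(
    K8sAllocatedRequest("token", "apps", "snap-1", starting_offset=4096),
    lambda ns, name: handles[(ns, name)],
    {"key": "value"},
)
```

The security token is consumed by the sidecar and never reaches the CSI driver, while the secrets travel only on the private side of the proxy.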

Snapshot Metadata Service Custom Resource

SnapshotMetadataService is a cluster-scoped Custom Resource that defines the existence of, and contains information needed to connect to the Kubernetes SnapshotMetadata gRPC service provided by the external-snapshot-metadata sidecar deployed by a CSI driver.

The CR name should be that of the associated CSI driver to ensure that only one such CR is created for a given driver.

The CR spec contains the following fields:

  • address
    Specifies the IP address or DNS name of the gRPC service. It should be provided in the format host:port, without specifying the scheme (e.g., http or https).
  • caCert
    Specifies the CA certificate used to enable TLS (Transport Layer Security) security for gRPC calls made to the service.
  • audience
    Specifies the audience string value expected in an audience-scoped authentication token presented to the sidecar by a Kubernetes client. The value should be unique to the service if possible; for example, it could be the DNS name of the service.

The full Custom Resource Definition is shown below:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.11.1
    api-approved.kubernetes.io: unapproved
  creationTimestamp: null
  name: snapshotmetadataservices.cbt.storage.k8s.io
spec:
  group: cbt.storage.k8s.io
  names:
    kind: SnapshotMetadataService
    listKind: SnapshotMetadataServiceList
    plural: snapshotmetadataservices
    singular: snapshotmetadataservice
  scope: Cluster
  versions:
  - name: v1alpha1
    schema:
      openAPIV3Schema:
        description: 'The presence of a SnapshotMetadataService CR advertises the existence of a CSI
          driver''s Kubernetes SnapshotMetadata gRPC service.
          An audience scoped Kubernetes authentication bearer token must be passed in the
          "security_token" field of each gRPC call made by a Kubernetes backup client.'
        properties:
          apiVersion:
            description: 'APIVersion defines the versioned schema of this representation
              of an object. Servers should convert recognized schemas to the latest
              internal value, and may reject unrecognized values.
              More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
            type: string
          kind:
            description: 'Kind is a string value representing the REST resource this
              object represents. Servers may infer this from the endpoint the client
              submits requests to. Cannot be updated. In CamelCase.
              More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
            type: string
          metadata:
            type: object
          spec:
            description: This contains data needed to connect to a Kubernetes SnapshotMetadata gRPC service.
            properties:
              address:
                type: string
                description: The TCP endpoint address of the gRPC service.
              audience:
                type: string
                description: The audience string value expected in a client's authentication token passed in the
                  "security_token" field of each gRPC call.
              caCert:
                description: Certificate authority bundle needed by the client to validate the service.
                format: byte
                type: string
            type: object
        type: object
    served: true
    storage: true
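For illustration, a SnapshotMetadataService object conforming to the definition above could be assembled as follows. The driver name, Service address, audience and certificate bytes are placeholder values, and `validate_address` is a hypothetical helper that only checks the host:port shape described for the address field (it does not handle bracketed IPv6 literals specially).

```python
import base64


def validate_address(address: str) -> None:
    """Check that address looks like host:port with no scheme."""
    if "://" in address:
        raise ValueError("address must not include a scheme")
    host, sep, port = address.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("address must be in host:port format")
    if not 0 < int(port) < 65536:
        raise ValueError("port out of range")


def snapshot_metadata_service(driver: str, address: str, audience: str, ca_pem: bytes) -> dict:
    """Build a SnapshotMetadataService object named after its CSI driver."""
    validate_address(address)
    return {
        "apiVersion": "cbt.storage.k8s.io/v1alpha1",
        "kind": "SnapshotMetadataService",
        "metadata": {"name": driver},
        "spec": {
            "address": address,
            "audience": audience,
            # The caCert field has format "byte", so it is base64-encoded.
            "caCert": base64.b64encode(ca_pem).decode("ascii"),
        },
    }


cr = snapshot_metadata_service(
    driver="example.csi.k8s.io",
    address="snapshot-metadata.csi-driver.svc:6443",
    audience="snapshot-metadata.csi-driver.svc",
    ca_pem=b"-----BEGIN CERTIFICATE-----\n...placeholder...\n-----END CERTIFICATE-----\n",
)
```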

The SP Snapshot Metadata Service

The SP must provide a container that implements the CSI SnapshotMetadata gRPC service that sources the snapshot metadata requested by a Kubernetes client. The service runs under the authority of the CSI driver ServiceAccount, which must be authorized as described in Risks and Mitigations.

The CSI driver should delegate all Kubernetes client interaction to the community provided external-snapshot-metadata sidecar which must be deployed in a pod alongside its service container, and configured to communicate with it over a UNIX domain socket. The sidecar will translate all Kubernetes object identifiers used by the Kubernetes client to SP identifiers before forwarding the remote procedure calls.

The SP service decides whether the metadata is returned in block-based format (block_metadata_type is FIXED_LENGTH) or extent-based format (block_metadata_type is VARIABLE_LENGTH). In any given RPC call, the block_metadata_type and volume_capacity_bytes return properties should be constant; likewise the size_bytes of all BlockMetadata entries if the block_metadata_type value returned is FIXED_LENGTH.
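Because either format may be returned, a client may find it convenient to normalize both into extents before reading data. The sketch below merges adjacent tuples; it is one possible client-side approach and not something the proposal requires.

```python
from typing import Iterable, List, Tuple

Block = Tuple[int, int]  # (byte_offset, size_bytes)


def coalesce(blocks: Iterable[Block]) -> List[Block]:
    """Merge tuples that are directly adjacent into larger extents."""
    extents: List[Block] = []
    for byte_offset, size_bytes in blocks:
        if extents and extents[-1][0] + extents[-1][1] == byte_offset:
            # This tuple starts where the previous extent ends: extend it.
            extents[-1] = (extents[-1][0], extents[-1][1] + size_bytes)
        else:
            extents.append((byte_offset, size_bytes))
    return extents


# Block-based input with 4096-byte blocks, two of them contiguous.
fixed = [(0, 4096), (4096, 4096), (16384, 4096)]
extents = coalesce(fixed)
```

Applied to extent-based input the function leaves non-adjacent extents unchanged, so the same read loop can serve both formats.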

The External Snapshot Metadata Sidecar

The external-snapshot-metadata sidecar is a community provided container that handles all aspects of Kubernetes client interaction for a SP Snapshot Metadata Service. The sidecar should be configured to run under the authority of the CSI driver ServiceAccount, which must be authorized as described in Risks and Mitigations.

A Service object must be created for the TCP based Kubernetes SnapshotMetadata gRPC service implemented by the sidecar. A SnapshotMetadataService CR must be created for the gRPC service within the sidecar. The CR contains the CA certificate and Service endpoint address of the sidecar and the audience string needed for the client's authentication token. The sidecar must be configured with the name of this CR object.

The sidecar must be deployed in the same pod as the SP Snapshot Metadata Service and must be configured to communicate with it through a UNIX domain socket.

The sidecar acts as a proxy for the SP Snapshot Metadata Service, handling all aspects of the Kubernetes client interaction described in this proposal, including

  • Authenticating the client.
  • Authorizing the client.
  • Validating individual RPC arguments.
  • Translating RPC arguments from the Kubernetes domain to the SP domain at runtime.
  • Fetching the vendor secrets identified by the VolumeSnapshot object, if any.

A Kubernetes client must provide an audience-scoped authentication token in the security_token field of every remote procedure call request message. The sidecar will use the TokenReview API to validate this authentication token, using the audience string specified in the SnapshotMetadataService CR.
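The decision made on the TokenReview result might look like the following sketch. The `reviewStatus` type is a simplified, hypothetical mirror of the fields a TokenReview status reports (authenticated flag, audiences, user name); the actual API call is omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// reviewStatus is a simplified stand-in for the status of a TokenReview.
type reviewStatus struct {
	Authenticated bool
	Audiences     []string
	Username      string
}

// acceptToken returns the authenticated user name only if the token was
// authenticated and is valid for the audience from the CR.
func acceptToken(st reviewStatus, wantAudience string) (string, error) {
	if !st.Authenticated {
		return "", errors.New("token is not authenticated")
	}
	for _, a := range st.Audiences {
		if a == wantAudience {
			return st.Username, nil
		}
	}
	return "", fmt.Errorf("token is not valid for audience %q", wantAudience)
}

func main() {
	st := reviewStatus{
		Authenticated: true,
		Audiences:     []string{"example-audience"}, // placeholder
		Username:      "system:serviceaccount:backup:app",
	}
	user, err := acceptToken(st, "example-audience")
	fmt.Println(user, err)
}
```

Checking the returned audiences in addition to the authenticated flag guards against a token that is valid for the cluster but was issued for a different service.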

If the client is authenticated, the sidecar will then use the returned UserInfo in the SubjectAccessReview API to verify that the client has access to VolumeSnapshot objects in the namespace specified in the remote procedure call.

The sidecar will attempt to load the VolumeSnapshots specified along with their associated VolumeSnapshotContent objects, to ensure that they still exist, belong to the CSI driver, and to obtain their SP identifiers. Additional checks may be performed depending on the RPC; for example, in the case of a GetMetadataDelta RPC, it will check that all the snapshots come from the same volume and that the snapshot order is correct.

If all checks are successful, the RPC call is forwarded to the CSI SnapshotMetadata gRPC service in the SP Snapshot Metadata Service container over the UNIX domain socket, with its input parameters appropriately translated. The metadata result stream is sent back to the calling Kubernetes client.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests

All unit tests will be included in the out-of-tree CSI repositories, with no impact on the test coverage of the core packages.

These include:

  • Unit tests to cover the logic around retrieving the audience-scoped token and volume snapshot name and namespace from the gRPC request metadata.
  • Unit tests to cover the logic around getting the snapshot, snapshot content, snapshot class, volume handle, and metadata service resources.
  • Unit tests to cover the logic around creating token review and subject access review resources.
  • Unit tests to ensure no tokens/secrets are logged by k8s-csi logging methods.
Integration tests

No integration tests are required. This feature is better tested with e2e tests.

e2e tests

The prototype project of this KEP contains a sample gRPC client that can be used to simulate gRPC requests to the SnapshotMetadata service.

The e2e tests will test the SnapshotMetadata service's ability to:

  • Handle and authenticate gRPC requests from the sample client to:
    • Get allocated blocks of a PVC
    • Get changed blocks between a pair of snapshots
  • Get the following kinds of API resources, based on the gRPC requests and CSI provisioner:
    • SnapshotMetadataService
    • TokenReview
    • SubjectAccessReview
    • VolumeSnapshot
    • VolumeSnapshotContent
  • Marshal and stream responses from the mock SP service to the sample client

Graduation Criteria

Alpha

  • Approval of the proposed CRDs and gRPC specification.
  • The SnapshotMetadata service can handle gRPC requests to get allocated and changed blocks, and stream responses back to client, without imposing load on the K8s API server.
  • Initial e2e tests completed and enabled.

Beta

  • Involve 2 different storage providers and backup applications in the implementations of the SnapshotMetadata service to enable successful e2e backup workflow.
  • Increase e2e test coverage.
  • Gather feedback from CSI driver maintainers and backup users, especially on performance metrics.

GA

  • CSI drivers include the SnapshotMetadata service as part of their distributions.
  • Allow time for user feedback.

Upgrade / Downgrade Strategy

Version Skew Strategy

If the external-snapshot-metadata sidecar is added to an older CSI driver that doesn't implement the CBT feature, the sidecar returns a gRPC UNIMPLEMENTED status code (HTTP 501) to the backup application.

The snapshot metadata components support the v1 snapshot objects (VolumeSnapshot, VolumeSnapshotContent, VolumeSnapshotClass etc.), starting with snapshot controller v7.0. There are no plans to support earlier beta and alpha versions of the snapshot objects.

This enhancement focuses only on the control plane with no effects on the kubelet.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name:
    • Components depending on the feature gate:
  • Other
    • Describe the mechanism: The new components will be implemented as part of the out-of-tree CSI framework. Storage providers can embed the sidecar component in their CSI drivers, if they choose to support this feature.
    • Will enabling / disabling the feature require downtime of the control plane? No.
    • Will enabling / disabling the feature require downtime or reprovisioning of a node? No.
Does enabling the feature change any default behavior?

No.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

The feature can be disabled by removing the sidecar from the CSI driver.

What happens if we reenable the feature if it was previously rolled back?

No effects.

Are there any tests for feature enablement/disablement?

No.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .status
    • Condition name:
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?
  • API call type (e.g. PATCH pods):
    • GET VolumeSnapshot, VolumeSnapshotContent, VolumeSnapshotClass, SnapshotMetadataService
    • CREATE TokenRequest, TokenReview, SubjectAccessReview, SnapshotMetadataService
  • estimated throughput: one object of each kind, per request
  • originating component(s) (e.g. Kubelet, Feature-X-controller):
    • user's backup application
    • snapshot metadata sidecar
  • components listing and/or watching resources they didn't before: N/A
  • API calls that may be triggered by changes of some Kubernetes resources (e.g. update of object X triggers new updates of object Y): N/A. All API calls are initiated by the user's backup application
  • periodic API calls to reconcile state (e.g. periodic fetching state, heartbeats, leader election, etc.): N/A
Will enabling / using this feature result in introducing new API types?
  • API type: SnapshotMetadataService
  • Supported number of objects per cluster: One object for every CSI driver that supports CBT
  • Supported number of objects per namespace (for namespace-scoped objects): N/A
Will enabling / using this feature result in any new calls to the cloud provider?

The CSI driver plugin will call the cloud provider's CBT API to retrieve the CBT snapshot metadata.

Will enabling / using this feature result in increasing size or count of the existing API objects?

Existing API objects will not be affected.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

Alternatives

The aggregated API server solution described in #3367 was deemed unsuitable because of the potentially large volume of CBT payloads that would be proxied through the K8s API server. Further discussion can be found in this thread.

An approach based on using a volume populator to store the CBT payloads on intermediary storage, instead of sending them over the network, was also considered. However, the pod creation/deletion churn and latency incurred made this solution inappropriate.

The previous design, which involved generating and returning a RESTful callback endpoint to the caller to serve CBT payloads, was superseded by the aggregation extension mechanism described in #3367, due to the requirement for more structured request and response payloads.

Infrastructure Needed (Optional)