- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
This KEP proposes a new CSI API that can be used to identify the list of changed blocks between pairs of CSI volume snapshots. CSI drivers can implement this API to expose their changed block tracking (CBT) services and enable efficient, reliable differential backup of data stored in CSI volumes.
Kubernetes backup applications use this API directly to stream changed block information, bypassing the Kubernetes API server and placing no additional load on it. The mechanism that enables this direct access uses a proxy service sidecar to shield the CSI drivers from having to manage individual Kubernetes clients.
Changed block tracking (CBT) techniques have been used by commercial backup systems to efficiently back up large amounts of data in block volumes. They identify block-level changes between an arbitrary pair of snapshots of the same block volume, and selectively back up what has changed between the two checkpoints. This differential backup approach is far more efficient than backing up the entire volume.
This KEP proposes a design to extend the Kubernetes CSI framework to utilize these CBT features to bring efficient, cloud-native data protection to Kubernetes users.
- Provide a secure, idiomatic CSI API to efficiently identify the allocated blocks of a CSI volume snapshot, and the changed blocks between an arbitrary pair of CSI volume snapshots of the same block volume.
- Relay large amounts of snapshot metadata from the storage provider without overloading the Kubernetes API server.
- This API is an optional component of the CSI framework.
- Specify how data is written to the block volume in the first place. The volume could be attached to a pod with either `Block` or `Filesystem` volume modes.
- Provide an API to retrieve the data blocks of a snapshot. It is assumed that a snapshot's data blocks can be retrieved by creating a PersistentVolume for the snapshot, launching a pod with this volume attached in `Block` volume mode, and then reading the individual blocks from the raw block device.
- Support for file changed-list tracking for network file shares is not addressed by this proposal.
The proposal extends the CSI specification with a new, optional, CSI SnapshotMetadata gRPC service, that is used by Kubernetes to retrieve metadata on the allocated blocks of a single snapshot, or the changed blocks between a pair of snapshots of the same block volume.
A Kubernetes SnapshotMetadata gRPC service is an API that is used by a Kubernetes backup application client to retrieve snapshot metadata. This API is implemented by the community provided external-snapshot-metadata sidecar, which must be deployed by a CSI driver. A Kubernetes backup application retrieves snapshot metadata through a TLS gRPC connection to such a service. This direct connection places minimal load on the Kubernetes API server, unrelated to the amount of metadata transferred or the sizes of the volumes and snapshots involved.
The external-snapshot-metadata sidecar communicates over a private UNIX domain socket with the CSI driver's implementation of the CSI SnapshotMetadata gRPC service. The CSI driver service only handles the retrieval of the metadata requested; the sidecar is responsible for validating the Kubernetes authentication token, authorizing the backup application, validating the parameters of the RPC calls and fetching the provisioner secrets needed to complete the request. The sidecar forwards the RPC call to the CSI driver service over the UNIX domain socket, after translating Kubernetes object names into SP object names, and re-streams the results back to its client.
A CSI driver advertises the existence of the community sidecar's Kubernetes SnapshotMetadata gRPC service to Kubernetes backup applications by creating a SnapshotMetadataService CR that contains the service's TCP endpoint address, CA certificate and an audience string needed for token authentication. The CSI driver name is specified in a metadata label in this CR, so that a backup application can efficiently search for the Kubernetes SnapshotMetadata gRPC service of the provisioner of the VolumeSnapshots to be backed up.
Before accessing a Kubernetes SnapshotMetadata gRPC service, a backup application must first obtain an authentication token using the Kubernetes TokenRequest API with the service's audience string. It should establish trust with the specified CA for use in gRPC calls and then directly make TLS gRPC calls to the Kubernetes SnapshotMetadata gRPC service's TCP endpoint. The audience-scoped authentication token must be passed in the `security_token` field of each RPC request message; it will be used to authorize the backup application's use of the service. Every RPC returns a gRPC stream through which the metadata can be recovered.
The process of accessing snapshot metadata via the sidecar is illustrated in the figure below. Additional information is available in the Design Details section.
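The client-side flow above can be sketched as follows. This is a minimal illustration, not the real API: the `GetMetadataAllocatedRequest` struct and the `tokenRequester` function are hypothetical stand-ins for the generated gRPC message type and a client-go call to the TokenRequest API (e.g. the ServiceAccount `CreateToken` method), which are not reproduced here.

```go
package main

import "fmt"

// GetMetadataAllocatedRequest mirrors the shape of the Kubernetes
// SnapshotMetadata API request message; SecurityToken carries the
// audience-scoped token required by every RPC.
type GetMetadataAllocatedRequest struct {
	SecurityToken  string
	Namespace      string
	SnapshotName   string
	StartingOffset int64
	MaxResults     int32
}

// tokenRequester stands in for a call to the Kubernetes TokenRequest API
// made with the audience string advertised in the SnapshotMetadataService CR.
type tokenRequester func(audience string) (string, error)

// newAllocatedRequest obtains an audience-scoped token and embeds it in the
// request message, as the proposal requires for each RPC.
func newAllocatedRequest(requestToken tokenRequester, audience, namespace, snapshot string) (*GetMetadataAllocatedRequest, error) {
	token, err := requestToken(audience)
	if err != nil {
		return nil, err
	}
	return &GetMetadataAllocatedRequest{
		SecurityToken: token,
		Namespace:     namespace,
		SnapshotName:  snapshot,
	}, nil
}

func main() {
	// A fake token source; a real client would call the TokenRequest API.
	fakeToken := func(aud string) (string, error) { return "token-for-" + aud, nil }
	req, _ := newAllocatedRequest(fakeToken, "snapshot-metadata.example.com", "apps", "snap-1")
	fmt.Println(req.SecurityToken) // prints "token-for-snapshot-metadata.example.com"
}
```

After building the request, the client would open a TLS gRPC stream to the endpoint in the CR and read response messages until the stream ends.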
A backup application needs to perform a full backup on volumes of a specific Kubernetes application.
For each volume in the application:
- The backup application creates a VolumeSnapshot of a PVC that needs to be backed up.
- The backup application queries the changed block tracking (CBT) service to identify all the allocated data blocks in the snapshot. The CBT service returns the list of allocated blocks.
- Using the VolumeSnapshot as the source, the backup application creates a new PVC and mounts it with `Block` VolumeMode in a pod.
- The backup application uses the CBT metadata to identify the data that needs to be backed up and reads these blocks from the mounted PVC in the pod.
A backup application needs to perform an incremental backup on volumes of a specific Kubernetes application. The backup application knows the identifiers of the VolumeSnapshots it had backed up previously.
For each volume in the application:
- The backup application creates a VolumeSnapshot of a PVC that needs to be backed up incrementally.
- The backup application queries the changed block tracking (CBT) service to identify the changes between the latest snapshot and the one it had previously backed up. The CBT service returns the list of blocks changed between the snapshots.
- Using the latest VolumeSnapshot as the source, the backup application creates a new PVC and mounts it with `Block` VolumeMode in a pod.
- The backup application uses the CBT metadata to find only the changed data to back up and reads these blocks from the mounted PVC in the pod.
- This proposal requires a backup application to directly connect to a Kubernetes SnapshotMetadata gRPC service offered by the community provided external-snapshot-metadata sidecar deployed by a CSI driver. This is necessary to avoid placing a load on the Kubernetes API server proportional to the number of allocated blocks in a volume snapshot.
- Each Kubernetes SnapshotMetadata gRPC service only operates on volume snapshots provisioned by the CSI driver that deploys the related sidecar. A backup application is responsible for locating the service for a CSI driver by searching for its SnapshotMetadataService CR. This search can fail because the service is optional.
- A backup application must obtain a Kubernetes audience-scoped authentication token in order to use a Kubernetes SnapshotMetadata gRPC service. This requires that the backup application be authorized to use the Kubernetes TokenRequest API. The token can be obtained directly by a call to this API, or indirectly via a projected volume in the Pod used to access the Kubernetes SnapshotMetadata API.
- The Kubernetes audience-scoped authentication token must be provided as the `security_token` field in each gRPC request message made by a backup application.
- The CSI SnapshotMetadata gRPC service RPC calls allow an application to restart an interrupted stream from where it previously failed by reissuing the RPC call with a starting byte offset. The same functionality is available through the Kubernetes SnapshotMetadata API.
- The CSI SnapshotMetadata gRPC service permits metadata to be returned in either an extent-based or a block-based format, at the discretion of the CSI driver. A portable backup application is expected to handle both formats. This also applies to the Kubernetes SnapshotMetadata API.
- The CSI SnapshotMetadata gRPC service must be capable of serving metadata on a VolumeSnapshot concurrently with the backup application's use of a PersistentVolume created on that same VolumeSnapshot. This is because a backup application would likely mount the PersistentVolume with `Block` VolumeMode in a Pod in order to read and archive the raw snapshot data blocks, and this read/archive loop will be driven by the stream of snapshot block metadata.
- The proposal does not specify how its security model is to be implemented. It is expected that the RBAC policies used by backup applications and the existing CSI drivers will be extended for this purpose.
A review by SIG-Auth (July 19, 2023) recommended the use of the TokenRequest and TokenReview APIs to make authentication and authorization checks possible between authorized Kubernetes principals.
The following risks are identified:
- Exposure of snapshot metadata by the use of the networked Kubernetes SnapshotMetadata gRPC API.
- Uncontrolled access to the Kubernetes SnapshotMetadata service could lead to denial-of-service attacks.
- A principal with the authority to use a Kubernetes SnapshotMetadata gRPC service indirectly gains access to the metadata of otherwise inaccessible VolumeSnapshots.
The risks are mitigated as follows:
- The possible exposure of snapshot metadata by use of a network API is addressed by using encryption and mutual authentication for the direct gRPC call made by the backup application client. The gRPC client is required to first establish trust with the service's CA, and, while the direct TLS gRPC call itself does not perform mutual authentication, an audience-scoped authentication token must be passed as a parameter in each RPC call, which provides the mechanism for the service to both authenticate and authorize the client. The audience-scoped authentication token is obtained from the Kubernetes TokenRequest API and is validated with the Kubernetes TokenReview API. Its scope is narrowed to just the target service (the "audience") by specifying an audience string, defined by and unique to the service, during token creation. An authentication token has an expiry time, so it will not last forever; additionally, it can be bound to the Pod used by the backup application to access the service, to further constrain its effective lifetime.
- Access to a Kubernetes SnapshotMetadata gRPC service, to the VolumeSnapshots referenced through the service, and the ability to use the TokenRequest, TokenReview and SubjectAccessReview APIs are controlled by Kubernetes security policy.
The proposal requires the existence of security policy to establish the access rights described below, and illustrated in the following figure:
The proposal requires that Kubernetes security policy authorize access to:
- The SnapshotMetadataService CR objects that advertise the existence of Kubernetes SnapshotMetadata gRPC services available in the cluster. These objects do not contain secret information, so limiting access just controls which principals obtain the service contact information. At the least, backup applications should be permitted to read these objects.
- Backup applications must be granted permission to use the Kubernetes TokenRequest API in order to obtain the audience-scoped authentication tokens that are passed in each Kubernetes SnapshotMetadata gRPC service RPC call.
- Backup applications must be granted access to view VolumeSnapshot objects in the target namespaces. Presumably they already have such permission if they were the ones initiating the creation of the VolumeSnapshot objects.
- The CSI driver service account must be granted permission to use the Kubernetes TokenReview API in order for its Kubernetes SnapshotMetadata gRPC service to validate the authentication token.
- The CSI driver must be granted permission to use the Kubernetes SubjectAccessReview API in order for its Kubernetes SnapshotMetadata gRPC service to validate that a security token authorizes access to VolumeSnapshot objects in a namespace.
- The CSI driver presumably already has access rights to the VolumeSnapshot and VolumeSnapshotContent objects as they are within its purview. This is needed for its Kubernetes SnapshotMetadata gRPC service.
The proposal does not specify how such a security policy is to be configured.
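As one illustration of what such a policy could look like, the RBAC objects below sketch the grants described above. All names are placeholders and the exact resources and verbs would need to be adapted to the cluster; note that TokenRequest corresponds to the `serviceaccounts/token` subresource, and TokenReview and SubjectAccessReview are create-only resources.

```yaml
# Illustrative only: names are placeholders; adapt to the actual cluster.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: backup-app-snapshot-metadata
rules:
  # Read the CRs advertising Kubernetes SnapshotMetadata gRPC services.
  - apiGroups: ["cbt.storage.k8s.io"]
    resources: ["snapshotmetadataservices"]
    verbs: ["get", "list", "watch"]
  # View VolumeSnapshot objects (cluster-wide here for brevity; a
  # namespaced Role may be preferable for the target namespaces).
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshots"]
    verbs: ["get", "list"]
  # Obtain audience-scoped tokens via the TokenRequest API.
  - apiGroups: [""]
    resources: ["serviceaccounts/token"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: csi-driver-snapshot-metadata
rules:
  # Validate client tokens via the TokenReview API.
  - apiGroups: ["authentication.k8s.io"]
    resources: ["tokenreviews"]
    verbs: ["create"]
  # Check client access to VolumeSnapshots via the SubjectAccessReview API.
  - apiGroups: ["authorization.k8s.io"]
    resources: ["subjectaccessreviews"]
    verbs: ["create"]
```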
In this section we use the terminology of the gRPC specification where the word service is assumed to be a gRPC service and not a Kubernetes service, while the word plugin is used to refer to the software component that implements the gRPC service.
The CSI specification will be extended with the addition of the following new, optional SnapshotMetadata gRPC service. The SP Snapshot Metadata plugin implements this service.
The service is defined as follows, and will be described in the sub-sections below. Refer to CSI PR 551 for the official specification.
```proto
service SnapshotMetadata {
  rpc GetMetadataAllocated(GetMetadataAllocatedRequest)
      returns (stream GetMetadataAllocatedResponse) {}

  rpc GetMetadataDelta(GetMetadataDeltaRequest)
      returns (stream GetMetadataDeltaResponse) {}
}

enum BlockMetadataType {
  UNKNOWN = 0;
  FIXED_LENGTH = 1;
  VARIABLE_LENGTH = 2;
}

message BlockMetadata {
  int64 byte_offset = 1;
  int64 size_bytes = 2;
}

message GetMetadataAllocatedRequest {
  string snapshot_id = 1;
  int64 starting_offset = 2;
  int32 max_results = 3;
  map<string, string> secrets = 4;
}

message GetMetadataAllocatedResponse {
  BlockMetadataType block_metadata_type = 1;
  int64 volume_capacity_bytes = 2;
  repeated BlockMetadata block_metadata = 3;
}

message GetMetadataDeltaRequest {
  string base_snapshot_id = 1;
  string target_snapshot_id = 2;
  int64 starting_offset = 3;
  int32 max_results = 4;
  map<string, string> secrets = 5;
}

message GetMetadataDeltaResponse {
  BlockMetadataType block_metadata_type = 1;
  int64 volume_capacity_bytes = 2;
  repeated BlockMetadata block_metadata = 3;
}
```
Block volume data ranges are specified by a sequence of `(ByteOffset, Length)` tuples, with the tuples in ascending order of `ByteOffset` and no overlap between adjacent tuples. There are two prevalent styles, extent-based or block-based, which describe whether the `Length` field of the tuples in a sequence can vary or is fixed across all the tuples in the sequence. The SnapshotMetadata service permits either style at the discretion of the plugin, and a client of this service is required to be able to handle both styles. The `BlockMetadataType` enumeration specifies the style used: `FIXED_LENGTH` or `VARIABLE_LENGTH`. When the block-based style (`FIXED_LENGTH`) is used, it is up to the SP plugin to define the block size.

An individual tuple is identified by the `BlockMetadata` message, and the sequence is defined collectively across the tuple lists returned in the RPC message stream. Note that the plugin must ensure that the style does not change mid-stream in any given RPC invocation.
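A client-side check of these invariants can be sketched as follows; `validateSequence` is a hypothetical helper, not part of the specification, and the `BlockMetadata` struct merely mirrors the message above.

```go
package main

import "fmt"

// BlockMetadata mirrors the (byte_offset, size_bytes) tuple from the
// SnapshotMetadata service definition.
type BlockMetadata struct {
	ByteOffset int64
	SizeBytes  int64
}

// validateSequence checks the invariants described above: tuples must be in
// ascending order of ByteOffset, adjacent tuples must not overlap, and when
// fixedLength is true (FIXED_LENGTH style) every tuple must share the same
// SizeBytes.
func validateSequence(tuples []BlockMetadata, fixedLength bool) error {
	var prevEnd int64 = -1
	for i, t := range tuples {
		if t.ByteOffset < 0 || t.SizeBytes <= 0 {
			return fmt.Errorf("tuple %d: invalid offset or size", i)
		}
		if t.ByteOffset < prevEnd {
			return fmt.Errorf("tuple %d: overlaps or out of order", i)
		}
		if fixedLength && i > 0 && t.SizeBytes != tuples[0].SizeBytes {
			return fmt.Errorf("tuple %d: size varies in FIXED_LENGTH stream", i)
		}
		prevEnd = t.ByteOffset + t.SizeBytes
	}
	return nil
}

func main() {
	extents := []BlockMetadata{{0, 4096}, {8192, 12288}} // VARIABLE_LENGTH style
	blocks := []BlockMetadata{{0, 4096}, {4096, 4096}}   // FIXED_LENGTH style
	fmt.Println(validateSequence(extents, false) == nil) // prints true
	fmt.Println(validateSequence(blocks, true) == nil)   // prints true
}
```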
The `GetMetadataAllocated` RPC returns metadata on the allocated blocks of a snapshot - i.e., it identifies the data ranges that have valid data because they were the target of some previous write operation. Backup applications typically make an initial full backup of a volume followed by a series of incremental backups, and the size of the initial full backup can be reduced considerably if only the allocated blocks are saved.

The RPC's input arguments are specified by the `GetMetadataAllocatedRequest` message, and it returns a stream of `GetMetadataAllocatedResponse` messages.
The fields of the `GetMetadataAllocatedRequest` message are defined as follows:

- `snapshot_id` - The identifier of a snapshot of the specified volume, in the nomenclature of the plugin.
- `starting_offset` - The 0-based starting byte position in the volume snapshot from which the result should be computed. It is intended to be used to continue a previously interrupted call. The plugin may round down this offset to the nearest alignment boundary based on the `BlockMetadataType` it will use.
- `max_results` - An optional field. If non-zero, it specifies the maximum length of the `block_metadata` list that the client wants to process in a given `GetMetadataAllocatedResponse` element. The plugin will determine an appropriate value if 0, and is always free to send fewer than the requested maximum.
- `secrets` - An optional field. It should contain the provisioner secrets associated with the volume snapshot, if any. In Kubernetes such data is specified with the keys `csi.storage.k8s.io/snapshotter-secret-[name|namespace]` in the `VolumeSnapshotClass.Parameters` field.
The fields of the `GetMetadataAllocatedResponse` message are defined as follows:

- `block_metadata_type` - Specifies the metadata format as described in the Metadata Format section above.
- `volume_capacity_bytes` - The size of the underlying volume, in bytes.
- `block_metadata` - A list of `BlockMetadata` tuples as described in the Metadata Format section above. The caller may request a maximum length of this list in the `max_results` field of the `GetMetadataAllocatedRequest` message; otherwise the length is determined by the plugin.
Note that while the `block_metadata_type` and `volume_capacity_bytes` fields are repeated in each `GetMetadataAllocatedResponse` message by the nature of the syntax of the specification language, their values in a given RPC invocation must be constant; i.e., a plugin is not free to modify these values mid-stream.
If the plugin is unable to complete the `GetMetadataAllocated` call successfully, it must return a non-OK gRPC code in the gRPC status. The following conditions are well defined:
| Condition | gRPC Code | Description | Recovery Behavior |
|---|---|---|---|
| Missing or otherwise invalid argument | 3 INVALID_ARGUMENT | Indicates that a required argument field was not specified or an argument value is invalid. | The caller should correct the error and resubmit the call. |
| Invalid snapshot | 5 NOT_FOUND | Indicates that the snapshot specified was not found. | The caller should re-check that this object exists. |
| Invalid `starting_offset` | 11 OUT_OF_RANGE | The starting offset exceeds the volume size. | The caller should specify a `starting_offset` less than the volume's size. |
The `GetMetadataDelta` RPC returns metadata on the blocks that have changed between a pair of snapshots from the same volume.

The RPC's input arguments are specified by the `GetMetadataDeltaRequest` message, and it returns a stream of `GetMetadataDeltaResponse` messages.
The fields of the `GetMetadataDeltaRequest` message are defined as follows:

- `base_snapshot_id` - The identifier of a snapshot of the specified volume, in the nomenclature of the plugin.
- `target_snapshot_id` - The identifier of a second snapshot of the specified volume, in the nomenclature of the plugin. This snapshot should have been created after the base snapshot, and the RPC will return the changes made since the base snapshot was created.
- `starting_offset` - The 0-based starting byte position in the `target_snapshot` from which the result should be computed. It is intended to be used to continue a previously interrupted call. The plugin may round down this offset to the nearest alignment boundary based on the `BlockMetadataType` it will use.
- `max_results` - An optional field. If non-zero, it specifies the maximum length of the `block_metadata` list that the client wants to process in a given `GetMetadataDeltaResponse` element. The plugin will determine an appropriate value if 0, and is always free to send fewer than the requested maximum.
- `secrets` - An optional field. It should contain the provisioner secrets associated with the volume snapshot, if any. In Kubernetes such data is specified with the keys `csi.storage.k8s.io/snapshotter-secret-[name|namespace]` in the `VolumeSnapshotClass.Parameters` field. This field should not be set by a Kubernetes backup application client.
The fields of the `GetMetadataDeltaResponse` message are defined as follows:

- `block_metadata_type` - Specifies the metadata format as described in the Metadata Format section above.
- `volume_capacity_bytes` - The size of the underlying volume, in bytes.
- `block_metadata` - A list of `BlockMetadata` tuples as described in the Metadata Format section above. The caller may request a maximum length of this list in the `max_results` field of the `GetMetadataDeltaRequest` message; otherwise the length is determined by the plugin.
Note that while the `block_metadata_type` and `volume_capacity_bytes` fields are repeated in each `GetMetadataDeltaResponse` message by the nature of the syntax of the specification language, their values in a given RPC invocation must be constant; i.e., a plugin is not free to modify these values mid-stream.
If the plugin is unable to complete the `GetMetadataDelta` call successfully, it must return a non-OK gRPC code in the gRPC status. The following conditions are well defined:
| Condition | gRPC Code | Description | Recovery Behavior |
|---|---|---|---|
| Missing or otherwise invalid argument | 3 INVALID_ARGUMENT | Indicates that a required argument field was not specified or an argument value is invalid. | The caller should correct the error and resubmit the call. |
| Invalid `base_snapshot` or `target_snapshot` | 5 NOT_FOUND | Indicates that the snapshots specified were not found. | The caller should re-check that these objects exist. |
| Invalid `starting_offset` | 11 OUT_OF_RANGE | The starting offset exceeds the volume size. | The caller should specify a `starting_offset` less than the volume's size. |
The following Kubernetes resources and components are involved at runtime:

- The Kubernetes SnapshotMetadata Service API, used by Kubernetes clients to retrieve snapshot metadata.
- A SnapshotMetadataService CR that advertises the existence of a Kubernetes SnapshotMetadata gRPC service.
- A CSI driver vendor provided SP Snapshot Metadata Service that actually sources the snapshot metadata required, through a CSI SnapshotMetadata gRPC service.
- A community provided external-snapshot-metadata sidecar that provides a Kubernetes SnapshotMetadata gRPC service that interacts with Kubernetes clients and proxies TCP TLS gRPC requests from Kubernetes client applications to the CSI SnapshotMetadata gRPC service implemented by a SP Snapshot Metadata Service, over a UNIX domain transport.
The proposal minimizes the use of the Kubernetes API server when retrieving snapshot metadata. Instead, it calls for a Kubernetes client to directly make a gRPC connection with a community provided external-snapshot-metadata sidecar which will proxy the calls to an SP provided service that implements the CSI SnapshotMetadata Service API.
The proposal could have called for the client to use the CSI SnapshotMetadata Service API itself when communicating with the sidecar, but there are a number of reasons to specify a different API, the Kubernetes SnapshotMetadata API. These reasons and the related modifications to the API are described below:
- Introduction of a level of indirection between the CSI specification and the Kubernetes implementation, to decouple the life-cycle of the external-snapshot-metadata sidecar from the life-cycle of the CSI specification.
- The need to pass a Kubernetes audience-scoped authentication token to the sidecar. This is handled by introducing a `security_token` field in every request message. All Kubernetes SnapshotMetadata API calls will return the `UNAUTHENTICATED` gRPC error code if this token is incorrect or adequate authority has not been granted to the invoker.
- Kubernetes snapshots must be identified by `namespace` and `name`, while the CSI specification uses a single identifier. This difference is handled by the addition of a `namespace` parameter in every Kubernetes SnapshotMetadata API message that identifies a snapshot, and the replacement of snapshot "id" parameters with "name" parameters.
- The CSI specification requires that provisioner secrets be passed to the SP service if configured for the CSI driver. These secrets are out of the purview of a backup application and are fetched and inserted by the sidecar. As such, all `secrets` parameters are removed from the Kubernetes SnapshotMetadata API messages.
The proposed Kubernetes SnapshotMetadata API is very similar to the CSI SnapshotMetadata Service API. The only structural differences between the two specifications are in some message properties. The messages modified in the Kubernetes SnapshotMetadata API are shown below:
```proto
message GetMetadataAllocatedRequest {
  string security_token = 1;
  string namespace = 2;
  string snapshot_name = 3;
  int64 starting_offset = 4;
  int32 max_results = 5;
}

message GetMetadataDeltaRequest {
  string security_token = 1;
  string namespace = 2;
  string base_snapshot_name = 3;
  string target_snapshot_name = 4;
  int64 starting_offset = 5;
  int32 max_results = 6;
}
```
The full specification of the Kubernetes SnapshotMetadata API will be published in the source code repository of the external-snapshot-metadata sidecar.
`SnapshotMetadataService` is a cluster-scoped Custom Resource that defines the existence of, and contains information needed to connect to, the Kubernetes SnapshotMetadata gRPC service provided by the external-snapshot-metadata sidecar deployed by a CSI driver. The CR name should be that of the associated CSI driver to ensure that only one such CR is created for a given driver.

The CR `spec` contains the following fields:

- `address` - The IP address or DNS name of the gRPC service. It should be provided in the format `host:port`, without specifying the scheme (e.g., `http` or `https`).
- `caCert` - The CA certificate used to enable TLS (Transport Layer Security) for gRPC calls made to the service.
- `audience` - The audience string value expected in an audience-scoped authentication token presented to the sidecar by a Kubernetes client. The value should be unique to the service if possible; for example, it could be the DNS name of the service.
The full Custom Resource Definition is shown below:
```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.11.1
    api-approved.kubernetes.io: unapproved
  creationTimestamp: null
  name: snapshotmetadataservices.cbt.storage.k8s.io
spec:
  group: cbt.storage.k8s.io
  names:
    kind: SnapshotMetadataService
    listKind: SnapshotMetadataServiceList
    plural: snapshotmetadataservices
    singular: snapshotmetadataservice
  scope: Cluster
  versions:
    - name: v1alpha1
      schema:
        openAPIV3Schema:
          description: 'The presence of a SnapshotMetadataService CR advertises the existence
            of a CSI driver''s Kubernetes SnapshotMetadata gRPC service.
            An audience scoped Kubernetes authentication bearer token must be passed in the
            "security_token" field of each gRPC call made by a Kubernetes backup client.'
          properties:
            apiVersion:
              description: 'APIVersion defines the versioned schema of this representation
                of an object. Servers should convert recognized schemas to the latest
                internal value, and may reject unrecognized values.
                More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
              type: string
            kind:
              description: 'Kind is a string value representing the REST resource this
                object represents. Servers may infer this from the endpoint the client
                submits requests to. Cannot be updated. In CamelCase.
                More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
              type: string
            metadata:
              type: object
            spec:
              description: This contains data needed to connect to a Kubernetes SnapshotMetadata gRPC service.
              properties:
                address:
                  type: string
                  description: The TCP endpoint address of the gRPC service.
                audience:
                  type: string
                  description: The audience string value expected in a client's authentication
                    token passed in the "security_token" field of each gRPC call.
                caCert:
                  description: Certificate authority bundle needed by the client to validate the service.
                  format: byte
                  type: string
              type: object
          type: object
      served: true
      storage: true
```
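For illustration, a CR conforming to this definition might look like the following; the driver name, Service address and audience string are all placeholder values.

```yaml
apiVersion: cbt.storage.k8s.io/v1alpha1
kind: SnapshotMetadataService
metadata:
  # Named after the associated CSI driver, per the convention above.
  name: hostpath.csi.k8s.io
spec:
  # TCP endpoint of the sidecar's Service, without a scheme.
  address: snapshot-metadata.csi-driver.svc:6443
  # Audience string to use with the TokenRequest API; unique to this service.
  audience: snapshot-metadata.csi-driver.svc
  # Base64-encoded CA bundle used by clients to validate the service (elided).
  caCert: <base64-encoded-CA-bundle>
```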
The SP must provide a container that implements the CSI SnapshotMetadata gRPC service that sources the snapshot metadata requested by a Kubernetes client. The service runs under the authority of the CSI driver ServiceAccount, which must be authorized as described in Risks and Mitigations.
The CSI driver should delegate all Kubernetes client interaction to the community provided external-snapshot-metadata sidecar which must be deployed in a pod alongside its service container, and configured to communicate with it over a UNIX domain socket. The sidecar will translate all Kubernetes object identifiers used by the Kubernetes client to SP identifiers before forwarding the remote procedure calls.
The SP service decides whether the metadata is returned in block-based format (`block_metadata_type` is `FIXED_LENGTH`) or extent-based format (`block_metadata_type` is `VARIABLE_LENGTH`). In any given RPC call, the `block_metadata_type` and `volume_capacity_bytes` return properties should be constant; likewise the `size_bytes` of all `BlockMetadata` entries if the `block_metadata_type` value returned is `FIXED_LENGTH`.
The external-snapshot-metadata sidecar is a community provided container that handles all aspects of Kubernetes client interaction for a SP Snapshot Metadata Service. The sidecar should be configured to run under the authority of the CSI driver ServiceAccount, which must be authorized as described in Risks and Mitigations.
A Service object must be created for the TCP based Kubernetes SnapshotMetadata gRPC service implemented by the sidecar. A SnapshotMetadataService CR must be created for the gRPC service within the sidecar. The CR contains the CA certificate and Service endpoint address of the sidecar and the audience string needed for the client's authentication token. The sidecar must be configured with the name of this CR object.
The sidecar must be deployed in the same pod as the SP Snapshot Metadata Service and must be configured to communicate with it through a UNIX domain socket.
The sidecar acts as a proxy for the SP Snapshot Metadata Service, handling all aspects of the Kubernetes client interaction described in this proposal, including:
- Authenticating the client.
- Authorizing the client.
- Validating individual RPC arguments.
- Translating RPC arguments from the Kubernetes domain to the SP domain at runtime.
- Fetching the vendor secrets identified by the VolumeSnapshot object, if any.
A Kubernetes client must provide an audience-scoped authentication token in the `security_token` field of every remote procedure call request message. The sidecar will use the TokenReview API to validate this authentication token, using the audience string specified in the `SnapshotMetadataService` CR.
If the client is authenticated, the sidecar will then use the returned UserInfo in the SubjectAccessReview API to verify that the client has access to VolumeSnapshot objects in the namespace specified in the remote procedure call.
The sidecar will attempt to load the specified VolumeSnapshot objects along with their associated VolumeSnapshotContent objects, to ensure that they still exist and belong to the CSI driver, and to obtain their SP identifiers. Additional checks may be performed depending on the RPC; for example, in the case of a GetMetadataDelta RPC, it will check that all the snapshots come from the same volume and that the snapshot order is correct.
If all checks are successful, the RPC call is forwarded to the CSI SnapshotMetadata gRPC service in the SP Snapshot Metadata Service container over the UNIX domain socket, with its input parameters appropriately translated. The metadata result stream is sent back to the calling Kubernetes client.
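The request-handling steps above can be sketched as a pipeline. This is a simplified Python simulation under stated assumptions: the real sidecar is written in Go and calls the Kubernetes TokenReview and SubjectAccessReview APIs, which are stubbed out here with hypothetical fake objects; the dictionary field names are illustrative, not the actual gRPC message fields.

```python
class Unauthenticated(Exception): pass
class PermissionDenied(Exception): pass

def handle_rpc(request, k8s, sp_service, audience):
    """Simulate the sidecar pipeline: authenticate, authorize, translate
    identifiers, then forward the call to the SP service."""
    # 1. Authenticate via TokenReview, using the audience string from the CR.
    user = k8s.token_review(request["security_token"], audience)
    if user is None:
        raise Unauthenticated("token rejected by TokenReview")
    # 2. Authorize via SubjectAccessReview: may this user read
    #    VolumeSnapshots in the namespace named in the request?
    if not k8s.subject_access_review(user, "get", "volumesnapshots",
                                     request["namespace"]):
        raise PermissionDenied("no access to VolumeSnapshots in namespace")
    # 3. Translate Kubernetes snapshot names to SP identifiers by resolving
    #    VolumeSnapshot -> VolumeSnapshotContent -> SP snapshot handle.
    sp_ids = [k8s.resolve_snapshot_handle(request["namespace"], name)
              for name in request["snapshot_names"]]
    # 4. Forward the translated call; the result is streamed to the client.
    return sp_service.get_metadata(sp_ids)

# Hypothetical fakes standing in for the cluster and the SP service.
class FakeCluster:
    def token_review(self, token, audience):
        return {"username": "backup-app"} if token == "good-token" else None
    def subject_access_review(self, user, verb, resource, namespace):
        return namespace == "backup-ns"
    def resolve_snapshot_handle(self, namespace, name):
        return f"sp-handle-{name}"

class FakeSPService:
    def get_metadata(self, sp_ids):
        return {"snapshots": sp_ids, "blocks": []}
```

The key design point the sketch preserves is ordering: no Kubernetes object is resolved, and nothing is forwarded to the SP service, until both the TokenReview and SubjectAccessReview steps have succeeded.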
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
All unit tests will be included in the out-of-tree CSI repositories, with no impact on the test coverage of the core packages.
These include:
- Unit tests to cover the logic around retrieving the audience-scoped token and volume snapshot name and namespace from the gRPC request metadata.
- Unit tests to cover the logic around getting the snapshot, snapshot content, snapshot class, volume handle, and metadata service resources.
- Unit tests to cover the logic around creating token review and subject access review resources.
- Unit tests to ensure no tokens/secrets are logged by k8s-csi logging methods.
No integration tests are required. This feature is better tested with e2e tests.
The prototype project of this KEP contains a sample gRPC client that can be used to simulate gRPC requests to the `SnapshotMetadata` service.
The e2e tests will test the `SnapshotMetadata` service's ability to:
- Handle and authenticate gRPC requests from the sample client to:
- Get allocated blocks of a PVC
- Get changed blocks between a pair of snapshots
- Get the following kinds of API resources, based on the gRPC requests and CSI
  provisioner:
  - `SnapshotMetadataService`
  - `TokenReview`
  - `SubjectAccessReview`
  - `VolumeSnapshot`
  - `VolumeSnapshotContent`
- Marshal and stream responses from the mock SP service to the sample client
- Approval of the proposed CRDs and gRPC specification.
- The `SnapshotMetadata` service can handle gRPC requests to get allocated and changed blocks, and stream responses back to the client, without imposing load on the K8s API server.
- Initial e2e tests completed and enabled.
- Involve 2 different storage providers and backup applications in implementations of the `SnapshotMetadata` service to enable a successful e2e backup workflow.
- Increase e2e test coverage.
- Gather feedback from CSI driver maintainers and backup users, especially on performance metrics.
- CSI drivers include the `SnapshotMetadata` service as part of their distros.
- Allow time for user feedback.
If the `external-snapshot-metadata` sidecar is added to an older CSI driver that doesn't implement the CBT feature, the sidecar returns a gRPC `UNIMPLEMENTED` status code (HTTP 501) to the backup application.
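A backup application can treat that status as a signal to fall back to a full backup. The sketch below simulates this in Python under stated assumptions: the numeric constant matches gRPC's `UNIMPLEMENTED` code (12), but the driver methods and the `RpcError` class are hypothetical stand-ins, not a real gRPC client API.

```python
GRPC_UNIMPLEMENTED = 12  # numeric gRPC status code for UNIMPLEMENTED

class RpcError(Exception):
    """Stand-in for a gRPC call failure carrying a status code."""
    def __init__(self, code):
        super().__init__(f"rpc failed with code {code}")
        self.code = code

def backup(driver, snapshots):
    """Try a differential backup via the snapshot metadata service; fall
    back to reading the full snapshot when the driver lacks CBT support."""
    try:
        return ("differential", driver.get_metadata_delta(snapshots))
    except RpcError as e:
        if e.code == GRPC_UNIMPLEMENTED:
            return ("full", driver.read_entire_snapshot(snapshots[-1]))
        raise  # any other failure is propagated to the caller

# Hypothetical driver without CBT support, for illustration.
class OldDriver:
    def get_metadata_delta(self, snapshots):
        raise RpcError(GRPC_UNIMPLEMENTED)
    def read_entire_snapshot(self, snapshot):
        return f"all-blocks-of-{snapshot}"
```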
The snapshot metadata components support the v1 snapshot objects (VolumeSnapshot, VolumeSnapshotContent, VolumeSnapshotClass etc.), starting with snapshot controller v7.0. There are no plans to support earlier beta and alpha versions of the snapshot objects.
This enhancement focuses only on the control plane with no effects on the kubelet.
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name:
  - Components depending on the feature gate:
- Other
  - Describe the mechanism: The new components will be implemented as part of the out-of-tree CSI framework. Storage providers can embed the sidecar component in their CSI drivers, if they choose to support this feature.
- Will enabling / disabling the feature require downtime of the control plane? No.
- Will enabling / disabling the feature require downtime or reprovisioning of a node? No.
No.
The feature can be disabled by removing the sidecar from the CSI driver.
No effects.
No.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details:
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
- API call type (e.g. PATCH pods):
- GET VolumeSnapshot, VolumeSnapshotContent, VolumeSnapshotClass, SnapshotMetadataService
- CREATE TokenRequest, TokenReview, SubjectAccessReview, SnapshotMetadataService
- estimated throughput: one object of each kind, per request
- originating component(s) (e.g. Kubelet, Feature-X-controller):
- user's backup application
- snapshot metadata sidecar
- components listing and/or watching resources they didn't before: N/A
- API calls that may be triggered by changes of some Kubernetes resources (e.g. update of object X triggers new updates of object Y): N/A. All API calls are initiated by the user's backup application
- periodic API calls to reconcile state (e.g. periodic fetching state, heartbeats, leader election, etc.): N/A
- API type: SnapshotMetadataService
- Supported number of objects per cluster: One object for every CSI driver that supports CBT
- Supported number of objects per namespace (for namespace-scoped objects): N/A
The CSI driver plugin will call the cloud provider's CBT API to retrieve the CBT snapshot metadata.
Existing API objects will not be affected.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
The aggregated API server solution described in #3367 was deemed unsuitable because of the potentially large amount of CBT payloads that will be proxied through the K8s API server. Further discussion can be found in this thread.
An approach based on using a volume populator to store the CBT payloads on intermediary storage, instead of sending them over the network, was also considered. However, the amount of pod creation/deletion churn and the latency incurred made this solution inappropriate.
The previous design, which involved generating and returning a RESTful callback endpoint to the caller to serve CBT payloads, was superseded by the aggregation extension mechanism described in #3367, due to the requirement for more structured request and response payloads.