
Continue thick-provisioning of RBD images on CreateVolume restart #2101

Merged
merged 2 commits on Jun 1, 2021

Conversation

nixpanic
Member

Describe what this PR does

In case CreateVolume gets interrupted (by a provisioner pod restart, or similar), the slow thick-provisioning is aborted. The RBD image will then not have the thick-provisioned metadata set. A restarted CreateVolume will detect the existing RBD image and return the volume, even though it is not completely thick-provisioned.

By validating that the CreateVolume request intends to thick-provision the volume, it is possible to detect that the thick-provisioned metadata should be present on the RBD image. When the metadata is missing, allocating the image is started again.

Is there anything that requires special attention

If the RBD image has watchers (isInUse() returns true), assume that allocation of the image is still in progress. Because thick-provisioning is slow, Kubernetes tends to time out the initial CreateVolume request(s). It is expected that the "not going to restart thick-provisioning of in-use image" error is reported when Kubernetes retries the CreateVolume request.
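
For reference, a minimal sketch of the decision described above, written against an illustrative interface (isThickProvisioned, allocate and the error value are placeholder names, not necessarily the actual ceph-csi rbdVolume API; only isInUse appears in this PR):

package sketch

import (
	"errors"
	"fmt"
)

// thickVolume lists just the operations the repair flow needs; the method
// names are illustrative.
type thickVolume interface {
	isThickProvisioned() (bool, error) // is the thick-provisioned metadata set?
	isInUse() (bool, error)            // does the image have watchers?
	allocate(offset uint64) error      // write zeros from offset to the end
}

var errInUse = errors.New("not going to restart thick-provisioning of in-use image")

// repairThickProvision re-runs the allocation when the metadata that marks a
// completed thick-provisioning run is missing.
func repairThickProvision(rv thickVolume) error {
	done, err := rv.isThickProvisioned()
	if err != nil {
		return fmt.Errorf("failed to check thick-provision metadata: %w", err)
	}
	if done {
		// allocation finished earlier, nothing to repair
		return nil
	}

	inUse, err := rv.isInUse()
	if err != nil {
		return fmt.Errorf("failed to get users of the image: %w", err)
	}
	if inUse {
		// a watcher suggests another provisioner is still zeroing the image,
		// so let the CO retry later instead of zeroing concurrently
		return errInUse
	}

	// no watchers and no completion marker: start zeroing from the beginning
	return rv.allocate(0)
}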

Related issues

Fixes: https://bugzilla.redhat.com/1961647


Show available bot commands

These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:

  • /retest ci/centos/<job-name>: retest the <job-name> after unrelated
    failure (please report the failure too!)
  • /retest all: run this in case the CentOS CI failed to start/report any test
    progress or results

@nixpanic nixpanic added the bug, component/rbd and backport-to-release-v3.3 labels May 26, 2021
@nixpanic nixpanic requested review from humblec and Madhu-1 May 26, 2021 09:41
@nixpanic nixpanic added this to the release-3.4.0 milestone May 26, 2021
Collaborator

@Madhu-1 Madhu-1 left a comment

I know it's tricky, but do you have any plan to add E2E for this case?

background, err := rv.isInUse()
if err != nil {
	return fmt.Errorf("failed to get users of %q: %w", rv, err)
} else if background {
Collaborator

Do we need to return the ABORT error message in this case? Also, it's tricky that we don't know who is holding the watcher on this image.

Member Author

I thought about that, and initially had the abort here. But I think it needs to be retried, as watchers may have expired (the provisioner lost its lease and another one took over?).

Collaborator

The watcher has a timeout of around 5 minutes (I don't recall the exact time). Will this delay the PVC creation by the watcher time? And if the contents are not zeroed yet, doesn't re-doing the whole process get added as overhead?

Member Author

It all depends on how the restarting of the provisioner is done, I think. On a normal restart (like an upgrade), the Pod gets terminated and librbd will (force) close the connection. I do not expect any real delays because of that. In case the Pod/node becomes unreachable by Kubernetes, the allocation of the image can possibly continue on the non-leading provisioner. Retries on the (new) leading provisioner will assume that provisioning is continuing, which may be correct. If the non-leading provisioner is not doing anything anymore and the librbd connection was not closed, eventually there will be a timeout and re-provisioning will be done.

Member Author

In my testing, after deleting the leading provisioner pod, the timeout before the watcher is removed is less than 5 minutes. Provisioning is still slow (writing many GB of data), but it continues once there are no more watchers on the image.

Member Author

@nixpanic nixpanic May 27, 2021

For an idea of the timelines, these are logs of the new leading provisioner, starting at the 1st (restarted) CreateVolume request:

I0527 08:36:47.637982       1 utils.go:163] ID: 44 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 GRPC call: /csi.v1.Controller/CreateVolume
....
I0527 08:36:47.717254       1 rbd_journal.go:330] ID: 44 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 found existing volume (0001-0011-openshift-storage-0000000000000001-9e38478c-bec6-11eb-933f-0a580a80023c) with image name (csi-vol-9e38478c-bec6-11eb-933f-0a580a80023c) for request (pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7)
E0527 08:36:47.825434       1 utils.go:172] ID: 44 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 GRPC error: rpc error: code = Internal desc = not going to restart thick-provisioning of in-use image "ocs-storagecluster-cephblockpool/csi-vol-9e38478c-bec6-11eb-933f-0a580a80023c"
...
I0527 08:37:03.891402       1 utils.go:163] ID: 49 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 GRPC call: /csi.v1.Controller/CreateVolume
...
I0527 08:37:03.965553       1 rbd_journal.go:330] ID: 49 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 found existing volume (0001-0011-openshift-storage-0000000000000001-9e38478c-bec6-11eb-933f-0a580a80023c) with image name (csi-vol-9e38478c-bec6-11eb-933f-0a580a80023c) for request (pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7)
...
... <gRPC ID 49 does not return yet, allocating continues>
...
I0527 08:39:49.898106       1 utils.go:163] ID: 53 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 GRPC call: /csi.v1.Controller/CreateVolume
...
E0527 08:39:49.898473       1 controllerserver.go:234] ID: 53 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 an operation with the given Volume ID pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 already exists
E0527 08:39:49.898505       1 utils.go:172] ID: 53 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 already exists
I0527 08:40:21.907534       1 utils.go:163] ID: 54 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 GRPC call: /csi.v1.Controller/CreateVolume
...
E0527 08:40:21.908003       1 controllerserver.go:234] ID: 54 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 an operation with the given Volume ID pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 already exists
...

All of the above messages are expected; allocating the image is still in progress.

@nixpanic nixpanic requested a review from Madhu-1 May 26, 2021 11:40
@nixpanic nixpanic marked this pull request as draft May 26, 2021 14:17
@nixpanic
Member Author

I know it's tricky, but do you have any plan to add E2E for this case?

I am not sure how to make this a reliable test, ideas welcome.

@nixpanic nixpanic force-pushed the rbd/thick-provisioning/restart branch from 31dbacc to b05ce42 Compare May 27, 2021 08:44
@nixpanic nixpanic marked this pull request as ready for review May 27, 2021 08:44
return nil
}

// in case there are watchers, assume allocating is still happening in
Collaborator

nit: s/allocating/allocation
btw, if there are no watchers on the image, how can the allocation still happen @nixpanic? @idryomov do we expect such conditions at any stage?

Member Author

"allocating", because the action is in progress?

The leader provisioner Pod could have lost its lease for some reason, and the previously leading Pod may still be working on it.

Collaborator

@nixpanic I think the point here is mainly this: how could we end up with more than one watcher, and if we land in that situation, how would we identify what caused it, to validate whether the ongoing provisioning is a genuine attempt or a leftover from a stale request? If a disturbance like leader re-election causes it (I doubt that could leave such a state), or if it comes from a stale/unmapped request from the CO, the operation may succeed in the background, but we are going to retry the same work in the next request with the new leader or whoever called, so it becomes an overlap or unwanted second attempt. Suppose the first request caused the watcher to be there, but it is no longer functioning; does this watcher get removed automatically after some time? If not, we cannot operate on this volume anymore, and that would also leave a stale volume in the backend since we are not doing any cleanup here. I am just trying to understand the workflow in a bad situation, however I think it is good to run through it so that we are better prepared.

Member Author

Yes, the watcher gets removed automatically. See the log snippet I added in #2101 (comment), where the 2nd provisioner restarts the CreateVolume request and re-allocates the RBD image. Initially the watch is still present and allocating aborts, but the retries for CreateVolume continue shortly after.

Contributor

@idryomov idryomov May 29, 2021

An inactive watch should be cleaned up automatically after 30 seconds.
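
As an aside, the watcher list that drives isInUse() can be inspected with go-ceph. A minimal sketch, assuming the go-ceph rados/rbd bindings (pool and image names are placeholders; this is not code from this PR):

package sketch

import (
	"fmt"

	"github.com/ceph/go-ceph/rados"
	"github.com/ceph/go-ceph/rbd"
)

// hasWatchers reports whether any client currently holds a watch on the image.
func hasWatchers(conn *rados.Conn, pool, image string) (bool, error) {
	ioctx, err := conn.OpenIOContext(pool)
	if err != nil {
		return false, fmt.Errorf("failed to open pool %q: %w", pool, err)
	}
	defer ioctx.Destroy()

	img, err := rbd.OpenImageReadOnly(ioctx, image, rbd.NoSnapshot)
	if err != nil {
		return false, fmt.Errorf("failed to open image %q: %w", image, err)
	}
	defer img.Close()

	watchers, err := img.ListWatchers()
	if err != nil {
		return false, fmt.Errorf("failed to list watchers of %q: %w", image, err)
	}

	// note: the absence of a watch is only a best-effort signal, as the
	// discussion below points out -- a client can still have the image open
	return len(watchers) > 0, nil
}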

// RepairThickProvision writes zero bytes to the volume so that it will be
// completely allocated. In case the volume is already marked as
// thick-provisioned, nothing will be done.
func (rv *rbdVolume) RepairThickProvision() error {
Collaborator

afaict, this function does the complete process again, i.e. the writing of zero bytes from the start; please correct me if I am wrong here @nixpanic. If the CO request was aborted due to a timeout in the first or previous cycles, subsequent calls are mostly going to hit the same result, so in short we end up looping the requests. Rather, I was thinking: is there a way to know the progress of the previous attempt and maybe proceed from there? If yes, it could help us speed up the operation and avoid landing in the same situation again and again @nixpanic

Member Author

Yes, if the process keeps getting interrupted, there will be little progress made. We might be able to improve things by continuing where the previously interrupted allocation was aborted. There are ways to track progress, and it is possible to detect what parts of an image are allocated. However, that is not trivial and can be done as an enhancement later.

Remember that restarting the allocation is only done when the provisioner was restarted or another fatal error occurred. Optimizing behaviour for error cases is not a high priority.
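
As a rough illustration of "detect what parts of an image are allocated", a hedged sketch assuming go-ceph's DiffIterate API (not part of this PR, and only an approximation of how far zeroing got):

package sketch

import (
	"fmt"

	"github.com/ceph/go-ceph/rbd"
)

// allocatedBytes sums the extents of the image that currently contain data,
// which gives a rough lower bound on how far a previous zeroing run got.
func allocatedBytes(img *rbd.Image) (uint64, error) {
	size, err := img.GetSize()
	if err != nil {
		return 0, fmt.Errorf("failed to get image size: %w", err)
	}

	var allocated uint64
	err = img.DiffIterate(rbd.DiffIterateConfig{
		Offset: 0,
		Length: size,
		Callback: func(offset, length uint64, exists int, _ interface{}) int {
			if exists != 0 {
				allocated += length
			}
			return 0 // continue iterating
		},
	})
	if err != nil {
		return 0, fmt.Errorf("failed to iterate over image extents: %w", err)
	}
	return allocated, nil
}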

Collaborator

@nixpanic do we need to keep a tracker issue for this?

Collaborator

@Madhu-1 Madhu-1 left a comment

LGTM, just a suggestion left for possible improvement

@nixpanic nixpanic requested a review from humblec May 27, 2021 09:49
Contributor

@idryomov idryomov left a comment

In case the Pod/node becomes unreachable by Kubernetes, the allocation of the image can possibly continue on the non-leading provisioner. Retries on the (new) leading provisioner will assume that provisioning is continuing, which may be correct. If the non-leading provisioner is not doing anything anymore and the librbd connection was not closed, eventually there will be a timeout and re-provisioning will be done.

I'm a little scared by the mention of non-leading provisioner(s) and the general reliance on watches for checking whether the image is in use. I have raised this in some fencing-related issues and PRs before, but once again: while the presence of a watch can be used to tell whether the image is in use (a best-effort check), the reverse is not true. The lack of a watch is not a reliable indicator of pretty much anything -- the image may still be opened and written to!

In this case, what would prevent the non-leading provisioner that is not reachable by k8s from waking up and continuing its zeroing after the image has been (re)zeroed by the now leading provisioner and handed off to the user? This seems like a potential data corruption scenario to me.

@nixpanic
Member Author

@idryomov: when a watch is lost (initially added once the image was opened), can the opened image still be used for writing? I expected that the rbd-image-handle would become invalid, and that writing would return a failure.

If there is a better recommendation than checking watches on RBD images, then we can surely implement it that way. Until there is a usable alternative, we probably will need to stick with the watchers for now.

@idryomov
Contributor

@idryomov: when a watch is lost (initially added once the image was opened), can the opened image still be used for writing? I expected that the rbd-image-handle would become invalid, and that writing would return a failure.

Yes. A lost watch would be reestablished in due course. Also, the client may not know for a while that the watch is lost, etc.

If there is a better recommendation than checking watches on RBD images, then we can surely implement it that way. Until there is a usable alternative, we probably will need to stick with the watchers for now.

Ultimately it has to be something that would fence other potential writers at the Ceph level (ceph osd blocklist) before taking over.

It seems to be another instance of #578 and similar issues. Whereas before the only concern was the entity that mapped the image (since it was the only potential writer in the system), now that the provisioner can also write to the image it needs to be treated as a potential writer and be subjected to the same safeguards.
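
For illustration only, fencing a potential writer could look roughly like the sketch below, which sends a mon command through go-ceph. The exact command spelling is an assumption and release-dependent ("osd blocklist" on Pacific and later, "osd blacklist"/"blacklistop" on older releases), and this is not something the PR implements:

package sketch

import (
	"encoding/json"
	"fmt"

	"github.com/ceph/go-ceph/rados"
)

// blocklistClient asks the monitors to blocklist a client address so that it
// can no longer write to any RBD image.
func blocklistClient(conn *rados.Conn, clientAddr string) error {
	cmd, err := json.Marshal(map[string]string{
		"prefix":      "osd blocklist", // assumption: Pacific-and-later command name
		"blocklistop": "add",
		"addr":        clientAddr,
	})
	if err != nil {
		return fmt.Errorf("failed to marshal mon command: %w", err)
	}

	_, status, err := conn.MonCommand(cmd)
	if err != nil {
		return fmt.Errorf("mon command failed (%s): %w", status, err)
	}
	return nil
}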

@humblec
Collaborator

humblec commented May 31, 2021

Ultimately it has to be something that would fence other potential writers at the Ceph level (ceph osd blocklist) before taking over.

It seems to be another instance of #578 and similar issues. Whereas before the only concern was the entity that mapped the image (since it was the only potential writer in the system), now that the provisioner can also write to the image it needs to be treated as a potential writer and be subjected to the same safeguards.

Sure, ultimately fencing could help us here. Till we have such a mechanism, I think we have to survive with watchers. Considering this PR improves the thick-provisioning situation to a good extent, it looks good to get this in and implement more safeguards accordingly in the near future.
LGTM.

@idryomov do you have final comments on this PR? If not, we will go ahead and merge this 👍. Thanks for reviewing!

@idryomov
Contributor

Nothing substantial.

Honestly, the entire scheme with k8s timing out create requests one after another and firing retries at random provisioner instances while zeroing is in progress seems very clumsy to me. I just hope that thick provisioning isn't going to be widely used because otherwise the current implementation will bite pretty soon.

@humblec
Collaborator

humblec commented Jun 1, 2021

Nothing substantial.

Honestly, the entire scheme with k8s timing out create requests one after another and firing retries at random provisioner instances while zeroing is in progress seems very clumsy to me.

I was wondering if we could resume from where the previous attempt halted (if we can get the progress tracking in place: #2101 (comment)); that could reduce this problem to an extent. @idryomov is this something we can get supported/improved from the Ceph side, especially for operations like --thick-provision, etc.?

I just hope that thick provisioning isn't going to be widely used because otherwise the current implementation will bite pretty soon.

As per current reports from the field, it looks like this is going to be used to a heavy extent. It is becoming the default choice for VM use cases.

@idryomov
Contributor

idryomov commented Jun 1, 2021

I was wondering if we could resume from where the previous attempt halted (if we can get the progress tracking in place: #2101 (comment)); that could reduce this problem to an extent. @idryomov is this something we can get supported/improved from the Ceph side, especially for operations like --thick-provision, etc.?

Theoretically you could infer where zeroing got interrupted by looking at the provisioned size (i.e. how many rbd_data.<image id> objects are there) or store a marker alongside your thick-provisioned metadata after every X gigabytes are zeroed. But this is just a potential optimization, not a solution for the data corruption scenario.
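
A hedged sketch of the marker idea, assuming go-ceph's image metadata and I/O helpers (the metadata key, the chunked Write-based zeroing and the function name are illustrative, not what ceph-csi does today):

package sketch

import (
	"fmt"
	"strconv"

	"github.com/ceph/go-ceph/rbd"
)

// progressKey is an illustrative metadata key for the checkpoint marker.
const progressKey = "example.thick-provision-progress"

// zeroWithCheckpoints writes zeros over the whole image and records the
// current offset in the image metadata after every chunk, so an interrupted
// run can resume close to where it stopped.
func zeroWithCheckpoints(img *rbd.Image, chunk uint64) error {
	if chunk == 0 {
		return fmt.Errorf("chunk size must be larger than 0")
	}

	size, err := img.GetSize()
	if err != nil {
		return fmt.Errorf("failed to get image size: %w", err)
	}

	// resume from the last recorded checkpoint, if any
	var offset uint64
	if val, err := img.GetMetadata(progressKey); err == nil {
		if parsed, perr := strconv.ParseUint(val, 10, 64); perr == nil {
			offset = parsed
		}
	}

	zeros := make([]byte, chunk)
	for offset < size {
		if size-offset < chunk {
			zeros = zeros[:size-offset]
		}
		if _, err := img.Seek(int64(offset), rbd.SeekSet); err != nil {
			return fmt.Errorf("failed to seek to %d: %w", offset, err)
		}
		if _, err := img.Write(zeros); err != nil {
			return fmt.Errorf("failed to write zeros at offset %d: %w", offset, err)
		}
		offset += uint64(len(zeros))

		// checkpoint after every chunk
		if err := img.SetMetadata(progressKey, strconv.FormatUint(offset, 10)); err != nil {
			return fmt.Errorf("failed to store progress marker: %w", err)
		}
	}
	return img.Flush()
}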

I just hope that thick provisioning isn't going to be widely used because otherwise the current implementation will bite pretty soon.

As per current reports from the field, it looks like this is going to be used to a heavy extent. It is becoming the default choice for VM use cases.

I wonder why? Do you have the details or some links I could peruse?

@humblec
Collaborator

humblec commented Jun 1, 2021

I just hope that thick provisioning isn't going to be widely used because otherwise the current implementation will bite pretty soon.

As per current reports from the field, it looks like this is going to be used to a heavy extent. It is becoming the default choice for VM use cases.

I wonder why? Do you have the details or some links I could peruse?

@oritwas @mykaul @obnoxxx if possible, can you please share the details of the thick-provisioning use cases?

@nixpanic
Member Author

nixpanic commented Jun 1, 2021

As per current reports from the field, it looks like this is going to be used to a heavy extent. It is becoming the default choice for VM use cases.

My impression is that this is mainly provided for users that need more accurate accounting for storage consumption. When an environment is sufficiently planned for the expected workloads, thin-provisioning would surely be the recommendation. It might be difficult to estimate actual storage consumption for VM workloads, but any deployment should make sure that sufficient capacity is available, thick-provisioned or thin.

Thick-provisioning can be used as a safety-net when an environment does not have sufficient storage capacity for the planned workload. In that case thick-provisioning can prevent workloads from starting, and so prevent the storage cluster from running out of space (a state from which recovery is very difficult).

@humblec
Collaborator

humblec commented Jun 1, 2021

As discussed in today's call, let's go ahead with this interim optimization now and catch up on the improvements along the way.

nixpanic added 2 commits June 1, 2021 13:12
Move the repairing of a volume/snapshot from CreateVolume to its own
function. This reduces the complexity of the code, and makes the
procedure easier to understand. Further enhancements to repairing an
existing volume can be done in the new function.

Signed-off-by: Niels de Vos <[email protected]>