
Continue thick-provisioning of RBD images on CreateVolume restart #2101

Merged
merged 2 commits on Jun 1, 2021

Conversation

nixpanic
Member

Describe what this PR does

In case CreateVolume gets interrupted (by a provisioner pod restart, or similar), the slow thick-provisioning is aborted. The RBD image will then not have the thick-provisioned metadata set. A restarted CreateVolume will detect the existing RBD image and return the volume, even though it is not completely thick-provisioned.

By validating that the CreateVolume request intends to thick-provision the volume, it is possible to detect that the thick-provisioned metadata should be present on the RBD image. When the metadata is missing, allocating the image is started again.

Is there anything that requires special attention

If the RBD image has watchers (isInUse() returns true), assume that allocation of the image is still in progress. Because thick-provisioning is slow, Kubernetes tends to time out the initial CreateVolume request(s). It is expected that the "not going to restart thick-provisioning of in-use image" error is reported when Kubernetes retries the CreateVolume request.
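
For reference, a minimal sketch of the decision described above, written against an illustrative interface (isThickProvisioned, allocate and the error value are placeholder names, not necessarily the actual ceph-csi rbdVolume API; only isInUse appears in this PR):

package sketch

import (
	"errors"
	"fmt"
)

// thickVolume lists just the operations the repair flow needs; the method
// names are illustrative.
type thickVolume interface {
	isThickProvisioned() (bool, error) // is the thick-provisioned metadata set?
	isInUse() (bool, error)            // does the image have watchers?
	allocate(offset uint64) error      // write zeros from offset to the end
}

var errInUse = errors.New("not going to restart thick-provisioning of in-use image")

// repairThickProvision re-runs the allocation when the metadata that marks a
// completed thick-provisioning run is missing.
func repairThickProvision(rv thickVolume) error {
	done, err := rv.isThickProvisioned()
	if err != nil {
		return fmt.Errorf("failed to check thick-provision metadata: %w", err)
	}
	if done {
		// allocation finished earlier, nothing to repair
		return nil
	}

	inUse, err := rv.isInUse()
	if err != nil {
		return fmt.Errorf("failed to get users of the image: %w", err)
	}
	if inUse {
		// a watcher suggests another provisioner is still zeroing the image,
		// so let the CO retry later instead of zeroing concurrently
		return errInUse
	}

	// no watchers and no completion marker: start zeroing from the beginning
	return rv.allocate(0)
}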

Related issues

Fixes: https://bugzilla.redhat.com/1961647


Show available bot commands

These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:

  • /retest ci/centos/<job-name>: retest the <job-name> after unrelated
    failure (please report the failure too!)
  • /retest all: run this in case the CentOS CI failed to start/report any test
    progress or results

@nixpanic nixpanic added the bug, component/rbd and backport-to-release-v3.3 labels May 26, 2021
@nixpanic nixpanic requested review from humblec and Madhu-1 May 26, 2021 09:41
@nixpanic nixpanic added this to the release-3.4.0 milestone May 26, 2021
Collaborator

@Madhu-1 Madhu-1 left a comment

I know it's tricky, but do you have any plan to add E2E for this case?

background, err := rv.isInUse()
if err != nil {
	return fmt.Errorf("failed to get users of %q: %w", rv, err)
} else if background {
Collaborator

Do we need to return the ABORT error message in this case? Also, it's tricky that we don't know who is holding the watcher on this image.

Member Author

I thought about that, and initially had the abort here. But I think it needs to be retried, as watchers may have expired (the provisioner lost its lease and another one took over?).

Collaborator

The watcher has a timeout of around 5 minutes (I don't recall the exact time). Will this delay the PVC creation by the watcher time? And if the contents are not zeroed yet, doesn't re-doing the whole process get added as overhead?

Member Author

It all depends on how the restarting of the provisioner is done, I think. On a normal restart (like an upgrade), the Pod gets terminated and librbd will (force) close the connection. I do not expect any real delays because of that. In case the Pod/node becomes unreachable by Kubernetes, the allocation of the image can possibly continue on the non-leading provisioner. Retries on the (new) leading provisioner will assume that provisioning is continuing, which may be correct. If the non-leading provisioner is not doing anything anymore and the librbd connection was not closed, eventually there will be a timeout and re-provisioning will be done.

Member Author

In my testing, after deleting the leading provisioner pod, the timeout before the watcher is removed is less than 5 minutes. Provisioning is still slow (writing many GB of data), but it continues once there are no more watchers on the image.

Member Author

@nixpanic nixpanic May 27, 2021

For an idea of the timelines, these are logs of the new leading provisioner, starting at the 1st (restarted) CreateVolume request:

I0527 08:36:47.637982       1 utils.go:163] ID: 44 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 GRPC call: /csi.v1.Controller/CreateVolume
....
I0527 08:36:47.717254       1 rbd_journal.go:330] ID: 44 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 found existing volume (0001-0011-openshift-storage-0000000000000001-9e38478c-bec6-11eb-933f-0a580a80023c) with image name (csi-vol-9e38478c-bec6-11eb-933f-0a580a80023c) for request (pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7)
E0527 08:36:47.825434       1 utils.go:172] ID: 44 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 GRPC error: rpc error: code = Internal desc = not going to restart thick-provisioning of in-use image "ocs-storagecluster-cephblockpool/csi-vol-9e38478c-bec6-11eb-933f-0a580a80023c"
...
I0527 08:37:03.891402       1 utils.go:163] ID: 49 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 GRPC call: /csi.v1.Controller/CreateVolume
...
I0527 08:37:03.965553       1 rbd_journal.go:330] ID: 49 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 found existing volume (0001-0011-openshift-storage-0000000000000001-9e38478c-bec6-11eb-933f-0a580a80023c) with image name (csi-vol-9e38478c-bec6-11eb-933f-0a580a80023c) for request (pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7)
...
... <gRPC ID 49 does not return yet, allocating continues>
...
I0527 08:39:49.898106       1 utils.go:163] ID: 53 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 GRPC call: /csi.v1.Controller/CreateVolume
...
E0527 08:39:49.898473       1 controllerserver.go:234] ID: 53 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 an operation with the given Volume ID pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 already exists
E0527 08:39:49.898505       1 utils.go:172] ID: 53 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 already exists
I0527 08:40:21.907534       1 utils.go:163] ID: 54 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 GRPC call: /csi.v1.Controller/CreateVolume
...
E0527 08:40:21.908003       1 controllerserver.go:234] ID: 54 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 an operation with the given Volume ID pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 already exists
...

All of the above messages are expected; allocating the image is still in progress.

@nixpanic nixpanic requested a review from Madhu-1 May 26, 2021 11:40
@nixpanic nixpanic marked this pull request as draft May 26, 2021 14:17
@nixpanic
Member Author

I know it's tricky, but do you have any plan to add E2E for this case?

I am not sure how to make this a reliable test, ideas welcome.

@nixpanic nixpanic force-pushed the rbd/thick-provisioning/restart branch from 31dbacc to b05ce42 Compare May 27, 2021 08:44
@nixpanic nixpanic marked this pull request as ready for review May 27, 2021 08:44
return nil
}

// in case there are watchers, assume allocating is still happening in
Collaborator

nit: s/allocating/allocation
btw, if there are no watchers on the image, how can the allocation still happen @nixpanic? @idryomov do we expect such conditions at any stage?

Member Author

"allocating", because the action is in progress?

The leader provisioner Pod could have lost its lease for some reason, and the previously leading Pod may still be working on it.

Collaborator

@nixpanic I think the point here is mainly this: how could we end up with more than one watcher, and if we land in that situation, how would we identify what caused it, to validate whether the ongoing provisioning is a genuine attempt or a leftover from a stale request? If a disturbance like leader re-election causes it (I doubt that could leave such a state), or if it comes from a stale/unmapped request from the CO, the operation may succeed in the background, but we are going to retry the same work in the next request with the new leader or whoever called, so it becomes an overlap or unwanted second attempt. Suppose the first request caused the watcher to be there, but it is no longer functioning; does this watcher get removed automatically after some time? If not, we cannot operate on this volume anymore, and that would also leave a stale volume in the backend since we are not doing any cleanup here. I am just trying to understand the workflow in a bad situation, however I think it is good to run through it so that we are better prepared.

Member Author

Yes, the watcher gets removed automatically. See the log snippet I added in #2101 (comment), where the 2nd provisioner restarts the CreateVolume request and re-allocates the RBD image. Initially the watch is still present and allocating aborts, but the retries for CreateVolume continue shortly after.

Contributor

@idryomov idryomov May 29, 2021

An inactive watch should be cleaned up automatically after 30 seconds.
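
As an aside, the watcher list that drives isInUse() can be inspected with go-ceph. A minimal sketch, assuming the go-ceph rados/rbd bindings (pool and image names are placeholders; this is not code from this PR):

package sketch

import (
	"fmt"

	"github.com/ceph/go-ceph/rados"
	"github.com/ceph/go-ceph/rbd"
)

// hasWatchers reports whether any client currently holds a watch on the image.
func hasWatchers(conn *rados.Conn, pool, image string) (bool, error) {
	ioctx, err := conn.OpenIOContext(pool)
	if err != nil {
		return false, fmt.Errorf("failed to open pool %q: %w", pool, err)
	}
	defer ioctx.Destroy()

	img, err := rbd.OpenImageReadOnly(ioctx, image, rbd.NoSnapshot)
	if err != nil {
		return false, fmt.Errorf("failed to open image %q: %w", image, err)
	}
	defer img.Close()

	watchers, err := img.ListWatchers()
	if err != nil {
		return false, fmt.Errorf("failed to list watchers of %q: %w", image, err)
	}

	// note: the absence of a watch is only a best-effort signal, as the
	// discussion below points out -- a client can still have the image open
	return len(watchers) > 0, nil
}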

// RepairThickProvision writes zero bytes to the volume so that it will be
// completely allocated. In case the volume is already marked as
// thick-provisioned, nothing will be done.
func (rv *rbdVolume) RepairThickProvision() error {
Collaborator

afaict, this function does the complete process again, i.e. the writing of zero bytes from the start; please correct me if I am wrong here @nixpanic. If the CO request was aborted due to a timeout in the first or previous cycles, subsequent calls are mostly going to hit the same result, so in short we end up looping the requests. Rather, I was thinking: is there a way to know the progress of the previous attempt and maybe proceed from there? If yes, it could help us speed up the operation and avoid landing in the same situation again and again @nixpanic

Member Author

Yes, if the process keeps getting interrupted, there will be little progress made. We might be able to improve things by continuing where the previously interrupted allocation was aborted. There are ways to track progress, and it is possible to detect what parts of an image are allocated. However, that is not trivial and can be done as an enhancement later.

Remember that restarting the allocation is only done when the provisioner was restarted or another fatal error occurred. Optimizing behaviour for error cases is not a high priority.
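
As a rough illustration of "detect what parts of an image are allocated", a hedged sketch assuming go-ceph's DiffIterate API (not part of this PR, and only an approximation of how far zeroing got):

package sketch

import (
	"fmt"

	"github.com/ceph/go-ceph/rbd"
)

// allocatedBytes sums the extents of the image that currently contain data,
// which gives a rough lower bound on how far a previous zeroing run got.
func allocatedBytes(img *rbd.Image) (uint64, error) {
	size, err := img.GetSize()
	if err != nil {
		return 0, fmt.Errorf("failed to get image size: %w", err)
	}

	var allocated uint64
	err = img.DiffIterate(rbd.DiffIterateConfig{
		Offset: 0,
		Length: size,
		Callback: func(offset, length uint64, exists int, _ interface{}) int {
			if exists != 0 {
				allocated += length
			}
			return 0 // continue iterating
		},
	})
	if err != nil {
		return 0, fmt.Errorf("failed to iterate over image extents: %w", err)
	}
	return allocated, nil
}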

Collaborator

@nixpanic do we need to keep a tracker issue for this?

Collaborator

@Madhu-1 Madhu-1 left a comment

LGTM, just a suggestion left for possible improvement

@nixpanic nixpanic requested a review from humblec May 27, 2021 09:49
Contributor

@idryomov idryomov left a comment

In case the Pod/node becomes unreachable by Kubernetes, the allocation of the image can possibly continue on the non-leading provisioner. Retries on the (new) leading provisioner will assume that provisioning is continuing, which may be correct. If the non-leading provisioner is not doing anything anymore and the librbd connection was not closed, eventually there will be a timeout and re-provisioning will be done.

I'm a little scared by the mention of non-leading provisioner(s) and the general reliance on watches for checking whether the image is in use. I have raised this in some fencing-related issues and PRs before, but once again: while the presence of a watch can be used to tell whether the image is in use (a best-effort check), the reverse is not true. The lack of a watch is not a reliable indicator of pretty much anything -- the image may still be opened and written to!

In this case, what would prevent the non-leading provisioner that is not reachable by k8s from waking up and continuing its zeroing after the image has been (re)zeroed by the now leading provisioner and handed off to the user? This seems like a potential data corruption scenario to me.

@nixpanic
Member Author

@idryomov: when a watch is lost (initially added once the image was opened), can the opened image still be used for writing? I expected that the rbd-image-handle would become invalid, and that writing would return a failure.

If there is a better recommendation than checking watches on RBD images, then we can surely implement it that way. Until there is a usable alternative, we probably will need to stick with the watchers for now.

@idryomov
Contributor

@idryomov: when a watch is lost (initially added once the image was opened), can the opened image still be used for writing? I expected that the rbd-image-handle would become invalid, and that writing would return a failure.

Yes. A lost watch would be reestablished in due course. Also, the client may not know for a while that the watch is lost, etc.

If there is a better recommendation than checking watches on RBD images, then we can surely implement it that way. Until there is a usable alternative, we probably will need to stick with the watchers for now.

Ultimately it has to be something that would fence other potential writers at the Ceph level (ceph osd blocklist) before taking over.

It seems to be another instance of #578 and similar issues. Whereas before the only concern was the entity that mapped the image (since it was the only potential writer in the system), now that the provisioner can also write to the image it needs to be treated as a potential writer and be subjected to the same safeguards.
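
For illustration only, fencing a potential writer could look roughly like the sketch below, which sends a mon command through go-ceph. The exact command spelling is an assumption and release-dependent ("osd blocklist" on Pacific and later, "osd blacklist"/"blacklistop" on older releases), and this is not something the PR implements:

package sketch

import (
	"encoding/json"
	"fmt"

	"github.com/ceph/go-ceph/rados"
)

// blocklistClient asks the monitors to blocklist a client address so that it
// can no longer write to any RBD image.
func blocklistClient(conn *rados.Conn, clientAddr string) error {
	cmd, err := json.Marshal(map[string]string{
		"prefix":      "osd blocklist", // assumption: Pacific-and-later command name
		"blocklistop": "add",
		"addr":        clientAddr,
	})
	if err != nil {
		return fmt.Errorf("failed to marshal mon command: %w", err)
	}

	_, status, err := conn.MonCommand(cmd)
	if err != nil {
		return fmt.Errorf("mon command failed (%s): %w", status, err)
	}
	return nil
}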

@humblec
Collaborator

humblec commented May 31, 2021

Ultimately it has to be something that would fence other potential writers at the Ceph level (ceph osd blocklist) before taking over.

It seems to be another instance of #578 and similar issues. Whereas before the only concern was the entity that mapped the image (since it was the only potential writer in the system), now that the provisioner can also write to the image it needs to be treated as a potential writer and be subjected to the same safeguards.

Sure, ultimately fencing could help us here. Till we have such a mechanism, I think we have to survive with watchers. Considering this PR improves the thick-provisioning situation to a good extent, it looks good to get this in and implement more safeguards accordingly in the near future.
LGTM.

@idryomov do you have final comments on this PR? If not, we will go ahead and merge this 👍. Thanks for reviewing!

@idryomov
Contributor

Nothing substantial.

Honestly, the entire scheme with k8s timing out create requests one after another and firing retries at random provisioner instances while zeroing is in progress seems very clumsy to me. I just hope that thick provisioning isn't going to be widely used because otherwise the current implementation will bite pretty soon.

@humblec
Collaborator

humblec commented Jun 1, 2021

Nothing substantial.

Honestly, the entire scheme with k8s timing out create requests one after another and firing retries at random provisioner instances while zeroing is in progress seems very clumsy to me.

I was wondering if we could resume from where the previous attempt halted (if we can get the progress tracking in place: #2101 (comment)); that could reduce this problem to an extent. @idryomov is this something we can get supported/improved from the Ceph side, especially for operations like --thick-provision, etc.?

I just hope that thick provisioning isn't going to be widely used because otherwise the current implementation will bite pretty soon.

As per current reports from the field, it looks like this is going to be used to a heavy extent. It is becoming the default choice for VM use cases.

@idryomov
Contributor

idryomov commented Jun 1, 2021

I was wondering if we could resume from where the previous attempt halted (if we can get the progress tracking in place: #2101 (comment)); that could reduce this problem to an extent. @idryomov is this something we can get supported/improved from the Ceph side, especially for operations like --thick-provision, etc.?

Theoretically you could infer where zeroing got interrupted by looking at the provisioned size (i.e. how many rbd_data.<image id> objects are there) or store a marker alongside your thick-provisioned metadata after every X gigabytes are zeroed. But this is just a potential optimization, not a solution for the data corruption scenario.
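
A hedged sketch of the marker idea, assuming go-ceph's image metadata and I/O helpers (the metadata key, the chunked Write-based zeroing and the function name are illustrative, not what ceph-csi does today):

package sketch

import (
	"fmt"
	"strconv"

	"github.com/ceph/go-ceph/rbd"
)

// progressKey is an illustrative metadata key for the checkpoint marker.
const progressKey = "example.thick-provision-progress"

// zeroWithCheckpoints writes zeros over the whole image and records the
// current offset in the image metadata after every chunk, so an interrupted
// run can resume close to where it stopped.
func zeroWithCheckpoints(img *rbd.Image, chunk uint64) error {
	if chunk == 0 {
		return fmt.Errorf("chunk size must be larger than 0")
	}

	size, err := img.GetSize()
	if err != nil {
		return fmt.Errorf("failed to get image size: %w", err)
	}

	// resume from the last recorded checkpoint, if any
	var offset uint64
	if val, err := img.GetMetadata(progressKey); err == nil {
		if parsed, perr := strconv.ParseUint(val, 10, 64); perr == nil {
			offset = parsed
		}
	}

	zeros := make([]byte, chunk)
	for offset < size {
		if size-offset < chunk {
			zeros = zeros[:size-offset]
		}
		if _, err := img.Seek(int64(offset), rbd.SeekSet); err != nil {
			return fmt.Errorf("failed to seek to %d: %w", offset, err)
		}
		if _, err := img.Write(zeros); err != nil {
			return fmt.Errorf("failed to write zeros at offset %d: %w", offset, err)
		}
		offset += uint64(len(zeros))

		// checkpoint after every chunk
		if err := img.SetMetadata(progressKey, strconv.FormatUint(offset, 10)); err != nil {
			return fmt.Errorf("failed to store progress marker: %w", err)
		}
	}
	return img.Flush()
}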

I just hope that thick provisioning isn't going to be widely used because otherwise the current implementation will bite pretty soon.

As per current reports from the field, it looks like this is going to be used to a heavy extent. It is becoming the default choice for VM use cases.

I wonder why? Do you have the details or some links I could peruse?

@humblec
Collaborator

humblec commented Jun 1, 2021

I just hope that thick provisioning isn't going to be widely used because otherwise the current implementation will bite pretty soon.

As per current reports from the field, it looks like this is going to be used to a heavy extent. It is becoming the default choice for VM use cases.

I wonder why? Do you have the details or some links I could peruse?

@oritwas @mykaul @obnoxxx if possible, can you please share the details of the thick-provisioning use cases?

@nixpanic
Member Author

nixpanic commented Jun 1, 2021

As per current reports from the field, it looks like this is going to be used to a heavy extent. It is becoming the default choice for VM use cases.

My impression is that this is mainly provided for users that need more accurate accounting for storage consumption. When an environment is sufficiently planned for the expected workloads, thin-provisioning would surely be the recommendation. It might be difficult to estimate actual storage consumption for VM workloads, but any deployment should make sure that sufficient capacity is available, thick-provisioned or thin.

Thick-provisioning can be used as a safety-net when an environment does not have sufficient storage capacity for the planned workload. In that case thick-provisioning can prevent workloads from starting, and so prevent the storage cluster from running out of space (a state from which recovery is very difficult).

@humblec
Collaborator

humblec commented Jun 1, 2021

As discussed in today's call, let's go ahead with this interim optimization now and catch up on the improvements along the way.

nixpanic added 2 commits June 1, 2021 13:12
Move the repairing of a volume/snapshot from CreateVolume to its own
function. This reduces the complexity of the code, and makes the
procedure easier to understand. Further enhancements to repairing an
existing volume can be done in the new function.

Signed-off-by: Niels de Vos <[email protected]>