Continue thick-provisioning of RBD images on CreateVolume restart #2101
Conversation
I know it's tricky, but do you have any plan to add an E2E test for this case?
background, err := rv.isInUse()
if err != nil {
	return fmt.Errorf("failed to get users of %q: %w", rv, err)
} else if background {
Do we need to return the ABORT error message in this case? Also, it's tricky that we don't know who is holding the watcher on this image.
I thought about that, and had the abort earlier. But I think it needs to be retried, as watchers may expire (the provisioner lost its lease and another one took over?).
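For illustration, a minimal sketch of the retry-friendly guard being discussed, using a hypothetical interface in place of the rbdVolume type from this PR; the error text matches the gRPC Internal message visible in the logs further down, but the function and interface names are made up for this example:

```go
package sketch

import "fmt"

// inUseChecker is a stand-in for the rbdVolume type in this PR; only the
// watcher check matters for this illustration.
type inUseChecker interface {
	fmt.Stringer
	isInUse() (bool, error)
}

// repairGuard returns a retryable error while the image still has watchers:
// a plain error (not codes.Aborted), so the CO keeps retrying CreateVolume
// until a stale watch expires and the allocation can be restarted.
func repairGuard(rv inUseChecker) error {
	background, err := rv.isInUse()
	if err != nil {
		return fmt.Errorf("failed to get users of %q: %w", rv, err)
	}
	if background {
		return fmt.Errorf("not going to restart thick-provisioning of in-use image %q", rv)
	}
	return nil
}
```

As the logs below show, this surfaces as a gRPC Internal error, which the external-provisioner keeps retrying until the stale watch has expired.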
The watcher has a timeout of around 5 minutes (I don't recall the exact time). Will this delay the PVC creation by the additional watcher timeout? And if the contents are not zeroed yet, does the whole zeroing process get added on top of that as overhead?
It all depends on how the restart of the provisioner is done, I think. On a normal restart (like an upgrade), the Pod gets terminated and librbd will (force) close the connection. I do not expect any real delays because of that. In case the Pod/node becomes unreachable by Kubernetes, the allocating of the image can possibly continue on the non-leading provisioner. Retries on the (new) leading provisioner will assume that provisioning is continuing, which may be correct. If the non-leading provisioner is not doing anything anymore and the librbd connection was not closed, eventually there will be a timeout and re-provisioning will be done.
In my testing, after deleting the leading provisioner pod, the timeout before the watcher is removed is less than 5 minutes. Provisioning is still slow (writing many GB of data), but it continues once there are no more watchers on the image.
For an idea of the timelines, these are logs of the new leading provisioner, starting at the 1st (restarted) CreateVolume request:
I0527 08:36:47.637982 1 utils.go:163] ID: 44 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 GRPC call: /csi.v1.Controller/CreateVolume
....
I0527 08:36:47.717254 1 rbd_journal.go:330] ID: 44 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 found existing volume (0001-0011-openshift-storage-0000000000000001-9e38478c-bec6-11eb-933f-0a580a80023c) with image name (csi-vol-9e38478c-bec6-11eb-933f-0a580a80023c) for request (pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7)
E0527 08:36:47.825434 1 utils.go:172] ID: 44 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 GRPC error: rpc error: code = Internal desc = not going to restart thick-provisioning of in-use image "ocs-storagecluster-cephblockpool/csi-vol-9e38478c-bec6-11eb-933f-0a580a80023c"
...
I0527 08:37:03.891402 1 utils.go:163] ID: 49 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 GRPC call: /csi.v1.Controller/CreateVolume
...
I0527 08:37:03.965553 1 rbd_journal.go:330] ID: 49 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 found existing volume (0001-0011-openshift-storage-0000000000000001-9e38478c-bec6-11eb-933f-0a580a80023c) with image name (csi-vol-9e38478c-bec6-11eb-933f-0a580a80023c) for request (pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7)
...
... <gRPC ID 49 does not return yet, allocating continues>
...
I0527 08:39:49.898106 1 utils.go:163] ID: 53 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 GRPC call: /csi.v1.Controller/CreateVolume
...
E0527 08:39:49.898473 1 controllerserver.go:234] ID: 53 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 an operation with the given Volume ID pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 already exists
E0527 08:39:49.898505 1 utils.go:172] ID: 53 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 already exists
I0527 08:40:21.907534 1 utils.go:163] ID: 54 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 GRPC call: /csi.v1.Controller/CreateVolume
...
E0527 08:40:21.908003 1 controllerserver.go:234] ID: 54 Req-ID: pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 an operation with the given Volume ID pvc-af7278e3-b78d-4ff2-b0a3-028358b465b7 already exists
...
All the above messages are expected, allocating the image is still in progress.
I am not sure how to make this a reliable test, ideas welcome.
Force-pushed from 31dbacc to b05ce42
	return nil
}

// in case there are watchers, assume allocating is still happening in
allocating - the action is in progress?
The leader provisioner Pod could have lost its lease for some reason, and the previously leading Pod may still be working on it.
@nixpanic I think the point here is mainly this: how could we land (or how did we land) on more than one watcher, and if we get into that situation, how would we identify what caused it, to validate whether the ongoing provisioning is a genuine attempt or a leftover from a stale request? If disturbances like leader re-election cause it (I doubt that could leave such a state), or if it comes from a stale/unmapped request from the CO, the operation may succeed in the background, but we are going to retry the same work in the next request with the new leader or whoever called, so it becomes an overlap or an unwanted second attempt. Suppose the first request caused the watcher to be there, but it is no longer functioning: does this watcher get removed automatically after some time? If not, we can't operate on this volume any more, and that would also leave a stale volume in the backend as we are not doing any cleanup here. I am just trying to understand the workflow in a bad situation, but I think it is good to run through it so that we are better prepared.
Yes, the watcher gets removed automatically. See the log snippet I added for #2101 (comment) where the 2nd provisioner restarts the CreateVolume request and re-allocates the RBD image. Initially the watch is still present and allocating aborts, but the retries for CreateVolume continue shortly after.
An inactive watch should be cleaned up automatically after 30 seconds.
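As background on the watch checks, here is a minimal, hedged sketch of listing the watchers on an RBD image with go-ceph. It is not the isInUse() implementation from this repository; the image name is a placeholder and error handling is kept trivial:

```go
package main

import (
	"fmt"
	"log"

	"github.com/ceph/go-ceph/rados"
	"github.com/ceph/go-ceph/rbd"
)

func main() {
	conn, err := rados.NewConn()
	if err != nil {
		log.Fatal(err)
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		log.Fatal(err)
	}
	if err := conn.Connect(); err != nil {
		log.Fatal(err)
	}
	defer conn.Shutdown()

	// pool name taken from the logs above, image name is a placeholder
	ioctx, err := conn.OpenIOContext("ocs-storagecluster-cephblockpool")
	if err != nil {
		log.Fatal(err)
	}
	defer ioctx.Destroy()

	img, err := rbd.OpenImage(ioctx, "csi-vol-example", rbd.NoSnapshot)
	if err != nil {
		log.Fatal(err)
	}
	defer img.Close()

	// a non-empty list means some client (kernel rbd, librbd, ...) still
	// has the image open; an inactive watch expires on its own
	watchers, err := img.ListWatchers()
	if err != nil {
		log.Fatal(err)
	}
	for _, w := range watchers {
		fmt.Printf("watcher: addr=%s id=%d cookie=%d\n", w.Addr, w.Id, w.Cookie)
	}
}
```

As discussed above, the presence of a watch is only a best-effort signal; the absence of one does not guarantee that nobody is writing to the image.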
// RepairThickProvision writes zero bytes to the volume so that it will be
// completely allocated. In case the volume is already marked as
// thick-provisioned, nothing will be done.
func (rv *rbdVolume) RepairThickProvision() error {
AFAICT, this function does the complete process again, I mean the writing of zero bytes from the start; please correct me if I am wrong here @nixpanic. If the CO aborted due to a timeout in the first or previous cycles, this is going to give mostly the same result in subsequent calls too. In short, we land in a loop of requests. Instead, I was wondering: is there a way to know the progress of the previous attempt and maybe proceed from there? If yes, it could help us speed up the operation and also avoid landing in the same situation again and again @nixpanic
Yes, if the process keeps on getting interrupted, there will be little progress made. We might be able to improve things by continuing where the previously interrupted allocating was aborted. There are ways to track progress, and it is possible to detect what parts of an image are allocated. However, that is not trivial and can be done as an enhancement later.
Remember that restarting the allocating is only done when the provisioner was restarted or another fatal error occurred. Optimizing behaviour for error cases is not a high priority.
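As a rough idea of what that enhancement could look like, here is a hedged sketch (not the PR's code) of resuming the zeroing from a stored offset. The metadata key, the chunk size and the go-ceph based signature are illustrative assumptions:

```go
package sketch

import (
	"strconv"

	"github.com/ceph/go-ceph/rbd"
)

// progressKey is a hypothetical metadata key; it is not something ceph-csi
// actually sets, it only illustrates persisting progress on the image.
const progressKey = "example.thick-provision-offset"

// resumeThickAllocate sketches the enhancement discussed above: zero the
// image in chunks and record how far we got in the image metadata, so a
// restarted provisioner can continue instead of starting from offset 0.
func resumeThickAllocate(img *rbd.Image) error {
	size, err := img.GetSize()
	if err != nil {
		return err
	}

	// resume from the last recorded offset, if any
	var offset uint64
	if v, err := img.GetMetadata(progressKey); err == nil {
		offset, _ = strconv.ParseUint(v, 10, 64)
	}

	zeros := make([]byte, 4*1024*1024) // 4 MiB per write
	for offset < size {
		chunk := uint64(len(zeros))
		if size-offset < chunk {
			chunk = size - offset
		}
		if _, err := img.WriteAt(zeros[:chunk], int64(offset)); err != nil {
			return err
		}
		offset += chunk

		// persist the progress so an interrupted run can pick up here
		if err := img.SetMetadata(progressKey, strconv.FormatUint(offset, 10)); err != nil {
			return err
		}
	}

	return img.Flush()
}
```

A real implementation would also have to clear the progress marker and set the final thick-provisioned metadata in a way that is safe against concurrent CreateVolume retries.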
@nixpanic do we need to keep a tracker issue for this ?
LGTM, just a suggestion left for possible improvement
In case the Pod/node becomes unreachable by Kubernetes, the allocating of the image can possibly continue on the non-leading provisioner. Retries on the (new) leading provisioner will assume that provisioning is continuing, which may be correct. If the non-leading provisioner is not doing anything anymore and the librbd connection was not closed, eventually there will be a timeout and re-provisioning will be done.
I'm a little scared by the mention of non-leading provisioner(s) and the general reliance on watches for checking whether the image is in use. I have raised this in some fencing-related issues and PRs before, but once again: while a presence of a watch can be used to tell whether the image is in use (best effort check), the reverse is not true. A lack of a watch is not a reliable indicator of pretty much anything -- the image may still be opened and written to!
In this case, what would prevent the non-leading provisioner that is not reachable by k8s from waking up and continuing its zeroing after the image has been (re)zeroed by the now leading provisioner and handed off to the user? This seems like a potential data corruption scenario to me.
@idryomov: when a watch is lost (initially added once the image was opened), can the opened image still be used for writing? I expected that the rbd-image-handle would become invalid, and that writing would return a failure. If there is a better recommendation than checking watches on RBD images, then we can surely implement it that way. Until there is a usable alternative, we probably will need to stick with the watchers for now.
Yes. A lost watch would be reestablished in due course. Also, it may not know that the watch is lost for a while, etc.
Ultimately it has to be something that would fence other potential writers at the Ceph level. It seems to be another instance of #578 and similar issues. Whereas before the only concern was the entity that mapped the image (since it was the only potential writer in the system), now that the provisioner can also write to the image it needs to be treated as a potential writer and be subjected to the same safeguards.
Sure. @idryomov, do you have final comments on this PR? If not, we will go ahead and merge this 👍. Thanks for reviewing!
Nothing substantial. Honestly, the entire scheme with k8s timing out create requests one after another and firing retries at random provisioner instances while zeroing is in progress seems very clumsy to me. I just hope that thick provisioning isn't going to be widely used because otherwise the current implementation will bite pretty soon.
I was wondering if we could resume from where the previous attempt halted (if we can get the completed progress: #2101 (comment)); that could mitigate this problem to an extent. @idryomov is this something we can get support for/improve from the Ceph side, especially for operations like
As per current reports from the field, it looks like this is going to be used to a heavy extent. It is becoming the default choice for VM use cases.
Theoretically you could infer where zeroing got interrupted by looking at the provisioned size (i.e. how many
I wonder why? Do you have the details or some links I could peruse?
@oritwas @mykaul @obnoxxx if possible, can you please share the details of thick provision use cases?
My impression is that this is mainly provided for users that need more accurate accounting of storage consumption. When an environment is sufficiently planned for the expected workloads, thin-provisioning would surely be the recommendation. It might be difficult to estimate actual storage consumption for VM workloads, but any deployment should make sure that sufficient capacity is available, thick-provisioned or thin. Thick-provisioning can be used as a safety-net when an environment does not have sufficient storage capacity for the planned workload: in that case it can prevent workloads from starting and the storage cluster from running out of space (a state from which recovery becomes very difficult).
As discussed in today's call, let's go ahead with this interim optimization now and catch up on the improvements along the way.
Move the repairing of a volume/snapshot from CreateVolume to its own function. This reduces the complexity of the code, and makes the procedure easier to understand. Further enhancements to repairing an existing volume can be done in the new function. Signed-off-by: Niels de Vos <[email protected]>
Signed-off-by: Niels de Vos <[email protected]>
Force-pushed from b05ce42 to 532ca58
Describe what this PR does
In case CreateVolume gets interrupted (by provisioner pod restart, or similar) the slow thick-provisioning is aborted. The RBD image will not have the `thick-provisioned` metadata set. Restarting CreateVolume will detect the RBD image, and return the volume which is not completely thick-provisioned.

By validating that the CreateVolume request intends to thick-provision the volume, it is possible to detect the need for the `thick-provisioned` metadata in the RBD image. When the metadata is missing, start allocating the image again.

Is there anything that requires special attention

If the RBD image has watchers (`isInUse()` returns true), assume that the allocating of the image is still in progress. Because thick-provisioning is slow, Kubernetes tends to time-out the initial CreateVolume request(s). It is expected that the `not going to restart thick-provisioning of in-use image` error is reported when Kubernetes restarts the CreateVolume request.

Related issues
Fixes: https://bugzilla.redhat.com/1961647
Show available bot commands
These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:
/retest ci/centos/<job-name>: retest the <job-name> after an unrelated failure (please report the failure too!)
/retest all: run this in case the CentOS CI failed to start/report any test progress or results