
[drenv] ramen-dr-cluster pods crash on managed clusters because the volsync deployment is missing in regional-dr-kubevirt.yaml #1787

Open
pruthvitd opened this issue Jan 30, 2025 · 4 comments


@pruthvitd
Member

The RamenDR deployment on the managed clusters expects the VolSync CRDs to be installed in every use case, whether the environment uses ceph-rbd only or cephfs only.
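A quick way to confirm this on the managed clusters is to check for the VolSync CRDs directly (a minimal sketch using the dr1/dr2 contexts from the outputs below):

# Check for the VolSync CRDs that the ramen dr-cluster operator watches.
oc get crd replicationsources.volsync.backube replicationdestinations.volsync.backube --context dr1
oc get crd replicationsources.volsync.backube replicationdestinations.volsync.backube --context dr2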

Ramen pod restarts:

(ramen) [root@ramen1 ramen]# oc get po -n ramen-system --context dr1
NAME                                         READY   STATUS    RESTARTS        AGE
ramen-dr-cluster-operator-7b7c44f9dd-gm6kv   2/2     Running   2 (2m16s ago)   6m45s
(ramen) [root@ramen1 ramen]# oc get po -n ramen-system --context dr2
NAME                                         READY   STATUS    RESTARTS        AGE
ramen-dr-cluster-operator-7b7c44f9dd-x9sn9   2/2     Running   2 (2m21s ago)   6m47s

Reason for the pod crash:

(ramen) [root@ramen1 ramen]# oc logs -f ramen-dr-cluster-operator-7b7c44f9dd-gm6kv -n ramen-system --context dr1 --previous
2025-01-30T07:16:15.912Z	INFO	setup	controller/ramenconfig.go:66	loading Ramen configuration from 	{"file": "/config/ramen_manager_config.yaml"}
2025-01-30T07:16:15.914Z	INFO	setup	cmd/main.go:103	controller type	{"type": "dr-cluster"}
2025-01-30T07:16:16.024Z	INFO	controllers.VolumeReplicationGroup	controller/volumereplicationgroup_controller.go:68	Adding VolumeReplicationGroup controller
2025-01-30T07:16:16.118Z	INFO	controllers.VolumeReplicationGroup	controller/volumereplicationgroup_controller.go:106	VolSync enabled; adding owns and watches
2025-01-30T07:16:16.118Z	INFO	controllers.VolumeReplicationGroup	controller/volumereplicationgroup_controller.go:1793	Kube object protection enabled; watch kube objects requests
:
:
2025-01-30T07:18:35.812Z	ERROR	controller-runtime.source.EventHandler	source/kind.go:71	if kind is a CRD, it should be installed before calling Start	{"kind": "ReplicationSource.volsync.backube", "error": "no matches for kind \"ReplicationSource\" in version \"volsync.backube/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:71
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:64
2025-01-30T07:18:36.008Z	ERROR	controller/controller.go:200	Could not wait for Cache to sync	{"controller": "volumereplicationgroup", "controllerGroup": "ramendr.openshift.io", "controllerKind": "VolumeReplicationGroup", "error": "failed to wait for volumereplicationgroup caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.ReplicationDestination"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:200
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:231
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:226
2025-01-30T07:18:36.009Z	INFO	manager/internal.go:538	Stopping and waiting for non leader election runnables
2025-01-30T07:18:36.009Z	INFO	manager/internal.go:542	Stopping and waiting for leader election runnables
2025-01-30T07:18:36.010Z	INFO	controller/controller.go:237	Shutdown signal received, waiting for all workers to finish	{"controller": "drclusterconfig", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRClusterConfig"}
2025-01-30T07:18:36.010Z	INFO	controller/controller.go:237	Shutdown signal received, waiting for all workers to finish	{"controller": "protectedvolumereplicationgrouplist", "controllerGroup": "ramendr.openshift.io", "controllerKind": "ProtectedVolumeReplicationGroupList"}
2025-01-30T07:18:36.010Z	INFO	controller/controller.go:239	All workers finished	{"controller": "drclusterconfig", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRClusterConfig"}
2025-01-30T07:18:36.010Z	INFO	controller/controller.go:217	Starting workers	{"controller": "replicationgroupdestination", "controllerGroup": "ramendr.openshift.io", "controllerKind": "ReplicationGroupDestination", "worker count": 50}
2025-01-30T07:18:36.010Z	INFO	controller/controller.go:237	Shutdown signal received, waiting for all workers to finish	{"controller": "replicationgroupdestination", "controllerGroup": "ramendr.openshift.io", "controllerKind": "ReplicationGroupDestination"}
2025-01-30T07:18:36.010Z	INFO	controller/controller.go:239	All workers finished	{"controller": "replicationgroupdestination", "controllerGroup": "ramendr.openshift.io", "controllerKind": "ReplicationGroupDestination"}
2025-01-30T07:18:36.010Z	INFO	controller/controller.go:217	Starting workers	{"controller": "replicationgroupsource", "controllerGroup": "ramendr.openshift.io", "controllerKind": "ReplicationGroupSource", "worker count": 50}
2025-01-30T07:18:36.010Z	INFO	controller/controller.go:239	All workers finished	{"controller": "protectedvolumereplicationgrouplist", "controllerGroup": "ramendr.openshift.io", "controllerKind": "ProtectedVolumeReplicationGroupList"}
2025-01-30T07:18:36.011Z	INFO	controller/controller.go:237	Shutdown signal received, waiting for all workers to finish	{"controller": "replicationgroupsource", "controllerGroup": "ramendr.openshift.io", "controllerKind": "ReplicationGroupSource"}
2025-01-30T07:18:36.011Z	INFO	controller/controller.go:239	All workers finished	{"controller": "replicationgroupsource", "controllerGroup": "ramendr.openshift.io", "controllerKind": "ReplicationGroupSource"}
2025-01-30T07:18:36.011Z	INFO	manager/internal.go:550	Stopping and waiting for caches
2025-01-30T07:18:36.013Z	INFO	manager/internal.go:554	Stopping and waiting for webhooks
2025-01-30T07:18:36.013Z	INFO	manager/internal.go:557	Stopping and waiting for HTTP servers
2025-01-30T07:18:36.013Z	INFO	controller-runtime.metrics	server/server.go:254	Shutting down metrics server with timeout of 1 minute
2025-01-30T07:18:36.013Z	INFO	manager/server.go:68	shutting down server	{"name": "health probe", "addr": "[::]:8081"}
2025-01-30T07:18:36.107Z	INFO	manager/internal.go:561	Wait completed, proceeding to shutdown the manager
2025-01-30T07:18:36.110Z	ERROR	setup	cmd/main.go:291	problem running manager	{"error": "failed to wait for volumereplicationgroup caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.ReplicationDestination"}
main.main
	/workspace/cmd/main.go:291
runtime.main
	/usr/local/go/src/runtime/proc.go:271
@pruthvitd
Member Author

After the ramen pod restarts, volsync is disabled:

(ramen) [root@ramen1 ramen]# oc logs -f ramen-dr-cluster-operator-7b7c44f9dd-gm6kv -n ramen-system --context dr1
2025-01-30T07:18:49.707Z	INFO	setup	controller/ramenconfig.go:66	loading Ramen configuration from 	{"file": "/config/ramen_manager_config.yaml"}
2025-01-30T07:18:49.713Z	INFO	setup	controller/ramenconfig.go:76	s3 profile	{"key": 0, "value": {"s3ProfileName":"minio-on-dr1","s3Bucket":"bucket","s3CompatibleEndpoint":"http://192.168.122.41:30000","s3Region":"us-west-1","s3SecretRef":{"name":"ramen-s3-secret-dr1","namespace":"ramen-system"}}}
2025-01-30T07:18:49.713Z	INFO	setup	controller/ramenconfig.go:76	s3 profile	{"key": 1, "value": {"s3ProfileName":"minio-on-dr2","s3Bucket":"bucket","s3CompatibleEndpoint":"http://192.168.122.178:30000","s3Region":"us-east-1","s3SecretRef":{"name":"ramen-s3-secret-dr2","namespace":"ramen-system"}}}
2025-01-30T07:18:49.713Z	INFO	setup	cmd/main.go:103	controller type	{"type": "dr-cluster"}
2025-01-30T07:18:49.829Z	INFO	controllers.VolumeReplicationGroup	controller/volumereplicationgroup_controller.go:68	Adding VolumeReplicationGroup controller
2025-01-30T07:18:49.911Z	INFO	controllers.VolumeReplicationGroup	controller/volumereplicationgroup_controller.go:109	VolSync disabled; don't own volsync resources
2025-01-30T07:18:49.911Z	INFO	controllers.VolumeReplicationGroup	controller/volumereplicationgroup_controller.go:1793	Kube object protection enabled; watch kube objects requests
2025-01-30T07:18:50.424Z	INFO	setup	cmd/main.go:288	starting manager
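
For reference, the effective VolSync setting can be inspected in the operator config on the managed cluster (a minimal sketch; the configmap name ramen-dr-cluster-operator-config is an assumption):

# Show the volSync section of the Ramen dr-cluster operator configuration.
oc get configmap ramen-dr-cluster-operator-config -n ramen-system --context dr1 -o yaml | grep -A1 volSync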

@raghavendra-talur
Member

It is possible that ramenctl deploy deployed the default configmap, and that only ramenctl config later changed the configuration based on the test environment provided. @nirs

@nirs
Member

nirs commented Jan 30, 2025

It is possible that ramenctl deploy deployed the default configmap, and that only ramenctl config later changed the configuration based on the test environment provided. @nirs

Yes, this is the way it should work. You must run both ramenctl deploy and ramenctl config to get a working setup.
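
For reference, a minimal sketch of that workflow (the environment file path is an assumption; pass the env file that matches your setup):

# Deploy ramen on the clusters, then apply the environment-specific configuration.
ramenctl deploy test/envs/regional-dr-kubevirt.yaml
ramenctl config test/envs/regional-dr-kubevirt.yaml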

We have an issue to merge the config command into deploy to simplify the common developer workflow.

@nirs
Member

nirs commented Jan 30, 2025

@pruthvitd if this is about an unexpected Ramen crash when using the regional-dr-kubevirt environment, we can close this issue since the behavior is expected.
