Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding initiator validation for optimised and inaccessible paths #4468

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

manasagowri
Copy link
Contributor

@manasagowri manasagowri commented Feb 22, 2025

Adding initiator side validation for NVMe HA to verify optimised and inaccessible paths.

Logic written:

  1. Fetch namespaces being served by the gateway to be failed in failover
  2. Fetch corresponding devices from initiator and verify that the given gateway is optimised for them before failover
  3. Fetch the gateway serving the above namespaces post failover
  4. Fetch corresponding devices from initiator and verify that the new gateway is optimised for them after failover
  5. Fetch corresponding devices from initiator and verify that the original gateway is optimised for them after failback

Logs - http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-LE1K03/

Copy link
Contributor

@sunilkumarn417 sunilkumarn417 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @manasagowri ,
Thanks for adding up new validation at initiator level. There are few comments on the code changes.
Lets discuss more over these comments.

@@ -692,6 +692,10 @@ def failover(self, gateway, fail_tool):
LOG.info(
f"{list(active[0])} is new and only Active GW for failed {hostname}"
)
active_gw = list(active[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Firstly, lets add client namespace validation in ha.run. Lets not touch main definition.
  2. Return back the active gateway node which is incharge of failed over GW in dict.

@@ -896,6 +905,26 @@ def validate_incremetal_io(write_samples):

LOG.info("IO Validation is Successfull on all RBD images..")

def validate_initiator(self, gateway, namespaces, ana_id):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets call this from ha.run()

fail_gw_ana_ids.append(gw.ana_group_id)
ns = self.fetch_namespaces(gw, [gw.ana_group_id])
namespaces.extend(ns)
self.validate_initiator(gw, ns, gw.ana_group_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In validate_initiator,

  1. At Initial stage, pass active gateway node object(which has to be failed over next).
  2. Fetch namesapces for this GW from gateway and client side.
  3. Validate optimized path(i.e, basically Active gateway which is passed).
  4. At Failover stage, collect the active gateway from failover methods.
  5. Use this Active GW against namespaces found in Step 2 to validate Optimized path on client is set to new active GW.
  6. At failback condition, collect the failed-over gateway.
  7. Repeat step 2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exactly the validation I have already added in the validate_initiator and ha.run methods

empty list if the gateway is inaccessible for all devices
"""
out, _ = self.node.exec_command(
cmd="nvme list --output-format json", sudo=True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already part of nvme initiator lib, if not add this lib method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw the lib method now. Will make changes accordingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not be required when we go with the uuid approach as discussed

out, _ = self.node.exec_command(
cmd="nvme list --output-format json", sudo=True
)
nvme_devices = json.loads(out).get("Devices", [])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At client side,

  • Need to identify the right devices against the namespace passed from GW side(probably using lsblk -o name,wwn).
  • Get the topology for that device(i.e get the path status for each GW path).
  • Validate against the GW which has optimized path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvme list-subsys gives the namespace id and subsystem name, isn't this enough proof to get the required namespaces?

subsystems = json.loads(out)[0].get("Subsystems")
for subsys in subsystems:
for path in subsys.get("Paths"):
if gw_ip in path.get("Address") and path.get("State") == "live" and path.get("ANAState") == "optimized":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create separate method to list out optimized and inaccessible paths for a device.
{"device_wwn": {"optimized": []}, {"inaccessible": []}. use it for validation of paths

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do this change as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants