Script: Add convert .gt file format to .csv.gz + docs #62

Merged: 7 commits merged on Mar 25, 2025

@absternator absternator commented Mar 20, 2025

This PR adds a script that converts a .gt (graph-tool) file to a .csv.gz file. The conversion is needed because the GPU graph library cugraph cannot read the .gt file format.

This has already been done for all databases, and the converted files are already in mrcdata/beebop.

There are also some docs updates on running beebop with a GPU.

Testing:
Run the Python script and check that the .csv.gz file shows up, e.g.: `python gt-to-csv-gz.py -i /home/$USER/code/beebop_py/storage/dbs/gas_database/gas_database_graph.gt -o /home/$USER/code/beebop_py/storage/dbs/gas_database/gas_database_graph.csv.gz`
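
For readers unfamiliar with the conversion, here is a minimal sketch of what a script like this might do. It is not the script from this PR: the function names, the `source`/`destination` header, and the choice to write an unweighted edge list are all assumptions for illustration, and the layout that cugraph or PopPUNK actually expects should be checked against the real script. graph-tool is imported lazily so the CSV helper can be exercised on its own.

```python
import csv
import gzip


def write_edges_csv_gz(edges, output_path):
    """Write (source, destination) pairs to a gzip-compressed CSV file."""
    # newline="" lets the csv module control line endings itself.
    with gzip.open(output_path, "wt", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["source", "destination"])  # assumed header names
        for source, destination in edges:
            writer.writerow([int(source), int(destination)])


def convert_gt_to_csv_gz(input_path, output_path):
    """Load a .gt graph with graph-tool and dump its edge list as .csv.gz."""
    # Imported here so the helper above works without graph-tool installed.
    import graph_tool.all as gt

    graph = gt.load_graph(str(input_path))
    # get_edges() returns an array with one (source, target) row per edge.
    write_edges_csv_gz(graph.get_edges(), output_path)
```

Under these assumptions, `convert_gt_to_csv_gz("gas_database_graph.gt", "gas_database_graph.csv.gz")` would mirror the `-i`/`-o` invocation above.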

@EmmaLRussell EmmaLRussell left a comment

The script ran for me once I got the deps installed! I wasn't able to unzip the .gz with the built-in Extract in Ubuntu, but I guess you've tested that these files work with cugraph!
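
One way to check the archive without a desktop extractor is to read it with Python's standard library. The sketch below (the function name is hypothetical, and it assumes the file is plain CSV text inside a gzip stream) returns the first few rows so the contents can be inspected by eye.

```python
import csv
import gzip


def preview_csv_gz(path, limit=5):
    """Return the first `limit` rows of a gzip-compressed CSV file."""
    rows = []
    # Reading raises OSError (gzip.BadGzipFile on Python 3.8+) if the
    # file is not a valid gzip stream.
    with gzip.open(path, "rt", newline="") as handle:
        for index, row in enumerate(csv.reader(handle)):
            if index >= limit:
                break
            rows.append(row)
    return rows
```

If this returns sensible rows, the file is a readable gzip stream even when a GUI extractor refuses to open it.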

Couple of little docs suggestions.

README.md Outdated

- The host machine has a GPU and the NVIDIA drivers and cuda-toolkit are installed with correct versions. [Check the NVIDIA documentation](https://docs.nvidia.com/cuda/cuda-installation-guide-linux)
- The necessary libraries for GPU support are installed in your environment.[Check RAPIDS documentation](https://rapids.ai/)
- Ensure any new databases have *graph.csv.gz* file. If not run script in scripts folder: `python gt-to-csv-gz.py` with
Suggested change:
-  - Ensure any new databases have *graph.csv.gz* file. If not run script in scripts folder: `python gt-to-csv-gz.py` with
+  - Ensure any new PopPUNK databases have *graph.csv.gz* file. If not run script in scripts folder: `python gt-to-csv-gz.py` with

README.md Outdated
- `--input` the path to the graph .gt file
- `--output` the path to the output csv.gz file
- In `args.json` set the `gpu_graph` and `gpu_dist` to `True` for both `assign` and `visualise` fields.
*Note: may need to reinstall and **pp-sketchlib, PopPUNK and mandrake** to ensure CUDA enabled versions are installed*
Suggested change:
-  *Note: may need to reinstall and **pp-sketchlib, PopPUNK and mandrake** to ensure CUDA enabled versions are installed*
+  *Note: may need to reinstall **pp-sketchlib, PopPUNK and mandrake** to ensure CUDA enabled versions are installed*

Would it be easy to say here what the minimum GPU-enabled versions of these libraries are?
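
For the `args.json` step in the README excerpt above, a small sketch of switching the flags on programmatically follows. The layout is an assumption: it supposes `assign` and `visualise` are top-level objects in `args.json` that hold `gpu_graph` and `gpu_dist` keys, which this conversation does not confirm. The helper name is hypothetical.

```python
import json


def enable_gpu(args, sections=("assign", "visualise")):
    """Return a copy of the args mapping with the GPU flags switched on."""
    # Round-trip through JSON to get a deep copy of JSON-compatible data.
    updated = json.loads(json.dumps(args))
    for section in sections:
        # Assumed layout: each section is a top-level object in args.json.
        updated.setdefault(section, {})
        updated[section]["gpu_graph"] = True
        updated[section]["gpu_dist"] = True
    return updated
```

Note that Python's `True` is serialised as `true` when the result is written back with `json.dump`.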

README.md Outdated
- The necessary libraries for GPU support are installed in your environment.[Check RAPIDS documentation](https://rapids.ai/)
- Ensure any new databases have *graph.csv.gz* file. If not run script in scripts folder: `python gt-to-csv-gz.py` with
- `--input` the path to the graph .gt file
- `--output` the path to the output csv.gz file
How does PopPUNK know where to find this? Should it always have the same name as the .gt file and be output to the same location? If so, maybe the script can assume that and it shouldn't be a parameter, or it should be the default, or it should at least be documented that this is what the `--output` value should be for normal operation.

Contributor Author

Oh yup, I've just added a note to say what it should be.

Contributor Author

Actually yeah, I'm stupid; I have removed `--output` completely now.
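
A sketch of how the output location could be derived once `--output` is gone, assuming (as the review suggests) that the .csv.gz file should sit beside the .gt file under the same base name. The function names are hypothetical, and the merged script may do this differently.

```python
import argparse
from pathlib import Path


def default_output_path(input_path):
    """Return the .csv.gz path that sits next to the given .gt file."""
    # with_suffix replaces only the final suffix (".gt" here).
    return Path(input_path).with_suffix(".csv.gz")


def parse_args(argv=None):
    """Parse the command line; only the input path is accepted."""
    parser = argparse.ArgumentParser(description="Convert a .gt graph to .csv.gz")
    parser.add_argument("-i", "--input", required=True, type=Path,
                        help="path to the graph .gt file")
    return parser.parse_args(argv)
```

Deriving the path this way means the caller cannot put the file somewhere the downstream loader would not look.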

@absternator absternator merged commit abab580 into main Mar 25, 2025
4 checks passed
@absternator absternator deleted the gt-csv branch March 25, 2025 10:48