This guide covers the basics of using the Gadi supercomputer: how to submit jobs, manage environments, work within the file-count limits, and run jobs that exceed the 48-hour walltime limit.
- `home`: This directory is for your environment settings and code. It offers 10GB of space with no file-count limit.
- `/g/data`: This directory is for storing your data. Warning: it has a file-count limit, so it is recommended to tar your data to avoid hitting it.
- `/scratch`: This is a temporary directory, and it also has limits. I personally don't recommend storing anything here; if this folder becomes full, it can prevent everyone in your project from using Gadi.
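To see how close you are to these limits, you can check your quotas from a login node. The sketch below assumes NCI's `lquota` reporting tool is available; confirm the exact commands against the NCI documentation.

```bash
# Report usage, space limits, and file-count (inode) limits for the
# /scratch and /g/data areas of your projects (assumed NCI tool).
lquota

# Standard Linux quota report for your home directory.
quota -s
```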
An interactive job allows you to use computing resources directly for debugging. This method is useful when you need to test and troubleshoot your code in real time.
There are three example interactive job scripts, each corresponding to a different type of resource: `interactive_a100.sh`, `interactive_v100.sh`, and `interactive_cpu.sh`. These demonstrate how to use an A100 GPU, a V100 GPU, and CPUs only, respectively.
- Interactive V100 GPU

To request an interactive job using a V100 GPU, use the following command:

```bash
qsub -I -q gpuvolta -P wa66 -l walltime=5:00:00,ncpus=12,ngpus=1,mem=90GB,jobfs=300GB,storage=gdata/wa66+gdata/po67+gdata/ey69+gdata/iv96,wd
```
Taking the V100 GPU as an example:

- The `-q` option specifies the queue, which differs for CPU, V100, and A100 jobs; refer to the three example scripts for the exact queue names (a sketch of the A100 and CPU-only variants follows this list).
- For the V100, each GPU must be requested together with 12 CPUs (an A100 requires 16 CPUs).
- `mem` stands for memory. 90GB of memory per GPU is sufficient for most tasks.
- `jobfs` is temporary storage space on the allocated node, with a maximum of 300GB.
- `storage` lists the `/g/data` spaces to mount; entries from multiple projects can be chained with `+`. Here, I mount storage from four projects simultaneously.
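For comparison, here is a minimal sketch of the equivalent A100 and CPU-only requests. The queue names `dgxa100` and `normal` are assumptions; confirm them against `interactive_a100.sh` and `interactive_cpu.sh`.

```bash
# A100: one GPU requires 16 CPUs (queue name assumed; check interactive_a100.sh).
qsub -I -q dgxa100 -P wa66 -l walltime=5:00:00,ncpus=16,ngpus=1,mem=90GB,jobfs=300GB,storage=gdata/wa66,wd

# CPU only: no ngpus request (queue name assumed; check interactive_cpu.sh).
qsub -I -q normal -P wa66 -l walltime=5:00:00,ncpus=12,mem=90GB,jobfs=300GB,storage=gdata/wa66,wd
```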
A batch job is submitted and runs in the background. This method is ideal for long computations that do not require real-time interaction (48 hours at most).
In the `example` folder, `batch_job_example.sh` is an example of a batch job. The resource requests are similar to an interactive job's. When you need to submit it, use the following command:

```bash
qsub batch_job_example.sh   # replace with your own job file
```
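For reference, a batch script is a shell script whose `#PBS` directives mirror the `qsub` options shown above. Here is a minimal sketch; the resource values are copied from the V100 example, while the environment name `myenv` and `train.py` are placeholders for your own setup:

```bash
#!/bin/bash
#PBS -q gpuvolta
#PBS -P wa66
#PBS -l walltime=48:00:00
#PBS -l ncpus=12
#PBS -l ngpus=1
#PBS -l mem=90GB
#PBS -l jobfs=300GB
#PBS -l storage=gdata/wa66
#PBS -l wd

# Activate the conda environment set up later in this guide.
source $HOME/miniconda/bin/activate
conda activate myenv

# Placeholder for your actual workload.
python train.py
```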
Recommended: Gadi supports the `module load` method for configuring environments. You can load Python and then use it to build your environment. Gadi provides multiple versions of PyTorch and CUDA, all of which can be loaded the same way.

For example, to load Python, you can use the following command:

```bash
module load python/3.x.x
```

For more details, please refer to the Environment Modules link.
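To see which versions are actually installed before loading, the standard Environment Modules commands below work; the version strings in the `module load` lines are illustrative, not real Gadi module names, so check the `module avail` output first.

```bash
# List matching modules (exact names and versions vary on Gadi).
module avail python
module avail pytorch
module avail cuda

# Load specific versions once you know the exact names.
module load pytorch/1.10.0   # illustrative version string
module load cuda/11.4.1      # illustrative version string
```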
Due to the file-count limit in the `/g/data` directory on Gadi, we cannot install Miniconda there. However, since the `home` directory only has 10GB of space, some tricks are needed when installing Miniconda.

When setting up a deep learning environment, `conda install pytorch` often exceeds the 10GB limit because of the additional packages it pulls in. Therefore, I recommend using `pip install` as much as possible. Based on my tests, this method keeps the size within the 10GB limit.
Here is a basic guide to setting up a Miniconda environment:
- Download and install Miniconda in your `home` directory:

  ```bash
  wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda
  ```

- Initialize Miniconda:

  ```bash
  source $HOME/miniconda/bin/activate
  ```

- Create a new conda environment:

  ```bash
  conda create -n myenv python=3.x
  ```

- Activate the new environment:

  ```bash
  conda activate myenv
  ```

- Install necessary packages using `pip`:

  ```bash
  pip install torch
  pip install torchvision
  pip install other_packages
  ```
By following these steps and using `pip install`, you can effectively manage the space and file-count limits on Gadi.
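If you want to verify the footprint or claw back space, these standard commands may help (a sketch; the paths assume the `$HOME/miniconda` prefix used above):

```bash
du -sh $HOME/miniconda   # check how much of the 10GB home quota is used
conda clean -a -y        # delete cached package tarballs and index caches
pip cache purge          # drop pip's download cache (pip 20.1+)
```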
- Using `tar` to Package Files: You can package your files using `tar` (without compression). When you need the dataset, extract it into the `$PBS_JOBFS` temporary directory. This directory is on the node where your resources are allocated, and you decide how much space to allocate (up to 300GB, with no file-count limit). The data in this temporary folder is deleted after the job ends. Example command to untar data (see also the sketch after this list):

  ```bash
  tar -xf /g/data/wa66/Xiangyu/Data/LibriSpeech.tar -C $PBS_JOBFS
  ```

  For more details, refer to `batch_job_example.sh`.

- Using the `transformers` Package Dataset Class: The Hugging Face `datasets` library, commonly used alongside `transformers`, lets you consolidate your data into a small number of files. This method can also help you manage large datasets efficiently. For more details, please visit the Hugging Face Datasets documentation.

- Using the Kaldi-Supported `flac.ark` Format: Another option is the `flac.ark` archive format supported by Kaldi, which packs many audio files into a single file and so helps manage large numbers of audio files efficiently.
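Here is a minimal sketch of the `tar` workflow end to end. The dataset path mirrors the example above, and `train.py` with its `--data_dir` flag is a placeholder for your own entry point:

```bash
# One-off, on a login node: pack the dataset without compression,
# so /g/data holds one file instead of many.
tar -cf /g/data/wa66/Xiangyu/Data/LibriSpeech.tar -C /g/data/wa66/Xiangyu/Data LibriSpeech

# Inside a job script: extract to node-local jobfs, then point your code at it.
tar -xf /g/data/wa66/Xiangyu/Data/LibriSpeech.tar -C $PBS_JOBFS
python train.py --data_dir $PBS_JOBFS/LibriSpeech
```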
Gadi has a maximum job runtime of 48 hours. If you need to run a job for longer than this, you can refer to the `self_submit.sh` script, which contains various methods and examples. Here, I will provide the simplest method.

To automatically resubmit a job after it finishes, you can use the following command:

```bash
qsub -z -W depend=afterany:PBS_JOBID PBS_JOBNAME
```
In this example:

- `PBS_JOBID` is the ID of your currently running job (you can find it using `qstat`).
- `PBS_JOBNAME` is the script of the job you want to continue running, or a new job.
For instance, if your job is `batch_job_example.sh` and it cannot complete within 48 hours, then after submitting it with `qsub batch_job_example.sh` you will receive a job ID (e.g., 1234). You can then queue the follow-up run using:

```bash
qsub -z -W depend=afterany:1234 batch_job_example.sh
```
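A job can also queue its own successor from inside the script, so the chain continues without manual resubmission. Here is a minimal sketch, assuming your program checkpoints and resumes on its own; `$PBS_JOBID` and `$PBS_O_WORKDIR` are set by PBS inside a running job, and `train.py --resume` is a placeholder:

```bash
#!/bin/bash
#PBS -q gpuvolta
#PBS -P wa66
#PBS -l walltime=48:00:00,ncpus=12,ngpus=1,mem=90GB,jobfs=300GB,storage=gdata/wa66,wd

# Queue the next 48-hour chunk; it stays held until this job ends.
qsub -z -W depend=afterany:$PBS_JOBID $PBS_O_WORKDIR/batch_job_example.sh

# Run the checkpointed workload. When the run finally finishes,
# remember to qdel the last queued successor.
python train.py --resume
```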