
Note
I recommend using nf-core/fetchngs. This repository is no longer supported as of 2023.
A minimal pipeline to download FASTQ files from SRA given a list of accession IDs.
See installation for more details.
# Suggestion: replace main with a version from the releases
nextflow run telatin/getreads -r main -profile docker \
--list list.txt --outdir downloaded-reads/
Where:
- --list "list.txt" is a list of SRA accession IDs in simple text format
- --outdir "name" is the name of the output directory
- --wait INT is the number of seconds to wait after running ffq [default: 2]
- -profile docker will use Docker for dependencies. An easy alternative is to create a conda environment using deps/env.yaml. Singularity is supported but untested (clusters with Singularity are usually offline anyway)
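The `--list` file is described only as a list of accessions in simple text format. The sketch below assumes one accession per line and shows how such a file could be read back in Python. The helper `read_accessions` is hypothetical and not part of the pipeline, and skipping blank lines and `#` comments is an assumption rather than documented behavior.

```python
from pathlib import Path


def read_accessions(path):
    """Return accession IDs from a text file, one per line (assumed format)."""
    accessions = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skipping blank lines and '#' comments is an assumption of this sketch.
        if not line or line.startswith("#"):
            continue
        accessions.append(line)
    return accessions
```

Calling `read_accessions("list.txt")` before launching the pipeline is a cheap way to spot stray whitespace or empty lines in the list.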
The output directory contains:
- 📁 json (JSON file, one for each accession)
- 📁 urls (text files with the download URIs)
- 📁 reads (FASTQ.gz files, a set per accession)
- 🗒️ stats.txt (reads statistics)
- 🗒️ check.txt (a report of the number of files downloaded per ID, checking that the files of each ID contain the same number of reads)
- 🗒️ table.tsv (metadata table built from the JSON files, only for samples where ffq didn't fail; new in 2.0)
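The kind of check summarized in check.txt can be approximated independently. The sketch below is not the pipeline's own code (the pipeline collects statistics with seqfu) and does not reproduce the check.txt format. It assumes files in the reads directory are named `<accession>.fastq.gz` or `<accession>_<n>.fastq.gz` and that each FASTQ record spans exactly four lines. `count_reads` and `check_reads` are hypothetical helper names.

```python
import gzip
from collections import defaultdict
from pathlib import Path


def count_reads(fastq_gz):
    """Count records in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(fastq_gz, "rt") as handle:
        return sum(1 for _ in handle) // 4


def check_reads(reads_dir):
    """Group files by accession and report whether their read counts agree."""
    by_accession = defaultdict(dict)
    for path in sorted(Path(reads_dir).glob("*.fastq.gz")):
        # Assumed naming: <accession>.fastq.gz or <accession>_<n>.fastq.gz
        accession = path.name[: -len(".fastq.gz")].split("_")[0]
        by_accession[accession][path.name] = count_reads(path)

    report = {}
    for accession, counts in by_accession.items():
        report[accession] = {
            "files": len(counts),
            "counts": counts,
            # Consistent when every file of the accession has the same count.
            "consistent": len(set(counts.values())) == 1,
        }
    return report
```

For example, `check_reads("downloaded-reads/reads")` returns one entry per accession, and any entry with `"consistent": False` points to a truncated or mismatched download.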
nf-core/fetchngs ⭐ is a fully-featured pipeline to download reads and associated metadata. It's a fantastic and regularly updated tool. Because it sometimes failed for me for reasons related to its complexity, I made this minimal pipeline as a backup plan.
- ffq to fetch URLs given the accessions, wrapped in ffq-sake.py, which retries if NCBI responds with "too many requests" but fails gracefully on a 400 error.
- wget to download the reads
- seqfu to collect stats
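The retry behavior described for ffq-sake.py can be illustrated with a generic sketch. This is not the actual wrapper: `run_with_retry` and the `fetch` callable (assumed to return a status code and a payload) are hypothetical, as are the retry count and the growing wait. Only the policy comes from the description above: retry on "too many requests" (HTTP 429), give up quietly on a 400 error.

```python
import time


def run_with_retry(fetch, max_retries=5, wait=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds, retrying only on HTTP 429.

    fetch is a hypothetical callable returning (status_code, payload).
    Returns the payload on success, or None if the request cannot succeed.
    """
    for attempt in range(1, max_retries + 1):
        status, payload = fetch()
        if status == 200:
            return payload
        if status == 400:
            # Bad request: retrying will not help, so give up without raising.
            return None
        if status == 429 and attempt < max_retries:
            # Too many requests: wait a little longer on each attempt.
            sleep(wait * attempt)
            continue
        # Any other status, or 429 on the last attempt: stop trying.
        break
    return None
```

Returning `None` instead of raising lets a caller skip a failed accession and continue with the rest of the list, which matches the "fails gracefully" behavior described above.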
If you use this pipeline, please cite:
- Gálvez-Merchán, Á., et al. (2023). Metadata retrieval from sequence databases with ffq. Bioinformatics
- Telatin, A., et al. (2020). SeqFu: A Suite of Utilities for the Robust and Reproducible Manipulation of Sequence Files. Bioengineering