Pantagruel: a bioinformatic pipeline for the inference of gene evolution scenarios in bacterial pangenomes.
Due to the interruption of free support for continuous builds on DockerHub, the dependency images for Pantagruel are now also hosted on Quay.io.
docker pull quay.io/flass/pantagruel-dep:master
The worries of installing Pantagruel are in the past! Now you just have to download the Docker image from the DockerHub repository!
docker pull flass/pantagruel-dep:master
See details in the Docker section of the INSTALL page. Note that the image is quite large (2.05 GB), so make sure you have enough space on your machine (and on the filesystem partition that hosts the Docker client). Also, InterProScan (used in task 04) is NOT included in the Docker image, see here.
The Dockerfile (a 'recipe' to generate a Docker image) is now ready!
Use it to generate a Docker image containing all Pantagruel dependencies; then use that image to run pantagruel
commands within the environment provided by the container, as explained here.
NB: this may be used to generate a Singularity container as well, although this has not been tested yet.
The preprint describing the software and underlying methods is available on the bioRxiv preprint server at:
https://www.biorxiv.org/content/10.1101/586495v3.
If using this software, please cite:
Lassalle F, Veber P, Jauneikaite E, Didelot X. Automated Reconstruction of All Gene Histories in Large Bacterial Pangenome Datasets and Search for Co-Evolved Gene Modules with Pantagruel. bioRxiv 586495. doi: 10.1101/586495.
Pantagruel provides an all-in-one software solution to reconstruct the complex evolutionary process of diversification of bacterial genomes.
From a dataset of bacterial genomes, it builds a database that describes the homology structure of all genes in the dataset -- the pangenome. With that, Pantagruel will first generate two key outputs:
- a reference (species) tree, depicting the main signal of evolutionary relationships between input genomes;
- gene trees, depicting the evolutionary relationships between gene sequences within a family of homologous genes.
A scenario of gene evolution is then inferred for each gene family in the dataset by reconciling the topology of gene trees with the reference species tree in a probabilistic framework.
Such a scenario describes the likely events of gene duplication, horizontal transfer and loss (DTL model) that marked the gene family history. These events are annotated on the branches of the gene tree and of the reference tree, and make their histories consistent. From these annotations, one can derive the history of gain and loss of this gene family over the reference tree of species, and follow the diversification of gene lineages within and across genome lineages.
Gene tree/species tree reconciliation methods rely on the best available information to infer such scenarios, as they account for the phylogeny of genes; a probabilistic method is chosen to quantify the statistical support for inferences, in the face of the large space of possible scenarios, and to account for the uncertainty in the input gene phylogeny.
While probabilistic reconciliation methods are computationally costly, this pipeline uses innovative phylogenetic approaches based on the reduction of gene trees to their informative backbone, allowing their use in a reasonable time on datasets of 1,000+ bacterial genomes covering multiple species.
Gene history data are then gathered in the database, which provides a way to:
- quantify gene-to-gene co-evolution, i.e. a gene association score based on the evolutionary events (notably HGT) they shared (quantified at the gene lineage level rather than the whole gene family);
- classify genes into orthologous clusters based on the gain/loss scenarios, from which one can define clade-specific gene sets.
Two versions of the pipeline are distributed:
- a script version, whose source code is adaptable and can be deployed on high-performance computing (HPC) "cluster" Linux systems;
- (in development) a pre-compiled Docker image that can be deployed on pretty much any platform, including swarms of virtual machines (VMs). Future versions will be implemented using Philippe Veber's Bistro framework.
See below for instructions on software installation and usage. Impatient users can go directly to running examples.
The pipeline can be run using a single interface to deploy the several arms of the pipeline.
It first requires initiating the Pantagruel database, i.e. giving it a name, creating the base file structure and defining the main options.
The generic syntax at the init stage is as follows, passing the key parameters using options:
pantagruel -d db_name -r root_dir {-A refseq_assembly_folder | -a custom_assembly_folder} [other options] init
Alternatively, the various parameters can be set directly in a Pantagruel configuration file specified with option -i:
pantagruel -d db_name -r root_dir -i config_file init
The configuration file config_file can be generated by editing a copy of the template environment script. Note that it is only safe to edit the top parameters.
Then, the pipeline can be run step-by-step by performing a specified task:
pantagruel -i config_file TASK
with TASK to be picked among the following (equivalent digit/number/keywords are separated by a |):
0|00|fetch|fetch_data
fetch public genome data from NCBI sequence databases and annotate private genomes;
also run a quick genome-to-genome distance estimation with MASH
1|01|homologous|homologous_seq_families
classify protein sequences into homologous families
2|02|align|align_homologous_seq
align homologous protein sequences and translate alignments into coding sequences
3|03|sqldb|create_sqlite_db
initiate SQL database and load genomic object relationships
4|04|functional|functional_annotations
use InterProScan to functionally annotate proteins in the database, including with Gene Ontology and metabolic pathway terms
5|05|core|core_genome_ref_tree
select core-genome markers and compute reference tree
6|06|genetrees|gene_trees
compute gene trees
7|07|reconciliations
compute species tree/gene tree reconciliations
8|08|specific|clade_specific_genes
classify genes into orthologous groups (OGs) and search clade-specific OGs
9|09|coevolution
quantify gene co-evolution and build gene association network
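The equivalence between digits, numbers and keywords listed above can be pictured as a simple lookup table. The following is only an illustrative sketch, not Pantagruel's own argument parser; the function name `task_number` is hypothetical.

```shell
# Illustrative sketch only: map any accepted task identifier to its two-digit number.
# 'task_number' is a hypothetical helper, not part of Pantagruel.
task_number() {
  case "$1" in
    0|00|fetch|fetch_data)                   echo 00 ;;
    1|01|homologous|homologous_seq_families) echo 01 ;;
    2|02|align|align_homologous_seq)         echo 02 ;;
    3|03|sqldb|create_sqlite_db)             echo 03 ;;
    4|04|functional|functional_annotations)  echo 04 ;;
    5|05|core|core_genome_ref_tree)          echo 05 ;;
    6|06|genetrees|gene_trees)               echo 06 ;;
    7|07|reconciliations)                    echo 07 ;;
    8|08|specific|clade_specific_genes)      echo 08 ;;
    9|09|coevolution)                        echo 09 ;;
    *) echo "unknown task: $1" >&2; return 1 ;;
  esac
}

# The three identifiers below all designate the same task (02).
for t in 2 align align_homologous_seq; do
  task_number "$t"
done
```

Any of the equivalent identifiers can thus be used interchangeably on the pantagruel command line.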
Alternatively, several tasks can be run at once by providing a space-separated string of task identifiers:
pantagruel -i config_file TASK1 TASK2 ...
Finally, it is possible to run the whole pipeline at once by simply performing the all task:
pantagruel -i config_file all
Note there are dependencies between tasks, which must be carried out mostly sequentially:
- 00, 01, 02, 03 tasks each strictly depend on the previous step: 00 -> 01 -> 02 -> 03
- functional annotation task 04 is optional - though highly recommended - and depends on the previous task 01: 01 -> 04
- reference tree task 05 depends on previous task 03 (and thus the previous ones)
- gene trees task 06 only depends on the previous task 03 (and thus the previous ones) IF the -c|--collapse option is NOT used: 03 -> 06
- however, if the -c option is specified, task 06 (specifically step 6.4 when in HPC mode) is also dependent on task 05: 03 + 05 -> 06
- gene tree/species tree reconciliation task 07 strictly depends on the previous steps: 05 + 06 -> 07
- orthologous group clustering task 08 depends on previous reconciliation step 07: 07 -> 08
- co-evolution network task 09 depends on previous reconciliation task 07: 07 -> 09
- but if run after task 08, an additional version of the co-evolution network will be made by collapsing the full network, grouping gene nodes by orthologous group: 07 + 08 -> 09
So all in all, you're better off running all the tasks sequentially, for instance using pantagruel all.
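The dependencies listed above can be summarised as a small lookup of direct prerequisites. This is only an illustrative sketch; `task_prereqs` is a hypothetical helper that is not part of Pantagruel, and its second argument is an assumed way to mimic the effect of the -c option on task 06.

```shell
# Illustrative sketch only: direct prerequisites of each task, as listed above.
# 'task_prereqs' is a hypothetical helper; pass "collapse" as second argument
# to mimic the effect of the -c option on task 06.
task_prereqs() {
  local task="$1" collapse="${2:-}"
  case "$task" in
    00) echo "" ;;
    01) echo "00" ;;
    02) echo "01" ;;
    03) echo "02" ;;
    04) echo "01" ;;
    05) echo "03" ;;
    06) if [ "$collapse" = "collapse" ]; then echo "03 05"; else echo "03"; fi ;;
    07) echo "05 06" ;;
    08) echo "07" ;;
    09) echo "07" ;;
    *)  echo "unknown task: $task" >&2; return 1 ;;
  esac
}

task_prereqs 07            # prints: 05 06
task_prereqs 06 collapse   # prints: 03 05
```

Task 09 is listed with its strict prerequisite only; running it after task 08 additionally yields the network collapsed by orthologous group.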
Importantly, it is recommended to use the alternative interface provided by HPC scripts to run the intensive tasks 06 and 07 on high-performance computing (HPC) clusters for efficient and time-tractable computing; see the HPC scripts doc below.
Here is a standard example of using the pantagruel program.
First, to create a new database, we need to run the init task. To pass the key parameters, including where to create the database and its name, we will be using options:
pantagruel -d databasename -r /root/folder/for/database -f PANTAGFAM -I [email protected] \
-A /folder/of/public/genome/in/RefSeq/format init
or
pantagruel -d databasename -r /root/folder/for/database -f PANTAGFAM -I [email protected] \
-L list_of_RefSeq_accession_ids -a /folder/of/custom/genomes init
Then, to actually run the pipeline, we will execute the subsequent tasks.
At this stage, no options need to (or can) be specified through the command line, as all parameters were already defined
at the database initiation stage (see above) and stored in a configuration file. You now simply have to specify where to find this configuration file with the -i option.
Unless you moved it, the configuration file should be where it has been created automatically,
at ${root_dir}/${db_name}/environ_pantagruel_${db_name}.sh, with ${db_name} and ${root_dir} the arguments of the -d and -r options on the pantagruel init call.
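As a minimal sketch of the path convention just described, the snippet below rebuilds the default configuration file location from the values given to -r and -d. The helper name `config_path` is hypothetical and only serves the illustration.

```shell
# Illustrative sketch: rebuild the default configuration file path from the
# values given to -r and -d at the init stage. 'config_path' is a hypothetical helper.
config_path() {
  local root_dir="$1" db_name="$2"
  echo "${root_dir}/${db_name}/environ_pantagruel_${db_name}.sh"
}

cfg="$(config_path /root/folder/for/database databasename)"
echo "$cfg"
# Warn early if the file is not where it is expected (e.g. if it was moved).
[ -f "$cfg" ] || echo "configuration file not found: $cfg" >&2
```

The resulting path is what should be passed to the -i option when running the subsequent tasks.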
So in our case, to execute the first three tasks, up to gene family sequence alignment, you can type the following command:
pantagruel -i /root/folder/for/database/databasename/environ_pantagruel_databasename.sh fetch homologous align
Note that this config file can be edited in between tasks, for instance to change the location of key input files that you moved, or to tweak parameters - however this may cause issues in task dependencies (see above).
Please note that Pantagruel is still under active development and may evolve rapidly to fix bugs and solve issues.
It is thus strongly recommended to update this software regularly using git pull && git submodule update in the pantagruel/ git repository folder.
- TIP!!: once the database environment is loaded (by doing source environ_pantagruel_databasename.sh), you can use the alias panup for easy software updates.
If Pantagruel is updated in between the running of tasks of a database project,
it is highly recommended to run the init task again (with the same options) before proceeding to the next steps,
to ensure the environment defined by the configuration file is compatible with the software.
Regenerating the configuration file will not erase other data.
A simple way to regenerate the configuration file under the same parameters is to use the command:
pantagruel -i previous_config_file --refresh init
Note that, for the evolutionary analyses to have any meaning at all, Pantagruel requires that you provide a minimum of four genomes as input, and ideally many more, as Pantagruel can easily deal with several hundred genomes.
This minimum number of four is to be split at the user's discretion between RefSeq-type and 'custom' assemblies, through the -A/-L and -a options, respectively.
Here is a view of what input data passed to Pantagruel should look like.
When specifying accession ids to be downloaded from NCBI Assembly FTP using -L|--refseq_list or --refseq_list4annot, the list file whose path is given as the option's argument should look like this:
# for instance if used the command `pantagruel -d databasename -r /root/folder -L ./list_of_RefSeq_accession_ids init`
cat list_of_RefSeq_accession_ids
# GCF_000156855
# GCF_001026115.1
# GCF_001088845.2
# ...
# as many rows as there are genomes to study
# Note the trailing '.z' (z = 1,2,3,...) indicating the assembly version is optional
# and if it is provided, it WILL BE IGNORED.
# This is because the LAST version of the accession will always be returned.
# If you really want to work with an outdated version, please download it yourself
# and feed it to Pantagruel using '-A' option.
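Since the version suffix is ignored anyway, a list can be checked and normalised before it is passed to -L. The sketch below is illustrative only and is not part of Pantagruel: `strip_accession_versions` is a hypothetical helper, and the pattern it applies (GCA_ or GCF_ followed by nine digits and an optional version) is inferred from the accession format shown in this document.

```shell
# Illustrative sketch: check that each row looks like an NCBI Assembly accession
# (GCA_/GCF_ followed by nine digits, with an optional '.version') and print it
# without the version suffix. 'strip_accession_versions' is a hypothetical helper.
strip_accession_versions() {
  # Refuse the whole list if any non-empty row has an unexpected format.
  if grep -v '^$' "$1" | grep -Evq '^GC[AF]_[0-9]{9}(\.[0-9]+)?$'; then
    echo "unexpected accession format in $1" >&2
    return 1
  fi
  # Drop empty rows and the optional '.version' suffix.
  grep -v '^$' "$1" | sed -E 's/\.[0-9]+$//'
}

# Example with a temporary list file.
tmp="$(mktemp)"
printf 'GCF_000156855\nGCF_001026115.1\nGCF_001088845.2\n' > "$tmp"
strip_accession_versions "$tmp"
rm -f "$tmp"
```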
When using assemblies downloaded from NCBI RefSeq using options -A|--refseq_ass or --refseq_ass4annot, the folder whose path is given as the option's argument should have a content looking like this:
# for instance if used the command `pantagruel -d databasename -r /root/folder -A ./assemblies init`
ls -AF assemblies/
# GCF_000156855.2_ASM15685v2/
# GCF_001026115.1_ASM102611v1/
# GCF_001088845.1_8727_5_60/
# ...
# as many separate assembly folders as there are genomes to study
ls -AF assemblies/GCF_001026115.1_ASM102611v1/
# GCF_001026115.1_ASM102611v1_assembly_report.txt GCF_001026115.1_ASM102611v1_protein.faa.gz
# GCF_001026115.1_ASM102611v1_assembly_stats.txt GCF_001026115.1_ASM102611v1_protein.gpff.gz
# GCF_001026115.1_ASM102611v1_cds_from_genomic.fna.gz GCF_001026115.1_ASM102611v1_rna_from_genomic.fna.gz
# GCF_001026115.1_ASM102611v1_feature_count.txt.gz GCF_001026115.1_ASM102611v1_translated_cds.faa.gz
# GCF_001026115.1_ASM102611v1_feature_table.txt.gz GCF_001026115.1_ASM102611v1_wgsmaster.gbff
# GCF_001026115.1_ASM102611v1_genomic.fna.gz GCF_001026115.1_ASM102611v1_wgsmaster.gbff.gz
# GCF_001026115.1_ASM102611v1_genomic.gbff annotation_hashes.txt
# GCF_001026115.1_ASM102611v1_genomic.gbff.gz assembly_status.txt
# GCF_001026115.1_ASM102611v1_genomic.gff.gz md5checksums.txt
# each folder contains a set of files with all the assembly and annotation information
When providing your own 'custom' genomes, the folder whose path is given as the argument of option -a|--custom_ass should have a content looking like this:
# for instance if used the command `pantagruel -d databasename -r /root/folder -a ./user_genomes init`
ls -AF user_genomes/
# contigs/
# annotation/
# strain_infos_databasename.txt
ls -AF user_genomes/contigs/
# B03_1.fasta
# D03_1.fasta
# F03_1.fasta
# ...
There should be as many separate genomic FASTA files in user_genomes/contigs/ as there are genomes to study.
ls -AF user_genomes/annotation/
# B03_1/
# C03_1/
# D03_1/
# ...
There can be an arbitrary number of annotation folders; only those whose names match a genomic FASTA file in contigs/ will be considered.
ls -AF user_genomes/annotation/B03_1/
# Rhizobium_endolithicum_Q54.fna
# Rhizobium_endolithicum_Q54.ffn
# Rhizobium_endolithicum_Q54.faa
# Rhizobium_endolithicum_Q54.gff
The file names within each annotation folder do not matter, only their extensions.
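The matching rules for annotation folders can be checked before running the init task. The following is a rough sketch under stated assumptions: `check_annotation_folders` is a hypothetical helper, a contig file is assumed to match when its name is the folder name followed by an extension starting with '.fa', and only the presence of a .gff file is checked.

```shell
# Illustrative sketch: list the annotation folders and report whether each one
# matches a FASTA file in contigs/ and contains a .gff file.
# 'check_annotation_folders' is a hypothetical helper, not part of Pantagruel.
check_annotation_folders() {
  local custom_dir="$1" annot name
  for annot in "$custom_dir"/annotation/*/; do
    [ -d "$annot" ] || continue
    name="$(basename "$annot")"
    # assumed rule: a matching contig file is named '<folder name>.fa*'
    if ls "$custom_dir"/contigs/"$name".fa* >/dev/null 2>&1; then
      if ls "$annot"*.gff >/dev/null 2>&1; then
        echo "$name: ok"
      else
        echo "$name: missing .gff file"
      fi
    else
      echo "$name: ignored (no matching contig file)"
    fi
  done
}

check_annotation_folders ./user_genomes
```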
cat user_genomes/strain_infos_databasename.txt
# assembly_id genus species strain taxid locus_tag_prefix
# B03_1 Rhizobium endolithicum Q54 1335060 REQ54
# ...
There should be as many rows as FASTA files in the contigs/ folder, plus one header row.
Fields should be tab-separated; the header must contain these field names: assembly_id, genus, species, strain, taxid, locus_tag_prefix; their order does not matter.
The values in the assembly_id and locus_tag_prefix fields must be unique per row.
The assembly_id field value must match the beginning of the contig file name and match exactly the annotation folder name.
The characters '-' and '_' are forbidden in the locus_tag_prefix field.
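The constraints above lend themselves to a quick automated check. The sketch below is not the validator shipped with Pantagruel; it is a minimal illustration using awk, and `check_strain_info` is a hypothetical helper name. It checks the header fields, the uniqueness of assembly_id and locus_tag_prefix, and the forbidden characters.

```shell
# Illustrative sketch, not the validator shipped with Pantagruel: check the header,
# the uniqueness of assembly_id and locus_tag_prefix, and the forbidden characters.
# 'check_strain_info' is a hypothetical helper; it exits non-zero on any problem.
check_strain_info() {
  awk -F '\t' '
    NF == 0 { next }                       # skip empty rows
    !header_done {
      header_done = 1
      for (i = 1; i <= NF; i++) col[$i] = i
      n = split("assembly_id genus species strain taxid locus_tag_prefix", req, " ")
      for (j = 1; j <= n; j++)
        if (!(req[j] in col)) { print "missing field: " req[j]; bad = 1 }
      if (bad) exit
      next
    }
    {
      id = $(col["assembly_id"]); tag = $(col["locus_tag_prefix"])
      if (id in seen_id)   { print "duplicate assembly_id: " id; bad = 1 }
      if (tag in seen_tag) { print "duplicate locus_tag_prefix: " tag; bad = 1 }
      if (tag ~ /[-_]/)    { print "forbidden character in locus_tag_prefix: " tag; bad = 1 }
      seen_id[id] = 1; seen_tag[tag] = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Usage: check_strain_info user_genomes/strain_infos_databasename.txt
```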
Options are detailed here:
# for Pantagruel task 0-9:
_only one mandatory option_
-i|--initfile path to Pantagruel configuration file.
this file is generated at init stage, from the specified options.
_facultative runtime options_
-F|--FORCE (no value) FORCE mode: will erase any pre-existing main folder for the task
(default: off, pre-existence of a folder will result in an early error)
-R|--resume (no value) try and resume the task from a previous run that was interrupted
(available for tasks 04-07)
-N|--threads specify the number of threads to use for (some) parallelizable tasks (defaults to the maximum available)
(available for tasks: 00-08)
Note 1: this does not apply to ALE/ecceTERA reconciliations,
which jobs always run sequentially due to their high memory needs.
Note 2: this does not apply to tasks run through HPC scripts,
which have their own interface to define the number of CPUs to be used.
-z|--compress will try and compress result files on the go (especially bulky files that won't be used much later).
This will induce possible extra decompressing/re-generating data computing time
when resuming a task run with -R; avoid using compression when likely to have to resume later
(for the moment only available for tasks: 00, 02 i.e. 'fetch', 'align')
-v|--verbose increase verbosity of logs (for the moment only affecting task 08, i.e. 'orthologs')
# for Pantagruel task init:
_mandatory options_
-d|--dbname string. database name
-r|--rootdir path to root directory where to create the database; defaults to current folder
It is also necessary to specify an input genome dataset!
This is possible via -a, -A or -L options, or a mixture of them.
_facultative options_
-i|--initfile path to Pantagruel configuration file.
a file can be derived (i.e. manually curated) from the 'environment_pantagruel_template.sh' template.
Parameters values specified in this file will override other options.
Can also be combined alone with --refresh to update the software version used for an existing database.
--refresh (no value) Use in combination with the -i option above to simply refresh the configuration file
(e.g. after an update of the software). The program will simply re-run the `pantagruel [options] init` command
that has been previously used to generate the config file; hence there is no need to repeat any other option
(even -d and -r options can be omitted if `pantagruel -i config_file --refresh init` is used).
Note that when options had quoted string arguments, unpredictable behaviour might occur;
please verify the outcome in the regenerated config file.
New options can be added _after_ the --refresh option to change the value of environment variables in the config file:
to set non-default values if not already, e.g. turn on collapsing:
`pantagruel -i config_file --refresh -c init`
to change the value of an already set option, e.g. switch to use GeneRax reconciliation method:
`pantagruel -i config_file --refresh -e GeneRax init`
to both turn on collapsing and switch to use GeneRax reconciliation method:
`pantagruel -i config_file --refresh -c -e GeneRax init`
Currently, only short options (e.g., -e or -n, NOT --rec_method or --collapse) are supported after --refresh.
General configuration options:
-I|--iam database creator identity (e-mail address is preferred)
-f|--famprefix alphanumerical prefix (no number first) of the names for homologous protein/gene family clusters; defaults to 'PANTAG'
the chosen prefix will be appended with a 'P' for protein families and a 'C' for CDS families.
--path_to_interproscan path to the InterProScan executable script, or to the folder containing an executable file named `interproscan`,
which itself should link to the script `interproscan.sh` that is found in the InterProScan software archive.
Defaults to the empty string, meaning that Pantagruel will look for the `interproscan` command in the $PATH.
Using this option is mandatory to execute task 04 when calling `pantagruel` through the docker image.
-u|--update_from the new pantagruel database will be an update from a template/reference pantagruel database found at this path.
Requirements:
1) The genome set covered by the template db should be included in the genome set of the new db
2) the gene family prefix tag provided by option -f should be the same
As a result of the update, all gene family identifiers will correspond to the same
families between the datasets (unless when they are unique to the new genomes).
Relevant files from the source database will be linked over or reused during
Pantagruel tasks 00 and 01; downstream tasks will be entirely computed de novo.
Input options:
-T|--taxonomy path to folder of taxonomy database flat files. Defaults to $rootdir/NCBI/Taxonomy_YYYY-MM-DD (suffix is today's date)
if this does not contain the expected files, triggers downloading the daily dump from NCBI Taxonomy at task 00
-A|--refseq_ass path to folder of source genome assembly __folders__ containing flat files formatted like NCBI Assembly RefSeq (no default value).
The assembly folders (one per genome) can be obtained on https://www.ncbi.nlm.nih.gov/assembly,
by making a keyword search to select a set of assemblies and downloading results with options:
Source Database = 'RefSeq' and File type = 'All file types (including assembly-structure directory)'.
A simple archive 'genome_assemblies.tar' (as obtained from the NCBI website) can be placed in that folder.
If user genomes are also provided, these RefSeq assemblies will be used as reference for their annotation.
--refseq_ass4annot folder path. Same principle as -A, but WILL NOT be used in the study, only as a reference to annotate user genomes
(defaults to combined value of -A and -L options)
-L|--refseq_list file path. Same as -A|--refseq_ass, but just provide a list of NCBI Assembly accession ids (file with one accession id per row)
Accession ids are of the form GCx_yyyyyyyyy.z with x = {A|F} for GenBank and RefSeq, respectively, and y and z are any digit.
These accessions will be fetched from the NCBI FTP site using lftp.
Note the LAST version of the accession will be returned, i.e. the trailing '.z' part of the accession id is ignored.
These assemblies are saved into a folder named after the value of the option:
for instance, `-L /path/to/assemblist` will save assembly folders in /path/to/assemblist_assemblies_from_ftp/.
--refseq_list4annot file path. Same principle as -L, but WILL NOT be used in the study, only as a reference to annotate user genomes
(defaults to combined value of -A and -L options)
-a|--custom_ass path to folder of user-provided genomes (no default value). The specified folder must contain:
_mandatory_
- a 'contigs/' folder, where are stored multi-FASTA files of genome assemblies (one file per genome,
with extension '.fa', '.fasta' or '.fas' ...). Fasta file names will be truncated by removing
the '.fa' string and everything occurring after it, and will be retained as the assembly_id (beware
of names redundant with RefSeq assemblies).
- a 'strain_infos_${databasename}.txt' TAB-delimited file describing the organism, with ${databasename} the value of option -d
columns should be headed with these fields (replace quotes and semicolons by tabs!):
'sequencing_project_id'; 'genus'; 'species'; 'strain'; 'taxid'; 'locus_tag_prefix'
'sequencing_project_id' must match the name of a contig file (e.g. 'seqProjID.fasta')
'locus_tag_prefix' must match the prefix of ids given to CDS, proteins and genome regions (contigs)
in potentially provided annotation files (see below).
Note that for ensuring compatibility with dependencies (namely BioPython Nexus alignment parser
and ALE reconciliation program), the characters '-' and '_' are forbidden in the 'locus_tag_prefix' field. ' \/' characters are forbidden in the strain name.
Formatting of this file can be validated using scripts/validate_straininfo_file.py
_optional_
- an 'annotation/' folder, where are stored annotation folders, one per genome. Each genome annotation file set
must be stored in a separate folder, which name must match a contig file (e.g. 'seqProjID/' for 'seqProjID.fasta').
Alternatively, each folder can be provided as a .tar or .tar.gz archive containing the annotation folder
(e.g. 'seqProjID.tar' or 'seqProjID.tar.gz', containing the folder 'seqProjID/').
each genome annotation folder should contain:
- one mandatory file in GFF 3.0 format (with a '.gff' extension);
and optionally, the following files (with consistent ids!!):
- one in GenBank flat file format (with a '.gbk' extension);
- one in Fasta format containing CDS sequences (with a '.ffn' extension).
- one in Fasta format containing matching protein sequences (with a '.faa' extension).
If at least one of the .gbk, .ffn or .faa files is missing, all three will be derived from the .gff source.
NOTE: These four files are typically produced when using Prokka for annotation; thus to ensure proper parsing,
it is strongly advised that any provided annotation was generated with Prokka
NOTE: to ensure uniform annotation of the dataset, it is advised to let Pantagruel annotate the contigs (calling Prokka)
--strain_info path to an optional custom strain information file, provided in the same format as described above for 'strain_infos_${databasename}.txt'
This is only taken into account in the absence of option -a, i.e. only when using options -A or -L to specify input genomes
from GenBank/RefSeq (or genomes with compliant formats). This allows overriding the automated genome code generation.
-V|--env_var quoted string of the form: 'variable1=value1[,variable2=value2[,...]]'.
Will add these variables to the configuration file so they can be exported to the environment during tasks.
Can be useful to define custom values of generic variables, e.g. "refgenus=Escherichia,seqcentre=Sanger_Institute"
Output: core genome / reference phylogeny options:
-s|--pseudocore integer, float <=1.0 or string. The minimum number or fraction of genomes in which a gene family should be present
to be included in the pseudo-core genome, i.e. the gene set which alignments will be concatenated for reference tree search.
A non-numeric value will trigger an INTERACTIVE prompt for search of an optimal value at the beginning of task 'core'.
Defaults to the total number of genomes (strict core genome set).
-t|--reftree Newick format tree file path. Specifies a reference tree for reconciliation and clade-specific gene analyses;
cancels the computation of tree from the concatenate of (pseudo-)core genome gene during task 'core'.
--core_seqtype {cds|prot} defines the type of sequence that will be used to compute the (pseudo-)core genome tree (default to 'cds')
--pop_lg_thresh real. Defines the threshold of branch length for delineating populations in the reference tree
(default: 0.0005 for nucleotide alignment-based tree; 0.0002 for protein-based)
--pop_bs_thresh real. Defines the threshold of branch support for delineating populations in the reference tree (default: 80)
--rooting string. Defines the method to root the reference tree during task 5|core_genome_ref_tree.
Possible values are 'treebalance', 'MAD' and 'outgroup:SPECIESCODELIST' (default: 'treebalance'),
- 'treebalance' uses the '-f I' algorithm of RAxML to root the tree towards an optimal balance of branch lengths
on either side of the root;
- 'MAD' uses the minimal ancestor deviation method described in "Tria, et al. (2017) Nat. Ecol. Evol. 1, 0193".
- 'outgroup:SPECIESCODELIST' will root according to the specified outgroup(s), with SPECIESCODELIST a comma-separated list of species ids:
'outgroup:SPECIESCODE' for rooting with a single species
'outgroup:SPECIESCODE1,SPECIESCODE2,...' for multiple species (in which case their MRCA in the tree will be the outgroup)
Species ids can be either valid genome assembly ids of the relevant input genomes (typically a NCBI Assembly accession id),
or internal genome identifiers that are specifically in the Pantagruel database but often match the relevant Uniprot organism code.
The mapping between genome accession ids and organism codes is given in the file '03.database/genome_codes.tab' generated during task 3.
To use codes, you may thus want to run task 3 first, then run task init again with this option to regenerate the config file with
the desired outgroup organism codes and only then run task 5.
-S|--snp_aln reduce the core-genome alignment to SNPs
Output: gene trees / reconciliations options:
-H|--submit_hpc full address (hostname:/folder/location) of a folder on a remote high-performance computing (HPC) cluster server.
This indicates that computationally intensive tasks, including building the gene tree collection
('genetrees') and reconciling gene tree with species tree ('reconciliations') will be run
on an HPC server (only Torque/PBS and LSF job submission systems are supported so far).
[support for core genome tree building ('core') remains to be implemented].
Instead of running the computations, scripts for HPC cluster job array submission will be generated automatically.
Data and scripts will be transferred to the specified address (the database folder structure
will be duplicated there, but only relevant files will be synced). Note that job submission
scripts will need to be executed manually on the cluster server.
If set at init stage, this option will be maintained for all tasks. However, the remote address
can be updated when calling a specific task; string 'none' cancels the HPC behaviour.
-c|--collapse (no value) Enable collapsing the rake clades in the gene trees (strongly recommended in datasets of size >50 genomes).
-n|--no_collapse (no value) disable collapsing the rake clades in the gene trees
(default; use this option in combination with -i --refresh to restore default behaviour when -c was used in previous runs).
-C|--collapse_par quoted string. Specifies parameters for collapsing the rake clades in the gene trees.
A single-quoted, semicolon-delimited string containing variable definitions must be provided.
Default is equivalent to providing the following string:
'cladesupp=70 ; subcladesupp=35 ; criterion=bs ; withinfun=median'
-e|--rec_method {ALE|ecceTERA} choose the method to reconcile gene trees and the species tree.
ALE (default): a probabilistic method to sample gene Duplication, Transfer and Loss (DTL) scenarios
by amalgamating the likelihood of bayesian samples of trees (doi:10.1093/sysbio/syt003;doi:10.1093/sysbio/syt054).
The likelihood-based approach can be heavy in memory use (several 10GB for one gene family scenario) and computation time.
The option '-c' (to collapse gene trees prior to reconciliation) efficiently mitigates this issue as it generally reduces
the compute time to minutes; it is highly recommended when the dataset size grows (>50 bacterial genomes).
ecceTERA: a parsimony method to sample gene DTL scenarios by amalgamating the likelihood of bayesian samples of trees
under a model and procedure similar to ALE (doi:10.1093/bioinformatics/btw105).
The parsimony approach allows the use of this method on large-scale datasets within a reasonable time and using little memory
without having to resort to gene tree collapsing with option '-c' (but using it is possible and would make reconciliation even faster).
-g|--genefam_list Path to gene family list file. Restricts the computation of gene trees and all subsequent analyses to a list of gene families.
This impacts all tasks from 06 onward. The list has to be one gene family identifier per line.
Gene family ids have to refer to existing ones in the database, and therefore can only be defined after the running of task 02.
It is therefore advised to first run the pipeline up to task 02 (or equally up to 05) without this option,
and then to set this parameter for the downstream computations.
This can be done by editing the value of 'genefamlist' variable in the configuration file or by using:
pantagruel -i configfile --refresh -g genelist init (note it is important that -g option be placed after the --refresh option)
Reverting to the exhaustive computation behavior can be done similarly by setting 'genefamlist' variable to an empty value or by using:
pantagruel -i configfile --refresh -g '' init
-p|--pref_genomes Comma-separated list of genome codes to use as preferred representative in the listing of genes in clade-specific gene lists during task 08
Output: Gene co-evolution options:
-q|--max_event_age Oldest relative age on the species tree (real value between 0.0 = tips and 1.0 = root) under which events will be considered to compute co-evolution scores
and to build the gene co-evolution network. Deeper branches of the species tree are often long and agglomerate long evolutionary periods into one time point.
As a result, gene histories involving old events mapped to these deep branches will be more likely to correlate in an unspecific way.
Default value is 0.5, meaning events older than half the height of the ultrametric species tree are not considered for co-evolution scoring.
# for any Pantagruel command calls:
-h|--help print this help message and exit.
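As an illustration of the string format expected by the -V|--env_var option described among the options above, the sketch below splits such a string into one export line per variable. This only illustrates the 'variable1=value1,variable2=value2' format; how Pantagruel actually writes these variables to the configuration file is an assumption, and `env_var_to_exports` is a hypothetical helper.

```shell
# Illustrative sketch: split a -V/--env_var style string into one
# "export name='value'" line per variable. The way Pantagruel itself stores
# these variables is not shown here; 'env_var_to_exports' is hypothetical.
env_var_to_exports() {
  echo "$1" | tr ',' '\n' | while IFS='=' read -r name value; do
    [ -n "$name" ] && echo "export ${name}='${value}'"
  done
}

env_var_to_exports "refgenus=Escherichia,seqcentre=Sanger_Institute"
# prints:
# export refgenus='Escherichia'
# export seqcentre='Sanger_Institute'
```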
Tasks 06|genetrees and 07|reconciliations are computationally intensive due to the use of Bayesian algorithms and to the sheer number of homologous gene families for which a tree and an evolution scenario need to be computed.
Thankfully, most academic institutions nowadays give you access to an HPC cluster, which provides:
- high-efficiency compute nodes for demanding tasks (reconciliation can be very memory hungry, exceeding 100GB for complex gene family scenarios and large species trees);
- an interface to submit many similar individual jobs as arrays of jobs. The structure of data handled by Pantagruel (many gene families expecting the same computational treatment) lends itself perfectly to this sort of computing infrastructure.
It is therefore highly recommended to use an HPC cluster for these intensive tasks if you can, especially if you have a dense dataset (or want it done quickly).
For this purpose, the Pantagruel package provides an alternative to the main interface, using shell scripts for submission of jobs to the HPC cluster. So far, only Torque/PBS and IBM LSF cluster systems are supported. Tasks are broken down into steps, as every step within a task needs to be completed for all gene families (at least those you want to include in downstream analyses) before things are carried forward.
Here is how to proceed:
- first, you should refresh your configuration file so as to include the HPC parameters through the -H option:
pantagruel -i previous_config_file --refresh -H hpchost:/where/you/will/set/your/database init
- then run task 06 as you would normally do:
pantagruel -i previous_config_file 06
This will generate the list of gene families for which trees are to be computed, then send the files to the designated location on the HPC server.
This copy step will only work provided you can connect via ssh to that host without specifying the remote user account or typing in a password, i.e. that the simple command ssh hpchost would get you logged into your account without prompting you for a password.
This is easily achieved i) by using secure (e.g. RSA) key pairs with no passphrase to log in between local and remote hosts (see SSH documentation; key-based login is more secure than password login anyway)
and ii) by using an SSH config file (located in your local home folder as ~/.ssh/config, see ssh_config documentation) to describe the log-in details (remote user name, etc.).
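As an illustration of point ii), here is a minimal sketch that prints an ssh_config stanza for the HPC host. The helper name print_ssh_stanza and all values in the example call (host name, user name, key path) are placeholders to be replaced with your own details; only the Host, HostName, User and IdentityFile keywords are standard ssh_config keywords.

```shell
#!/bin/sh
# Hypothetical helper: print an ssh_config stanza for the HPC login host.
# Every value passed in is a placeholder to replace with your own details.
print_ssh_stanza() {
    alias_name="$1"   # short name typed on the command line, e.g. hpchost
    real_host="$2"    # full host name of the cluster login node
    remote_user="$3"  # your user name on the cluster
    key_file="$4"     # private key of the pair authorised on the cluster
    printf 'Host %s\n' "$alias_name"
    printf '    HostName %s\n' "$real_host"
    printf '    User %s\n' "$remote_user"
    printf '    IdentityFile %s\n' "$key_file"
}

# Example call with made-up values: review the printed stanza,
# then add it to ~/.ssh/config yourself.
print_ssh_stanza hpchost login.hpc.example.org jdoe '~/.ssh/id_rsa'
```

With such a stanza in place and the public key authorised on the cluster, the command ssh hpchost should log you in without asking for a user name or password.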
The task will then stop short and tell you what to do next:
please connect to remote host hpchost and execute the following scripts in order
(waiting for completion of all array jobs submitted by one script before executing the next):
- pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh [OPTIONS] ptg_config_file
- pantagruel_pipeline_06.2_HPC_collapse_gene_trees.sh [OPTIONS] ptg_config_file
- pantagruel_pipeline_06.3_HPC_bayesian_gene_trees.sh [OPTIONS] ptg_config_file
- pantagruel_pipeline_06.4_HPC_replace_spe_by_pops.sh [OPTIONS] ptg_config_file
- pantagruel_pipeline_06.5_HPC_populate_db_collapsed_clades.sh [OPTIONS] ptg_config_file
then copy back output files and updated database file by syncing the root folder from remote host to this host
Of course, Pantagruel should be installed on the HPC host!
Logging onto the HPC host, you should visit the folder where you copied your database, and open the database's configuration file to edit the value of the $ptgrepo environment variable so as to reflect where the pantagruel git repository has been cloned on that host.
Then, you can run the scripts sequentially (waiting for full completion in-between each step!) as indicated above.
Note that parameters relevant to the HPC submission can be specified for these scripts using options (each script comes with its own set of default values, for instance regarding requested resources like max memory allowance on the compute node):
ptgrepo=/where/you/cloned/pantagruel
ptgscripts=${ptgrepo}/scripts
# to see the options:
${ptgscripts}/pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh --help
# example of options, to specify that you will use the LSF system and request 32GB and 4 CPUs on each compute node, and 24h of maximum walltime use of the node:
${ptgscripts}/pantagruel_pipeline_06.1_HPC_full_ML_gene_trees.sh --mem 32 --ncpus 4 --wth 24 --hpctype 'LSF' ptg_config_file
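Once all the scripts have completed, the output files and the updated database file have to be synced back from the remote host. Pantagruel does not prescribe a tool for this; the sketch below assumes rsync over ssh, run from your local machine. The function name sync_back and the local path are hypothetical, and the example call only prints the command so that it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: mirror the database root folder from the HPC host back
# to the local machine with rsync (an assumed choice of tool).
# Pass "print" as third argument to display the command instead of running it.
sync_back() {
    # A trailing slash on the source makes rsync copy the folder's contents
    # rather than the folder itself; both paths are normalised to end in "/".
    remote="${1%/}/"
    local_dir="${2%/}/"
    mode="${3:-run}"
    if [ "$mode" = "print" ]; then
        echo "rsync -avz $remote $local_dir"
    else
        rsync -avz "$remote" "$local_dir"
    fi
}

# Example with placeholder paths; remove "print" to perform the transfer.
sync_back hpchost:/where/you/will/set/your/database /local/copy/of/your/database print
```

The remote location is the one given to the -H option when refreshing the configuration file; the local destination should be the root folder of your database on this host.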
This bioinformatic pipeline relies on quite a few other pieces of software.
To install them, please follow the indications in the INSTALL page.
Main options are to use either:
- the automated install_dependencies.sh shell script (for Debian systems; tested on Ubuntu 18.04)
- the Dockerfile to generate a Docker image containing all dependencies; use that image to run pantagruel commands within the environment provided by the container
- a Docker image pulled from the repository at Quay.io