This repository provides functionality to compute the binding residue similarity of sequences for specific protein families (FunFams, EC) and to calculate consensus predictions for proteins within one FunFam.
To calculate consensus predictions, binding residue predictions have to be calculate for all sequences within one FunFam. Residues that are predicted as binding for at least x% of the sequences are then considered as binding according to the consensus prediction with x being a parameter the user can choose.
As an example, BindPredict-CC and BindPredict-CCS that predict binding residues from evolutionary couplings using clustering coefficients or cumulative coupling scores are provided. However, binding residues can be predicted using any other method available.
FunFams [1] alignments can be obtained from the CATH webserver
- EVcouplings results (*.di_scores, *_CouplingScoresCompared_all.csv, *_frequencies.csv, *_alignment_statistics.csv): EVcouplings [2,3] is available as a Github repository. A detailed description on how to run EVcouplings can be found here. EVcouplings only provides EC scores inferred by plmDCA. To calculate DI scores, one can use
- FreeContact [4] which can be downloaded as a debian package. Using the alignment generated by EVcouplings, DI scores can be calculated using the option "evfold".
- the EVcouplings webserver. DI scores can be calculated by entering the UniProt identifier or the sequence in FASTA format and by choosing "DI" as coupling scoring.
scores.py
can be used to calculate ccs and cc for a given set of proteins from evolutionary coupling results calculated using mfDCA and has the following command line parameters:
-evc_folder
path to a directory containing EVcouplings results for each protein-fasta_folder
path to a directory containing FASTA sequences for each protein-id_file
path to a file with IDs for which ccs and cc should be calculated-ec
[evc|freecontact] parameter defining whether coupling scores were calculated using EVcouplings or Freecontact-out_folder
path to a directory where output should be written for (2 files per protein, one for ccs, one for cc)
similarity.py
has the following command line parameters:
-families
the path to a directory containing the FunFam dataset (with one sub-directory per superfamily).-sites
the path to a file with a mapping of UNIPROT IDs to binding site annotation (see /data).-groupby
[funfam|ec]
the way in which the sequences should be grouped for similarity computations.-limit
[funfam|ec] optional
the groups which should not occur multiple times within the group specified bygroupby
.-align
optional
the path to a directory in which data for the generation of multiple sequence alignments can be stored.-clustalw
optional
the command to call the external clustalw MSE tool, necessary only ifgroupby
== ec.
prediction.py
has the following command line parameters:
-consensus
the consensus cut-off at positions are classified as binding.-cc
the cut-off above which a position is classified as binding by its clustering coefficient.-ccs
the cut-off above which a position is classified as binding by its cumulative coupling score.-uniprot_ids
path to a file containing all UNIPROT IDs for which data is available, one id per line.-mapping
path to a file with a mapping of FunFams to UNIPROT IDs.-evc_info
path to a directory with output files from EVcouplings (_final.outcfg, .alignment_statistics.csv), FreeContact (.di) as well as bindPredict (.cum_scores, .cluster_coeff). Data for each UNIPROT ID ought to be in a seperate subdirectory.-funfam_data
path to a file in FASTA FunFam format including mapped binding sites for each entry.-families
the path to a directory containing the FunFam dataset (with one sub-directory per superfamily).-out
the path to a directory to which output files will be written.
[1] Sillitoe I, Cuff AL, Dessailly BH, Dawson NL, Furnham N, Lee D, Lees JG, Lewis TE, Studer RA, Rentzsch R: New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures. Nucleic Acids Research 2012, 41(D1):D490-D498.
[2] Marks, D. S., Colwell, L. J., Sheridan, R., Hopf, T. A., Pagnani, A., Zecchina, R., Sander, C. (2011). Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS ONE 6(12): e28766.
[3] Hopf, T. A., Schärfe, C. P. I., Rodrigues, J. P. G. L. M., Green, A. G., Kohlbacher, O., Sander, C., Bonvin, A. M. J. J., Debora S Marks, D.S. (2014) Sequence co-evolution gives 3D contacts and structures of protein complexes. eLife; 3:e03430
[4] Kaján L., Hopf T. A., Kalaš M., Marks D. S., Rost B. (2014) FreeContact: fast and free software for protein contact prediction from residue co-evolution.. BMC Bioinformatics 15:58