How updated is PHEN column in the result sample.RARE_PASS_GENE.xlsx file? #79

Open
nswh opened this issue Nov 20, 2024 · 3 comments

nswh commented Nov 20, 2024

The PHEN column in the result file sample.RARE_PASS_GENE.xlsx combines OMIM disease-gene information with the Orphanet nomenclature of rare diseases. Here is an example:

GENES	PHEN
MPZ,SDHC	MPZ: MIM -  ROUSSY-LEVY HEREDITARY AREFLEXIC DYSTASIA;  CHARCOT-MARIE-TOOTH DISEASE, AXONAL, TYPE 2I; CMT2I;  CHARCOT-MARIE-TOOTH DISEASE, DEMYELINATING, TYPE 1B; CMT1B;  NEUROPATHY, CONGENITAL HYPOMYELINATING OR AMYELINATING, AUTOSOMAL;  CHARCOT-MARIE-TOOTH DISEASE, AXONAL, TYPE 2J; CMT2J;  HYPERTROPHIC NEUROPATHY OF DEJERINE-SOTTAS;  CHARCOT-MARIE-TOOTH DISEASE, DOMINANT INTERMEDIATE D; CMTDID;  ADIE PUPIL|SDHC: MIM -  PARAGANGLIOMAS 3; PGL3;  PARAGANGLIOMA AND GASTRIC STROMAL SARCOMA
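
For anyone who wants to inspect the PHEN column programmatically, here is a minimal Python sketch that splits one PHEN cell into a per-gene list of terms. The layout it assumes (gene blocks separated by `|`, a `GENE:` label, a source prefix such as `MIM -`, and terms separated by `;`) is inferred only from the single example row above and may not hold for every row. The function name `parse_phen` is hypothetical and not part of ClinSV.

```python
def parse_phen(cell):
    """Split a PHEN cell into {gene: [term, ...]}.

    Assumed layout (inferred from one example row, not from ClinSV docs):
    "GENE: SOURCE - term; term|GENE: SOURCE - term".
    """
    result = {}
    for block in cell.split("|"):
        block = block.strip()
        if not block:
            continue
        # Everything before the first colon is taken as the gene symbol.
        gene, _, rest = block.partition(":")
        # Drop a leading source prefix such as "MIM -" when one is present.
        _source, sep, terms = rest.partition(" - ")
        if not sep:
            terms = rest
        result[gene.strip()] = [t.strip() for t in terms.split(";") if t.strip()]
    return result


example = (
    "MPZ: MIM -  ADIE PUPIL|"
    "SDHC: MIM -  PARAGANGLIOMAS 3; PGL3;  PARAGANGLIOMA AND GASTRIC STROMAL SARCOMA"
)
for gene, terms in parse_phen(example).items():
    print(gene, terms)
```

Note that this simple split treats abbreviations such as PGL3 as separate terms rather than attaching them to the preceding disease name.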

I can see that the combined information in the PHEN column is derived from the ClinSV reference data, specifically a file named ensemble_GRCh37_2_phen.txt, shown in the listing below. Its date is Nov 7 2019.

├── refdata-b38
│   ├── annotation
│   │   ├── 1kG_estd219.bed.gz
│   │   ├── 1kG_estd219.bed.gz.tbi
│   │   ├── DGV_GRCh38_hg38_variants_2020-02-25.bed.gz
│   │   ├── DGV_GRCh38_hg38_variants_2020-02-25.bed.gz.tbi
│   │   ├── ensemble_GRCh37_2_phen.txt
│   │   ├── Homo_sapiens.GRCh38.99.gff.gz
│   │   ├── Homo_sapiens.GRCh38.99.gff.gz.tbi
│   │   ├── Hs-gene-labels.txt
│   │   ├── Hs-gene-to-phenotype.txt
│   │   ├── MGRB-SV.bed.gz
│   │   └── MGRB-SV.bed.gz.tbi

Could you clarify whether Nov 7 2019 is when the OMIM and Orphanet information was last updated? And why is the file named GRCh37 instead of GRCh38? Is it because the transcripts and genes do not change between genome builds? Also, do you have an estimate of when ensemble_GRCh37_2_phen.txt will next be updated? If there is no plan to update it, could you describe which resource files you downloaded and what procedure you followed to create ensemble_GRCh37_2_phen.txt?

@J-Bradlee
Collaborator

Hi @nswh

It looks like "ensemble_GRCh37_2_phen.txt" was generated with BioMart on the Ensembl website.

It selects these attributes:
[Screenshot of the selected BioMart attributes, 2024-12-09]

Then click Results and download the TSV version.

Two of the column names have changed since ensemble_GRCh37_2_phen.txt was originally created (probably on Nov 7, 2019, as you mentioned):

  • Ensembl Gene ID -> Gene stable ID
  • Associated Gene Name -> Gene Name

Attached is the TSV I made on 2/12/24 following the procedure above. I have also re-mapped the changed column names back to the old column names. Google drive link to file.
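
The re-mapping of the header names can be sketched in a few lines of Python. This is only an illustration of the step described above, not the script that produced the attached file. The function name `restore_legacy_headers` is hypothetical, the input is assumed to use `\n` line endings, and the lowercase variant "Gene name" is included as an assumption in case BioMart capitalises the header differently from what is written in this thread.

```python
# New BioMart header -> header used in the original ensemble_GRCh37_2_phen.txt.
LEGACY_HEADERS = {
    "Gene stable ID": "Ensembl Gene ID",
    "Gene Name": "Associated Gene Name",
    "Gene name": "Associated Gene Name",  # assumption: alternative capitalisation
}


def restore_legacy_headers(tsv_text):
    """Rename columns in the first line of a TSV; data rows are left untouched."""
    header, newline, body = tsv_text.partition("\n")
    columns = [LEGACY_HEADERS.get(name, name) for name in header.split("\t")]
    return "\t".join(columns) + newline + body


sample = "Gene stable ID\tPhenotype description\tGene Name\nENSG00000158887\tADIE PUPIL\tMPZ\n"
print(restore_legacy_headers(sample), end="")
```

Only the header line is rewritten, so a data value that happens to equal one of the new column names is not altered.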

There are significantly more annotations in this one: 177k vs 32k in the original file.

ClinSV needs to update its annotation resource files. I can write a script that pulls this annotation on the fly from BioMart.
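
One possible shape for such a script, as a rough Python sketch rather than a tested ClinSV component: it builds a BioMart XML query and posts it to the Ensembl martservice endpoint. The attribute names are the ones in the BioMart-generated Perl script in this thread. The endpoint URL (main Ensembl site rather than the GRCh37 archive), the `datasetConfigVersion` value, the timeout, and the function names are assumptions. Which Ensembl release or build the original file came from is not stated here.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: main Ensembl BioMart endpoint; a GRCh37 archive would need another host.
MARTSERVICE_URL = "https://www.ensembl.org/biomart/martservice"

ATTRIBUTES = [
    "ensembl_gene_id",
    "phenotype_description",
    "source_name",
    "external_gene_name",
    "study_external_id",
    "mim_gene_accession",
    "mim_morbid_description",
    "mim_gene_description",
]


def build_query_xml(dataset="hsapiens_gene_ensembl", attributes=ATTRIBUTES):
    """Return the BioMart XML query requesting a TSV with a header row."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "0",
        "datasetConfigVersion": "0.6",  # assumption
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    prolog = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return prolog + ET.tostring(query, encoding="unicode")


def fetch_annotation(url=MARTSERVICE_URL, timeout=300):
    """POST the query and return the TSV text (needs network access)."""
    data = urllib.parse.urlencode({"query": build_query_xml()}).encode("ascii")
    with urllib.request.urlopen(url, data=data, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(build_query_xml())  # inspect the query; call fetch_annotation() to download
```

The downloaded TSV would still need its headers mapped back to the old column names before ClinSV can use it.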

@J-Bradlee
Collaborator

J-Bradlee commented Dec 11, 2024

BioMart also provides the Perl script that generates this TSV using their API (just click the Perl tab).

Note: it doesn't do any of the mapping mentioned above, which is required to keep the annotation file headers consistent:

  • Ensembl Gene ID -> Gene stable ID
  • Associated Gene Name -> Gene Name

```perl
# An example script demonstrating the use of BioMart API.
# This perl API representation is only available for configuration versions >=  0.5 
use strict;
use BioMart::Initializer;
use BioMart::Query;
use BioMart::QueryRunner;

my $confFile = "PATH TO YOUR REGISTRY FILE UNDER biomart-perl/conf/. For Biomart Central Registry navigate to
						http://www.biomart.org/biomart/martservice?type=registry";
#
# NB: change action to 'clean' if you wish to start a fresh configuration  
# and to 'cached' if you want to skip configuration step on subsequent runs from the same registry
#

my $action='cached';
my $initializer = BioMart::Initializer->new('registryFile'=>$confFile, 'action'=>$action);
my $registry = $initializer->getRegistry;

my $query = BioMart::Query->new('registry'=>$registry,'virtualSchemaName'=>'default');

$query->setDataset("hsapiens_gene_ensembl");
$query->addAttribute("ensembl_gene_id");
$query->addAttribute("phenotype_description");
$query->addAttribute("source_name");
$query->addAttribute("external_gene_name");
$query->addAttribute("study_external_id");
$query->addAttribute("mim_gene_accession");
$query->addAttribute("mim_morbid_description");
$query->addAttribute("mim_gene_description");

$query->formatter("TSV");

my $query_runner = BioMart::QueryRunner->new();
############################## GET COUNT ############################
# $query->count(1);
# $query_runner->execute($query);
# print $query_runner->getCount();
#####################################################################


############################## GET RESULTS ##########################
# to obtain unique rows only
# $query_runner->uniqueRowsOnly(1);

$query_runner->execute($query);
$query_runner->printHeader();
$query_runner->printResults();
$query_runner->printFooter();
#####################################################################
```

@drmjc
Member

drmjc commented Jan 20, 2025

Did this new file work for you, @nswh?
