ukb_exomes

The results of this analysis are released in Google Cloud bucket gs://ukbb-exome-public/:

Main summary statistics MatrixTable:
- Variant-level results: gs://ukbb-exome-public/500k/results/variant_results.mt
- Gene-level results: gs://ukbb-exome-public/500k/results/results.mt
QC information annotated MatrixTable or Hail Table:
- Variant-level: gs://ukbb-exome-public/500k/qc/variant_qc_metrics_ukb_exomes_500k{.mt, .ht}
- Gene-level: gs://ukbb-exome-public/500k/qc/gene_qc_metrics_ukb_exomes_500k{.mt, .ht}
- Phenotype: gs://ukbb-exome-public/500k/qc/pheno_qc_metrics_ukb_exomes_500k.ht

We also provide the following derived datasets for convenience:

Gene-annotation group cumulative allele frequency table: gs://ukbb-exome-public/500k/qc/gene_caf_500k.ht

These files can be accessed by cloning this and the https://github.com/broadinstitute/ukbb_qc repo, import the ukbb_common python module and accessing them programmatically. We recommend using these functions, as they apply our QC metrics and include convenience metrics such as lambda GC.

%%bash
git clone https://github.com/broadinstitute/ukbb_qc
git clone https://github.com/Nealelab/ukb_exomes

from ukb_exomes import *
from ukbb_common import *

To read the original MatrixTables with 4529 phenotypes:

## Gene-level results
gene_mt = hl.read_matrix_table(get_results_mt_path(result_type='gene'))
## Variant-level results
var_mt = hl.read_matrix_table(get_results_mt_path(result_type='variant'))

To read the full MatrixTables with QC information annotated:

## Gene-level results
gene_mt = load_final_sumstats_table(result_type='gene', extension="mt")
## Variant-level results
var_mt = load_final_sumstats_table(result_type='variant', extension="mt")

To get the final QCed MatrixTables (Note that we have two options for test_type: skato and burden, which indicates which test the lambda GC used here were computed from):

## Gene-level results
gene_mt = get_qc_result_mt(result_type="gene", test_type="skato")
## Variant-level results
var_mt = get_qc_result_mt(result_type="variant", test_type="skato")

The basic summary statistics of the gene-based tests have the following schema:

----------------------------------------
Global fields:
    'coverage_min': int32
    'expected_AC_min': int32
    'n_var_min': int32
    'gene_syn_lambda_min': float64
    'pheno_lambda_min': float64
----------------------------------------
Column fields:
    'n_cases': int32
    'n_controls': int32
    'heritability': float64
    'saige_version': str
    'inv_normalized': str
    'trait_type': str
    'phenocode': str
    'pheno_sex': str
    'coding': str
    'modifier': str
    'n_cases_defined': int64
    'n_cases_both_sexes': int64
    'n_cases_females': int64
    'n_cases_males': int64
    'description': str
    'description_more': str
    'coding_description': str
    'category': str
    'expected_ac_col_filter': int64
    'lambda_gc_skat': float64
    'lambda_gc_burden': float64
    'lambda_gc_skato': float64
    'keep_pheno_skato': bool
    'keep_pheno_skat': bool
    'keep_pheno_burden': bool
    'keep_pheno_unrelated': bool
----------------------------------------
Row fields:
    'gene_id': str
    'gene_symbol': str
    'annotation': str
    'interval': interval<locus<GRCh38>>
    'markerIDs': str
    'markerAFs': str
    'total_variants': int32
    'Nmarker_MACCate_1': int32
    'Nmarker_MACCate_2': int32
    'Nmarker_MACCate_3': int32
    'Nmarker_MACCate_4': int32
    'Nmarker_MACCate_5': int32
    'Nmarker_MACCate_6': int32
    'Nmarker_MACCate_7': int32
    'Nmarker_MACCate_8': int32
    'CAF': float64
    'mean_coverage': float64
    'expected_ac_row_filter': int64
    'continuous_lambda_gc_skato': float64
    'continuous_lambda_gc_skat': float64
    'continuous_lambda_gc_burden': float64
    'categorical_lambda_gc_skato': float64
    'categorical_lambda_gc_skat': float64
    'categorical_lambda_gc_burden': float64
    'icd10_lambda_gc_skato': float64
    'icd10_lambda_gc_skat': float64
    'icd10_lambda_gc_burden': float64
    'annotation_lambda_gc_skato': float64
    'annotation_lambda_gc_skat': float64
    'annotation_lambda_gc_burden': float64
    'synonymous_lambda_gc_skato': float64
    'synonymous_lambda_gc_skat': float64
    'synonymous_lambda_gc_burden': float64
    'keep_gene_skato': bool
    'keep_gene_skat': bool
    'keep_gene_burden': bool
    'keep_gene_coverage': bool
    'keep_gene_expected_ac': bool
    'keep_gene_n_var': bool
----------------------------------------
Entry fields:
    'Pvalue': float64
    'Pvalue_Burden': float64
    'Pvalue_SKAT': float64
    'BETA_Burden': float64
    'SE_Burden': float64
    'Pvalue.NA': float64
    'Pvalue_Burden.NA': float64
    'Pvalue_SKAT.NA': float64
    'BETA_Burden.NA': float64
    'SE_Burden.NA': float64
    'total_variants_pheno': int32
    'expected_AC': float64
    'keep_entry_expected_ac': bool
----------------------------------------
Column key: ['trait_type', 'phenocode', 'pheno_sex', 'coding', 'modifier']
Row key: ['gene_id', 'gene_symbol', 'annotation']
----------------------------------------

The basic summary statistics of the variant-level tests have the following schema:

----------------------------------------
Global fields:
    'expected_AC_min': int32
    'pheno_lambda_min': float64
----------------------------------------
Column fields:
    'n_cases': int32
    'n_controls': int32
    'heritability': float64
    'saige_version': str
    'inv_normalized': str
    'trait_type': str
    'phenocode': str
    'pheno_sex': str
    'coding': str
    'modifier': str
    'n_cases_defined': int64
    'n_cases_both_sexes': int64
    'n_cases_females': int64
    'n_cases_males': int64
    'description': str
    'description_more': str
    'coding_description': str
    'category': str
    'expected_ac_col_filter': int64
    'lambda_gc_skat': float64
    'lambda_gc_burden': float64
    'lambda_gc_skato': float64
    'keep_pheno_skato': bool
    'keep_pheno_skat': bool
    'keep_pheno_burden': bool
    'keep_pheno_unrelated': bool
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'markerID': str
    'gene': str
    'annotation': str
    'call_stats': struct {
        AC: int32,
        AF: float64,
        AN: int32,
        homozygote_count: int32
    }
    'expected_ac_row_filter': int64
    'keep_var_expected_ac': bool
    'keep_var_annt': bool
----------------------------------------
Entry fields:
    'AC': int32
    'AF': float64
    'BETA': float64
    'SE': float64
    'AF.Cases': float64
    'AF.Controls': float64
    'Pvalue': float64
    'expected_AC': float64
    'keep_entry_expected_ac': bool
----------------------------------------
Column key: ['trait_type', 'phenocode', 'pheno_sex', 'coding', 'modifier']
Row key: ['locus', 'alleles']
----------------------------------------

Name	Name	Last commit message	Last commit date
Latest commit wlu04 Update README.md Dec 2, 2022 4571dd4 · Dec 2, 2022 History 289 Commits
R	R	create PDFs	May 4, 2022
hail	hail	all updates	Nov 16, 2022
resources	resources	all updates	Nov 16, 2022
utils	utils	Update results.py	Dec 2, 2022
.gitignore	.gitignore	adding gitignore	Dec 18, 2019
README.md	README.md	Update README.md	Dec 2, 2022
__init__.py	__init__.py	missing inits	Dec 18, 2019
saige_exomes.py	saige_exomes.py	all updates	Nov 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ukb_exomes

About

Releases 1

Packages

Contributors 3

Languages

Nealelab/ukb_exomes

Folders and files

Latest commit

History

Repository files navigation

ukb_exomes

About

Resources

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 3

Languages

Packages