Yet Another 16S rRNA database

ya16sdb is a pipeline for downloading, curating, and annotating a database of bacterial 16S rRNA sequences. This repository also implements a web application (https://ya16sdb.labmed.uw.edu/) that can be used to visualize the distance-based relationships among sequences for a given species.

The purpose of the project is to provide a high-quality source of bacterial 16S rRNA sequences that is up to date with NCBI, in a format that is useful as input for bioinformatics pipelines and tasks such as BLAST searching, phylogenetic reference set creation, and sequence-based taxonomic assignment.

Project information

This project is a product of ongoing research interests of Noah Hoffman (https://faculty.washington.edu/ngh2/home/pages/software.html) at the University of Washington in the Department of Laboratory Medicine.

Christopher Rosenthal is the primary author of the pipeline.

The pipeline heavily relies on taxtastic (https://github.com/fhcrc/taxtastic) and deenurp (https://github.com/fhcrc/deenurp), both of which began as collaborations with Erick Matsen at The Fred Hutchinson Cancer Research Center in Seattle, WA.

Please cite this project as:

Rosenthal C and Hoffman NG. 2019. ya16sdb: a pipeline for creating a collection of high-quality bacterial 16S rRNA sequences from NCBI. Version 0.6.1. University of Washington. https://github.com/nhoffman/ya16sdb

Overview

At a high level, this pipeline does the following:

1. Downloads annotations for all available sequence records from NCBI matching search terms for 16S rRNA.
2. Retrieves sequence records for corresponding full-length (or near full-length) 16S rRNA genes; this involves extracting subsequences from genome sequences or contigs.
3. Ensures that all records are 16S rRNA genes.
4. Ensures that sequences are in a consistent orientation.
5. Identifies the taxonomic lineage of each record.
6. Annotates records as a "type strain" (according to NCBI's definition of type strain), "published" (annotation has an accompanying PubMed ID), "refseq" (belonging to the Genbank refseq collection), or "direct" (direct submissions).
7. Discards records likely to be mis-annotated using deenurp filter-outliers.
8. Provides various subsets of annotated sequences. Each record subset provides sequence metadata, sequences, taxonomic lineages, and a blast database. For example:
   - only records with taxonomic names consistent with species-level classifications
   - type strains only
   - outliers removed
   - downsampled to a subset of sequences for each species, prioritizing type strains and "published" records
9. Stores record annotations in a single-table feather file.
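The downsampling subset described above can be illustrated with a short sketch. This is not the pipeline's own implementation: the function name `downsample`, the `per_species` parameter, and the example records are hypothetical. Only the field names (`seqname`, `species`, `is_type`, `is_published`) are taken from the annotation table described in this README.

```python
from collections import defaultdict


def downsample(records, per_species=2):
    """Keep at most `per_species` records per species (hypothetical helper).

    Within a species, type strains are kept first, then "published"
    records, then everything else; input order breaks ties.
    """
    by_species = defaultdict(list)
    for rec in records:
        by_species[rec["species"]].append(rec)

    kept = []
    for species in sorted(by_species):
        # False sorts before True, so negate the flags to move
        # type strains and published records to the front.
        ranked = sorted(
            by_species[species],
            key=lambda r: (not r["is_type"], not r["is_published"]),
        )
        kept.extend(ranked[:per_species])
    return kept


# Made-up example records; the species values stand in for tax ids.
records = [
    {"seqname": "A1", "species": "562", "is_type": False, "is_published": False},
    {"seqname": "A2", "species": "562", "is_type": False, "is_published": True},
    {"seqname": "A3", "species": "562", "is_type": True, "is_published": False},
    {"seqname": "B1", "species": "1280", "is_type": False, "is_published": False},
]

selected = [r["seqname"] for r in downsample(records, per_species=2)]
print(selected)
```

Sorting on a tuple of negated flags keeps the priority order explicit and relies on Python's stable sort to preserve input order among records of equal priority.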

Database feather file

Record annotations are stored in a single-table feather file with the following columns and data types, grouped under the script names below:

extract_genbank.py

seqname	string
version	string
accession	string
name	string
description	string
tax_id	string
modified_date	datetime
download_date	datetime
version_num	string
source	string
keywords	string
organism	string
length	int
ambig_count	int
strain	string
mol_type	string
isolate	string
isolation_source	string
seq_start	int
seq_stop	int
16s_start	int
16s_stop	int
master	string
locus_tag	string
old_locus_tag	string

taxonomy.py

species	string
genus	string
species_name	string
genus_name	string

is_type.py

is_type	bool

is_published.py

is_published	bool

is_refseq.py

is_refseq	bool

is_valid.py

is_valid	bool

confidence.py

confidence	string

ani.py

assembly_genbank	string
assembly_refseq	string
declared-type-ANI	string
declared-type-qcoverage	string
best-match-type-assembly	string
best-match-species-taxid	string
best-match-species-name	string
best-match-type-category	string
best-match-type-ANI	string
best-match-type-qcoverage	string
taxonomy-check-status	string

filter_outliers.py

seqhash	string
centroid	string
dist	float
is_out	bool
cluster	float
x	float
y	float
filter_outliers	bool
dist_pct	float
rank_order	float
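
As a rough illustration of how the table might be used, the sketch below loads a feather file with pandas and selects records by the boolean columns listed above. It rests on several assumptions: pandas and pyarrow are installed, `is_out` is True for records flagged as outliers, and the helper names `load_annotations` and `select_records` are hypothetical and not part of the pipeline. The demo uses a small in-memory table rather than a real output file.

```python
import pandas as pd


def load_annotations(path):
    """Read the annotation table from a feather file (requires pyarrow)."""
    return pd.read_feather(path)


def select_records(df, type_only=False, drop_outliers=False):
    """Return the rows matching the requested flags (hypothetical helper)."""
    keep = pd.Series(True, index=df.index)
    if type_only:
        keep &= df["is_type"]
    if drop_outliers:
        # Assumption: is_out is True for records flagged as outliers.
        keep &= ~df["is_out"]
    return df[keep]


# Small in-memory stand-in for the real table, with made-up values.
demo = pd.DataFrame(
    {
        "seqname": ["s1", "s2", "s3"],
        "is_type": [True, False, True],
        "is_out": [False, False, True],
    }
)

subset = select_records(demo, type_only=True, drop_outliers=True)
print(subset["seqname"].tolist())
```

With a real file, `load_annotations` would be called with the path to the feather output and its result passed to `select_records` in place of `demo`.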

Docker

A Docker image can be built with the following:

docker build --tag ya16sdb:latest .

Once a Docker image has been built, a Singularity image can be built from it using the Docker daemon:

singularity build ya16sdb.img docker-daemon://ya16sdb:latest

A Singularity image can also be built using a Singularity Docker container:

docker run --volume /var/run/:/var/run/ --volume $(pwd):$(pwd) --workdir $(pwd) singularity:latest build ya16sdb.img docker-daemon://ya16sdb:latest

Pipeline execution

The container images have a predefined entry point that runs the SConstruct pipeline file.

To execute the pipeline with Docker, only a settings.conf file is required. Run it as follows:

docker run --volume $(pwd):$(pwd) --workdir $(pwd) ya16sdb:latest

And with Singularity:

singularity run --bind $(pwd) --pwd $(pwd) ya16sdb.img
