GitHub - neufeld/pandaseq: PAired-eND Assembler for DNA sequences

PANDAseq is a program to align Illumina reads, optionally with PCR primers embedded in the sequence, and reconstruct an overlapping sequence.

INSTALLATION

Binary packages are available for recent versions of Windows, MacOS and Linux. Source code is also available. See Installation instructions for details.

Development packages for zlib and libbz2 are needed, as well as a standard compiler environment. On Ubuntu, these can be installed via:

sudo apt-get install build-essential libtool automake zlib1g-dev libbz2-dev pkg-config

On MacOS, the Apple Developer tools and Fink (or MacPorts or Brew) must be installed, then:

sudo fink install bzip2-dev pkgconfig

The newer AutoTools from Fink are needed over the ones provided by Apple, so ensure that Fink's bin directory precedes /usr/bin in the $PATH.

After the support packages are installed, one should be able to do:

./autogen.sh && ./configure && make && sudo make install

If you receive an error that libpandaseq.so.[number] is not found on Linux, try running:

sudo ldconfig

USAGE

Please consult the manual page by invoking:

man pandaseq

or by visiting the online PANDAseq manual page.

The short version is:

pandaseq -f forward.fastq -r reverse.fastq
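
For scripted pipelines, the short invocation above can be wrapped in a small helper. This is a minimal Python sketch, not part of PANDAseq: the helper names `build_command` and `assemble` are hypothetical, only the `-f`, `-r` and `-F` flags mentioned in this README are used, and it assumes that assembled sequences are written to standard output when no output file is requested (consult `man pandaseq` to confirm).

```python
import shutil
import subprocess


def build_command(forward, reverse, fastq_output=False):
    """Return the argument list for a basic pandaseq run (hypothetical helper)."""
    cmd = ["pandaseq", "-f", forward, "-r", reverse]
    if fastq_output:
        cmd.append("-F")  # FASTQ output, as mentioned in the FAQ
    return cmd


def assemble(forward, reverse, out_path, fastq_output=False):
    """Run pandaseq and save its standard output to out_path.

    Assumption: assembled sequences are written to standard output.
    """
    if shutil.which("pandaseq") is None:
        raise FileNotFoundError("pandaseq is not on the PATH")
    with open(out_path, "w") as out:
        subprocess.run(build_command(forward, reverse, fastq_output),
                       stdout=out, check=True)


if __name__ == "__main__":
    # Show the command that would be run, without executing it.
    print(" ".join(build_command("forward.fastq", "reverse.fastq")))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.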

REPORTING BUGS

Before filing a bug, consult how to file a bug.

Please run:

curl https://raw.github.com/neufeld/pandaseq/master/pandabug | sh

or

wget -O- https://raw.github.com/neufeld/pandaseq/master/pandabug | sh

to create a header with basic details about your system. Please include:

The output of the above script.
The exact error message. If this is a compilation error, do not truncate the output. If this is a problem when assembling, keep the INFO ARG lines, and the last few lines, but you may truncate the middle.
If you have tried multiple different things, please list them all.
Your sequencing data may be requested. This usually does not necessitate all the reads.

BINDING

PANDAseq may be used in other programs via a programmatic interface. Consult the header file pandaseq.h for more details. The C interface is pseudo-object oriented and documented in the header. The library provides pkg-config information, so compiling against it can be done using something like:

cc mycode.c `pkg-config --cflags --libs pandaseq-2`

or using, in configure.ac:

PKG_CHECK_MODULES(PANDASEQ, [ pandaseq-2 >= 2.5 ])

Other language bindings are welcome.

FAQ

Can I insist that PANDAseq only assemble perfect sequences?

Yes, but you shouldn't want to do it. The whole point is to fix sequences which are probably good. There is no quality setting that will achieve this effect. You can use the plugin completely_miss_the_point, but this really does miss the point. Moreover, assuming that the sequencer is right in the overlap region and in the non-overlapping regions requires an unsound leap in statistics.

Can I use SAM/BAM files as input without converting them to FASTQ?

Yes. PANDAseq-sam extends PANDAseq to do this. SAM/BAM files do not guarantee that sequences will be in the right order, so using SAM/BAM files may be slower and PANDAseq will use more memory.

The scores of the output bases seem really low. What's wrong?

Nothing. The quality scores of the output do not have any similarity to the original quality scores and are not uniform across the sequence (i.e., the overlap is scored differently from the unpaired ends).

In the overlap region where there is a mismatch, it is probable that one base was sequenced correctly and the other was sequenced incorrectly. If both bases have high scores (i.e., each is probably correct), the score of the resulting base is low (i.e., the base is probably incorrect). For more information, see the paper. Also, remember that the PHRED-to-probability conversion is not linear, so most scores are relatively high. It is also not uncommon to see the PHRED score !, which is zero, but in this context it means any score less than " (PHRED = 1, P = .20567).

Again, these scores are not meant to be interpreted as regular scores and should not be processed by downstream applications expecting PHRED scores from Illumina sequences.
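
To make the non-linear conversion concrete, here is a minimal Python sketch of the standard PHRED definition (error probability = 10^(-Q/10)). It is not PANDAseq's code: the function names are hypothetical, and the quality characters are assumed to use the common Phred+33 encoding, under which `!` is 0 and `"` is 1.

```python
def phred_to_error(q):
    """Probability that a base call with PHRED score q is wrong."""
    return 10 ** (-q / 10)


def phred_to_correct(q):
    """Probability that a base call with PHRED score q is right."""
    return 1.0 - phred_to_error(q)


def char_to_phred(c, offset=33):
    """Decode one FASTQ quality character (assumes Phred+33 encoding)."""
    return ord(c) - offset


if __name__ == "__main__":
    # Equal steps in score are far from equal steps in probability.
    for ch in ['!', '"', '+', '5', 'I']:
        q = char_to_phred(ch)
        print(ch, q, round(phred_to_correct(q), 5))
```

For `"` (PHRED = 1) this gives a probability of about 0.20567, matching the figure quoted above.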

The scores of the non-overlapping regions are not the same as the original reads. Why?

The PHRED scores from the input are not copied directly to the output when using FASTQ (-F) output. They go through a transformation from PHRED scores into probabilities, which is what PANDAseq uses. When output as FASTQ, the probabilities are converted back to PHRED scores. The rounding error involved can cause a score to jump by one.
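
The effect of the round trip can be sketched in Python. This is an illustration under stated assumptions, not PANDAseq's implementation: the function names are hypothetical, the conversion uses the standard PHRED definition, and truncating to an integer on the way back is an assumption chosen to show how floating-point error can move a score by one.

```python
import math


def phred_to_prob(q):
    """Convert a PHRED score to the probability that the base is correct."""
    return 1.0 - 10 ** (-q / 10)


def prob_to_phred(p):
    """Convert a probability back to an integer PHRED score.

    Assumption: the result is truncated, not rounded to nearest.
    """
    err = 1.0 - p
    if err <= 0.0:
        raise ValueError("probability too close to 1 to express as a score")
    return int(-10 * math.log10(err))


def round_trip(q):
    """PHRED score -> probability -> PHRED score."""
    return prob_to_phred(phred_to_prob(q))


if __name__ == "__main__":
    # List any scores that do not survive the round trip unchanged.
    for q in range(42):
        back = round_trip(q)
        if back != q:
            print(q, "->", back)
```

Any score that changes in this sketch changes by exactly one, which is the size of jump described above.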

How many sequences should there be in the output?

You should expect that PANDAseq will output fewer sequences than the read pairs given to it. The log contains several STAT lines that will help with the analysis. Lines containing STAT READS report the number of read pairs in the input. Sequences first go through a number of basic filtering steps and then user-specified filtering steps. If provided, forward and reverse primers are aligned and clipped. The optimal overlap is selected and the sequence is constructed. The quality score is verified and any user-specified filtering is done. Any of these steps might fail and cause the sequence to be rejected. For each of the possible rejection reasons, the log file will contain a STAT line reporting the number of sequences filtered, as is described in the Output Statistics section of the manual.

If multiple threads are used, which is the default on most platforms, each thread collects this information separately. The output log will contain a group of STAT lines per thread.

The STAT SLOW line is informative; those sequences were not rejected. The other STAT lines (i.e., not READS or SLOW) should sum to the STAT READS line.
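
The bookkeeping described above can be checked with a short script. The following Python sketch rests on several assumptions that should be verified against a real log and the Output Statistics section of the manual: log lines are taken to be tab-separated with the fields `STAT`, a name and a count, whatever precedes `STAT` is treated as the thread identifier, and the category names in the sample (`NOALGN`, `LOWQ`, `OK`) are illustrative only. Any STAT lines that are not per-sequence counts would need to be added to the `informational` set.

```python
from collections import defaultdict


def check_stat_totals(log_lines, informational=("READS", "SLOW")):
    """Compare STAT READS with the sum of the other STAT counts, per thread.

    Returns {thread_id: (reads, accounted, matches)}.
    Assumed line layout: <thread id> TAB STAT TAB <name> TAB <count>.
    """
    per_thread = defaultdict(dict)
    for line in log_lines:
        fields = line.rstrip("\n").split("\t")
        if "STAT" not in fields:
            continue
        i = fields.index("STAT")
        if len(fields) < i + 3 or not fields[i + 2].isdigit():
            continue  # not a simple name/count line
        thread = "\t".join(fields[:i])
        per_thread[thread][fields[i + 1]] = int(fields[i + 2])

    results = {}
    for thread, stats in per_thread.items():
        reads = stats.get("READS", 0)
        accounted = sum(n for name, n in stats.items()
                        if name not in informational)
        results[thread] = (reads, accounted, reads == accounted)
    return results


# Hypothetical log excerpt with two threads (names and counts are made up).
SAMPLE = [
    "t0\tSTAT\tREADS\t100",
    "t0\tSTAT\tNOALGN\t10",
    "t0\tSTAT\tLOWQ\t5",
    "t0\tSTAT\tSLOW\t3",
    "t0\tSTAT\tOK\t85",
    "t1\tSTAT\tREADS\t50",
    "t1\tSTAT\tNOALGN\t8",
    "t1\tSTAT\tOK\t42",
]

if __name__ == "__main__":
    for thread, (reads, accounted, ok) in check_stat_totals(SAMPLE).items():
        print(thread, reads, accounted, "consistent" if ok else "MISMATCH")
```

Grouping by thread mirrors the per-thread STAT groups in the log; summing the per-thread results gives the totals for the whole run.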

ALTERNATIVES

Similar algorithms (i.e., determine the overlap, then fuse the reads):

COPE (Connecting Overlapped Pair-End reads) – Algorithm similar to FLASH
FastqJoin in ea-utils – Algorithm included in PANDAseq
FLASH (Fast Length Adjustment of SHort reads) – Algorithm included in PANDAseq
PEAR (Paired-End AssembleR) – Algorithm included in PANDAseq
stitch
XORRO (Rapid Pair-end Read Overlapper)
leeHom

Completely different methods:

SeqPrep – Uses alignment

CITATION

Andre P Masella, Andrea K Bartram, Jakub M Truszkowski, Daniel G Brown and Josh D Neufeld. PANDAseq: paired-end assembler for illumina sequences. BMC Bioinformatics 2012, 13:31. http://www.biomedcentral.com/1471-2105/13/31

