L. Salmela and E. Rivals
LoRDEC is a program that corrects sequencing errors in long reads produced by third-generation sequencing technologies with a high error rate, and is especially intended for PacBio reads. It uses a hybrid strategy, meaning that it works with two sets of reads: the reference read set, whose error rate is assumed to be low, and the PacBio read set, which is corrected using the reference set. Typically, the reference set contains Illumina reads.
Errors in PacBio reads usually include many insertions and deletions, and comparatively fewer substitutions. LoRDEC can correct errors of all these types.
After correction, a larger portion of each PacBio read is usable for detecting regions of similarity with other sequences, for aligning the reads to the contigs of an assembly, etc.
- Why is LoRDEC different?
- It is efficient and can process large read data sets, including those from eukaryotic or vertebrate species, on a standard computing server, and even runs on desktop or laptop computers.
- It adopts a novel graph-based approach: it builds a succinct de Bruijn graph (DBG) representing the short reads, and seeks a corrective sequence for each erroneous region of a long read by traversing chosen paths in the graph.
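To make the graph-based idea concrete, here is a minimal Python sketch. It is not LoRDEC's implementation, which relies on a succinct graph representation from the GATB library. The sketch assumes the graph is simply a set of solid k-mers, ignores reverse complements, and uses hypothetical names (successors, find_path). It searches for a path between two k-mers, standing in for the k-mers that flank an erroneous region of a long read, and returns the sequence spelled by that path.

```python
def successors(kmer, solid_kmers):
    """Yield the k-mers in the set that overlap `kmer` by its last k-1 bases."""
    for base in "ACGT":
        nxt = kmer[1:] + base
        if nxt in solid_kmers:
            yield nxt


def find_path(source, target, solid_kmers, max_len):
    """Depth-first search for a path from `source` to `target`.

    Returns the sequence spelled by the path (the source k-mer plus one
    base per edge), or None if no path with at most `max_len` edges exists.
    """
    stack = [(source, source)]  # (current k-mer, sequence spelled so far)
    while stack:
        kmer, seq = stack.pop()
        if kmer == target:
            return seq
        # Bound the search depth so that cycles in the graph cannot loop forever.
        if len(seq) - len(source) < max_len:
            for nxt in successors(kmer, solid_kmers):
                stack.append((nxt, seq + nxt[-1]))
    return None


# Toy graph: the 4-mers of the sequence ACGTTGCA (illustrative data only).
solid = {"ACGT", "CGTT", "GTTG", "TTGC", "TGCA"}
print(find_path("ACGT", "TGCA", solid, max_len=10))  # prints ACGTTGCA
```

The depth bound plays a role loosely comparable to limiting how far the search may explore; how LoRDEC actually chooses and bounds its paths is governed by its own options and is not reproduced here.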
- Input and output
The input read sets are in FASTA or FASTQ format. The reference read set can be compressed (more precisely, gzipped).
The output is the set of corrected reads, also in FASTA format. In these corrected sequences, uppercase symbols denote correct nucleotides, while lowercase symbols denote nucleotides left uncorrected.
The correction program also needs two parameters when it is called (so five pieces of information altogether; see Usage below):
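Because the case of each symbol carries the correction status, downstream scripts can recover the corrected and uncorrected regions directly from the sequence. The following Python sketch illustrates this; the function names (case_regions, uppercase_fraction) are hypothetical and not part of LoRDEC, and the read is an invented example.

```python
from itertools import groupby


def case_regions(seq):
    """Split a corrected read into runs of equal case.

    Returns a list of (is_upper, start, end) tuples, with `end` exclusive.
    """
    regions = []
    pos = 0
    for is_upper, run in groupby(seq, key=str.isupper):
        length = len(list(run))
        regions.append((is_upper, pos, pos + length))
        pos += length
    return regions


def uppercase_fraction(seq):
    """Fraction of symbols in uppercase (0.0 for an empty read)."""
    return sum(c.isupper() for c in seq) / len(seq) if seq else 0.0


read = "acGTACgtTT"  # invented corrected read
print(case_regions(read))
print(uppercase_fraction(read))  # prints 0.6
```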
- the k-mer length, k, i.e. the length of the k-mers that are counted and used in the graph
- the solidity threshold, s, i.e. the minimum number of occurrences in the Illumina reads for a k-mer to be assumed correct.
For bacterial species or eukaryotic species with small genomes, you may choose k=19 or 17, and s=2 or 3. For species with larger genomes, use k=21 and s=2 or 3.
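The roles of k and s can be illustrated with a small Python sketch. It is a simplification under stated assumptions: k-mers are counted as plain strings, without merging a k-mer with its reverse complement, and the names count_kmers and solid_kmers are hypothetical. LoRDEC itself delegates k-mer counting to the GATB library.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(reads, k, s):
    """Keep the k-mers that occur at least s times (the 'solid' k-mers)."""
    return {kmer for kmer, n in count_kmers(reads, k).items() if n >= s}


# Invented short reads; the last one ends with a base seen only once.
reads = ["ACGTA", "ACGTA", "ACGTT"]
print(sorted(solid_kmers(reads, k=4, s=2)))  # prints ['ACGT', 'CGTA']
```

In this toy example the k-mer CGTT occurs only once, so with s=2 it is treated as a likely error and excluded. Raising s discards more k-mers, and lowering it keeps more of them, including erroneous ones.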
LoRDEC contains several programs:
- lordec-correct: the main program for correcting the PacBio reads
- lordec-stats: for computing statistics about the PacBio reads
- lordec-trim: trims from the corrected PacBio reads the parts at the beginning or end of the sequence that could not be corrected
- lordec-trim-split: trims the corrected PacBio reads and splits them into several parts if some internal region could not be corrected
- lordec-build-SR-graph: builds the de Bruijn graph from the FASTA file of short reads and saves it to an HDF5-formatted file
The lordec-trim and lordec-trim-split programs take corrected PacBio reads as input.
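The difference between the two trimming programs can be sketched in a few lines of Python. This is only an illustration based on the descriptions above and on the uppercase/lowercase convention of the corrected reads. The functions trim and trim_split are hypothetical stand-ins, and the real programs may apply additional rules that are not modelled here.

```python
import re
import string


def trim(seq):
    """Remove the lowercase (uncorrected) prefix and suffix of a read.

    Lowercase regions inside the read are kept.
    """
    return seq.strip(string.ascii_lowercase)


def trim_split(seq):
    """Return the maximal uppercase (corrected) stretches of a read."""
    return re.findall(r"[A-Z]+", seq)


read = "acGTACgtTTa"  # invented corrected read
print(trim(read))        # prints GTACgtTT
print(trim_split(read))  # prints ['GTAC', 'TT']
```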
We provide the source code, which is available in the following archive
To compile it you will need to fetch the GATB core library (see below).
Please consult the README file (text format) available in the source archive.
LoRDEC needs the GATB core library (currently in version gatb-core-1.1.0-Linux.tar.gz)
To install, please download the archive, go to the desired directory, and type in a shell:
- tar xzvf LoRDEC-0.4.1.tar.gz
Usage (changed at version 0.3)
The parameters on the command line can be given in any order.
- For correcting the PacBio reads: lordec-correct
lordec-correct [--trials <number of target k-mers>] [--branch <maximum number of branches to explore>] [--errorrate <maximum error rate>] [--threads <number of threads>] -2 <FASTA/Q files> -k <k-mer size> -s <abundance threshold> -i <PacBio FASTA file> -o <output file corrected reads>
lordec-correct -2 illumina.fasta -k 19 -s 3 -i pacbio.fasta -o pacbio-corrected.fasta
- For computing statistics: lordec-stats
lordec-stats -2 <Short read FASTA/Q file> -k <k-mer size> -s <solid k-mer threshold> -i <PacBio FASTA/Q file> -S <output stat file> [-T <number of threads>]
- For trimming the corrected PacBio reads: lordec-trim
lordec-trim -i <corrected reads file> -o <trimmed reads file>
- For trimming and splitting the corrected PacBio reads: lordec-trim-split
lordec-trim-split -i <corrected reads file> -o <trimmed reads file>
- For building and saving the de Bruijn graph of short reads: lordec-build-SR-graph
lordec-build-SR-graph [-T <number of threads>] -2 <FASTA file> -k <k-mer size> -s <solid k-mer threshold> -g <out graph file>
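When LoRDEC is run from a pipeline, the lordec-correct command line described above can be assembled programmatically. The Python sketch below uses only options listed in the usage of lordec-correct. The helper names (lordec_correct_cmd, run_lordec_correct) are hypothetical, the defaults k=19 and s=3 simply mirror the example command, and the sketch assumes that lordec-correct is installed and on the PATH.

```python
import subprocess


def lordec_correct_cmd(short_reads, long_reads, output, k=19, s=3, threads=None):
    """Build the argument list for a lordec-correct call."""
    cmd = [
        "lordec-correct",
        "-2", short_reads,
        "-k", str(k),
        "-s", str(s),
        "-i", long_reads,
        "-o", output,
    ]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


def run_lordec_correct(*args, **kwargs):
    """Run lordec-correct; raises CalledProcessError on a non-zero exit."""
    subprocess.run(lordec_correct_cmd(*args, **kwargs), check=True)


# Only builds and prints the command; nothing is executed here.
print(" ".join(lordec_correct_cmd("illumina.fasta", "pacbio.fasta",
                                  "pacbio-corrected.fasta")))
```

Passing the arguments as a list rather than as a single shell string avoids quoting problems with file names that contain spaces.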