RAPPAS: Rapid alignment-free phylogenetic identification of metagenomic sequences

Benjamin Linard, Krister Swenson, Fabio Pardi

Motivation

Taxonomic classification is at the core of environmental DNA analysis. When a phylogenetic tree can be built as a prior hypothesis to such classification, phylogenetic placement (PP) provides the most informative type of classification because each query sequence is assigned to its putative origin in the tree. This is useful whenever precision is sought (e.g. in diagnostics). However, likelihood-based PP algorithms struggle to scale with the ever-increasing throughput of DNA sequencing.

Alignment-free phylogenetic placement

RAPPAS uses an alignment-free approach, removing the hurdle of query sequence alignment as a preliminary step to PP. Its approach relies on the precomputation of a database of k-mers that may be present with non-negligible probability in relatives of the reference sequences. The placement is performed by inspecting the stored phylogenetic origins of the k-mers in the query, and their probabilities.
The database can be reused for the analysis of several different metagenomes. Experiments show that the first implementation of RAPPAS is already faster than competing likelihood-based PP algorithms, while keeping similar accuracy for short reads. RAPPAS scales PP for the era of routine metagenomic diagnostics.
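The lookup-based placement described above can be illustrated with a toy sketch. This is not RAPPAS's implementation: the dictionary layout, the `place` and `kmers` functions, the branch names, and the `default_logp` penalty for k-mers absent from a branch are all assumptions made for illustration. The sketch only shows the general idea of scoring each candidate branch from the stored probabilities of the query's k-mers.

```python
def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def place(query, db, k, default_logp):
    """Score candidate branches for one query and return (best_branch, scores).

    db maps k-mer -> {branch_id: log10 probability} (hypothetical layout).
    A branch with no stored entry for a given k-mer is charged default_logp,
    an assumed penalty standing in for a "negligible probability" cutoff.
    """
    words = list(kmers(query, k))
    # Only branches associated with at least one query k-mer are candidates.
    branches = {b for w in words for b in db.get(w, {})}
    scores = {
        b: sum(db.get(w, {}).get(b, default_logp) for w in words)
        for b in branches
    }
    if not scores:
        return None, scores
    best = max(scores, key=scores.get)
    return best, scores


# Toy database, built once and reusable for any number of queries.
toy_db = {
    "ACG": {"branch_1": -0.1, "branch_2": -1.5},
    "CGT": {"branch_1": -0.2},
    "GTA": {"branch_2": -0.3},
}

best, scores = place("ACGT", toy_db, k=3, default_logp=-3.0)
print(best, scores)
```

In this toy run the query shares both of its k-mers with `branch_1`, so that branch obtains the highest summed log-probability. No alignment of the query is computed at any point; only dictionary lookups are needed once the database exists.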
Figure 1: Comparison between previous phylogenetic placement software and RAPPAS.
The pipelines from query sequence datasets to placement results (jplace files) are depicted. A. Likelihood-based software requires, for each new query dataset, the step of aligning the query sequences to the reference alignment via an external tool (red box). The resulting extended alignment is the input for the phylogenetic placement itself (blue box). B. RAPPAS builds a database of k-mers (the pkDB) once for a given reference tree and alignment. Many query datasets can then be placed without alignment by matching their k-mer content to this database. Each operation is run with a separate call to RAPPAS (blue boxes).

Downloads

Instructions to download and build RAPPAS are available on its GitHub page.
There is also a Wiki with tutorials, test datasets, discussion about phylogenetic placement ...