As with most nascent research fields, microbial metagenomics has grappled not just with the question of how to measure and contextualize data but how to develop and refine tools to perform those analyses. While many tools have been created and used effectively, benchmarking their performance has proven especially difficult.
One of the critical issues in comparing metagenomics profilers is the fact that some are designed to report sequence abundance, while others reflect taxonomic abundance. These differences can lead to wildly divergent results when interpreting the analysis of a given dataset. Unfortunately, that distinction has often been ignored, which presents a significant roadblock to refining profiling tools.
An inter-institutional team led by Zheng Sun at Brigham and Women’s Hospital and Harvard Medical School, Shi Huang at the Center for Microbiome Innovation (CMI) at the University of California San Diego (UC San Diego), and Niina Haiminen at the IBM Thomas J. Watson Research Center set out to examine this problem in more depth. Their findings were published in Nature Methods in June in an article entitled “Challenges in Benchmarking Metagenomic Profilers.”
“One of the original motivations for this paper was to challenge how we understand taxonomic profiles. It seems as if the field both acknowledges that sequence counts are not representative of microbial cell counts and simultaneously ignores this fact when using taxonomic profilers,” according to CMI Associate Director of Bioinformatic Integration Yoshiki Vázquez-Baeza. “What’s worse is that many of the bioinformatic tools that are used to generate these taxonomic profiles don’t generally make this very clear.”
To explore how and why profilers were falling short or being misused, the group designed experimental scenarios using both standard tools and hypothetical ideal profilers for taxonomic and sequence abundance. They then assessed the resulting data in terms of both per-sample summaries and cross-sample comparisons. As expected, conflating taxonomic and sequence abundance had a significant impact on results, but several aspects of the findings were surprising.
“It was somewhat surprising to discover that the taxonomic and sequence abundance profiles were very different from each other for a broad range of distance measures used. And as it turns out, even with a perfect profiling method, the two resulting profiles differed drastically,” Haiminen commented when discussing their findings.
Moreover, specific types of analysis can yield exaggerated results. “We find that microbiome data analysis based on sequence abundance will underestimate (or overestimate) the relative abundance of microbes of smaller (or larger) genome size, respectively. If we identify a low abundance of viruses/bacterial phages using Kraken or other similar profilers, we should be aware of the potential abundance underestimation,” Huang explained. “Similarly, if a fungus shows high abundance in the data it could be overestimated, especially when calculating bacteria:fungi ratios.”
The paper’s conclusions lay the groundwork for more focused and appropriate use of metagenomic profilers, and they also enable researchers to reassess past studies and, hopefully, refine their analyses. “This paper draws attention to existing work in microbiome literature that may require re-interpretation considering our results and offers guidance for future design and use of metagenomic profilers,” Haiminen added.
Following the work on understanding problems using existing solutions, the group is also building metagenomic profilers to enable more accurate data processing and analysis. “Our team has been developing metagenomic tools for taxonomic profiling, such as SHOGUN/Woltka. It is classified as a DNA-to-DNA profiler that typically produces sequence abundance profiles. According to this principle, we benchmark this tool with metagenomic profilers in the same category (Bracken, Kraken, etc.) in the new methodology paper,” Huang said while describing ongoing efforts to refine metagenomics analysis.
The team has made their simulated datasets available for download, and all of the code used in the project is available on GitHub.
Additional co-authors include Meng Zhang at the Inner Mongolia Agricultural University; Qiyun Zhu and Rob Knight at UC San Diego; Laxmi Parida at the IBM Thomas J. Watson Research Center; Anna Paola Carrieri at IBM Research Europe; Ho-Cheol Kim at the IBM Almaden Research Center; and Yang-Yu Liu at Brigham and Women’s Hospital and Harvard Medical School.
The Center for Microbiome Innovation is proud to include Yoshiki Vázquez-Baeza and Rob Knight on its leadership team.
This piece was written by CMI’s contributing editor Cassidy Symons.
Figure 1 – Comparison of profiling results.
a, Illustration of the reference databases and the default output abundance type for DNA-to-DNA, DNA-to-Protein and DNA-to-Marker profilers on a mixture of two species, A (1 cell) and B (2 cells). b, A simulated microbial community with only two genomes: Bacillus pseudofirmus (genome size 4.2 MB) and Lactobacillus salivarius (genome size 2.1 MB). We merged one copy of the Bacillus pseudofirmus genome (genome A) with two copies of the Lactobacillus salivarius genome (genome B) into one metagenome file. Then we sheared the merged metagenomic sequences into 150 bp fragments to simulate a typical metagenomic dataset. c, Profiling results (default output) of different profilers for the simulated microbial community. The bar plots show the estimated relative abundance of the two microbial members A and B using different metagenomics profilers. PathSeq (default) represents the profiling result generated by the default setting of PathSeq (without genome-length correction). PathSeq (corrected) represents the profiling result of PathSeq with the parameter “--divide-by-genome-length” (i.e., genome-length correction) enabled.
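
The arithmetic behind the two-genome community in Figure 1 can be worked through in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names are hypothetical, and it assumes an ideal profiler in which the number of reads from each genome is exactly proportional to the amount of DNA that genome contributes (cell count times genome size), ignoring sequencing bias, ploidy and database effects.

```python
# Illustrative sketch only: function names are hypothetical and an ideal,
# bias-free profiler is assumed (reads proportional to DNA contributed).

def sequence_abundance(cell_counts, genome_sizes):
    """Fraction of sequenced DNA contributed by each taxon."""
    dna = {taxon: cell_counts[taxon] * genome_sizes[taxon] for taxon in cell_counts}
    total = sum(dna.values())
    return {taxon: amount / total for taxon, amount in dna.items()}


def taxonomic_abundance(seq_abundance, genome_sizes):
    """Genome-length correction: divide each taxon's sequence abundance by
    its genome size, then renormalise so the fractions sum to 1."""
    per_genome = {taxon: seq_abundance[taxon] / genome_sizes[taxon]
                  for taxon in seq_abundance}
    total = sum(per_genome.values())
    return {taxon: value / total for taxon, value in per_genome.items()}


# Community from Figure 1b: one copy of genome A, two copies of genome B.
cells = {"A": 1, "B": 2}
sizes = {"A": 4.2, "B": 2.1}  # genome sizes as given in the caption

seq = sequence_abundance(cells, sizes)
tax = taxonomic_abundance(seq, sizes)

if __name__ == "__main__":
    print("sequence abundance: ", seq)
    print("taxonomic abundance:", tax)
```

Under these assumptions the two genomes each contribute half of the DNA, so the sequence abundance is an even split, while the genome-length-corrected taxonomic abundance recovers the 1:2 ratio of cells. A smaller genome is underrepresented in sequence abundance relative to its cell count, which is the effect Huang describes for viruses and phages.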