The explosion of microbiome data in recent years has enabled rapid advances in the scope of research, yet it has also introduced myriad challenges. This is especially true for metrics of phylogenetic diversity and relatedness, whose computational cost depends on the size of the phylogenetic tree and the underlying sample database.
One such popular metric is Faith’s phylogenetic diversity, which has been noted as particularly sensitive in distinguishing disease factors in the human digestive system. As microbiome research increasingly focuses on applied problems in human health, the demand for these computationally expensive calculations has grown accordingly.
Driven by the need to more efficiently compute diversity present in massive datasets, a group of researchers led by George W. Armstrong from the Center for Microbiome Innovation (CMI) at the University of California San Diego (UC San Diego), Niina Haiminen at the IBM Thomas J. Watson Research Center, and Anna Paola Carrieri at IBM Research Europe collaborated on a new algorithm and implementation to calculate Faith’s phylogenetic diversity. Dubbed Stacked Faith’s Phylogenetic Diversity (SFPhD), the work was published in the November issue of Genome Research in an article entitled “Efficient computation of Faith’s phylogenetic diversity with applications in characterizing microbiomes.”
“Microbiome data of unprecedented size is being produced with an increasing rate, and scalable computational methods for its analysis are direly needed,” Haiminen said while discussing the project’s origins. “Designing a method to efficiently compute a commonly used key diversity metric, Faith’s Phylogenetic Diversity, is helping to address this critical need.”
Development of Stacked Faith’s Phylogenetic Diversity was facilitated by the Artificial Intelligence for Healthy Living Center (AIHL), a collaborative endeavor between CMI and the IBM Research AI Horizons Network. The new algorithm is another exciting achievement from the AIHL team, joining recent studies on metagenomic profilers and feature selection methods, as well as the release of EMPress, among the team’s contributions to microbiome research.
While discussing the algorithm’s impact, Carrieri commented, “Once the new algorithm and implementation was ideated, the main challenge we faced was to clearly explain, for example with drawing, figures and text, the main steps and data structures used by our new algorithm, so that its complexity, computational efficiency, and impact was clearly delivered, and therefore appreciated, to a microbiome research audience.”
To overcome the challenges of sparsity and heterogeneity in modern ’omics datasets, SFPhD uses a sparse matrix implementation for storing information, an efficient tree structure, and partial aggregation of metric constituents. With all of these improvements in place, the algorithm outperforms prior implementations by a wide margin.
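To make the three ingredients above more concrete, here is a minimal Python sketch of how Faith’s phylogenetic diversity can be computed for many samples in a single pass over a tree. It is an illustration only, not the SFPhD implementation: the function name `faith_pd`, the parent-array tree encoding, and the dictionary-based sparse table are all assumptions made for this example. The sketch treats Faith’s PD as the total branch length connecting a sample’s observed taxa to the root.

```python
from collections import defaultdict


def faith_pd(parents, lengths, presence):
    """Illustrative Faith's PD for many samples at once (not the SFPhD code).

    parents[i]  -- index of node i's parent, or -1 for the root; nodes are
                   listed in postorder, so every child precedes its parent.
    lengths[i]  -- length of the branch leading to node i.
    presence    -- sparse table: tip index -> set of sample ids in which
                   that tip is observed (absent entries are not stored).
    """
    # Samples observed anywhere beneath each node, filled in bottom-up.
    node_samples = defaultdict(set)
    for tip, samples in presence.items():
        node_samples[tip] |= samples

    pd = defaultdict(float)
    for node, parent in enumerate(parents):
        samples = node_samples.pop(node, None)
        if not samples:
            continue  # no sample contains a descendant of this node
        # Each sample with a descendant tip gains this branch's length.
        for sample in samples:
            pd[sample] += lengths[node]
        # Partial aggregation: pass the sample set up to the parent.
        if parent >= 0:
            node_samples[parent] |= samples
    return dict(pd)


# Toy tree ((A:1,B:2):3,C:4); in postorder: A=0, B=1, (A,B)=2, C=3, root=4.
parents = [2, 2, 4, 4, -1]
lengths = [1, 2, 3, 4, 0]
presence = {0: {"s1"}, 1: {"s1", "s2"}, 3: {"s2"}}
print(faith_pd(parents, lengths, presence))  # s1 -> 6.0, s2 -> 9.0
```

Because only the tips that are actually observed are stored, and each node’s sample set is discarded as soon as it has been merged into its parent, the work and memory in this sketch scale with the observed data rather than with the full tree-by-sample grid.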
“A calculation that used to require a supercomputer can now be run on an ordinary laptop,” Armstrong observed. “For example, on all of the data in Qiita, we projected 19 CPU hours and three terabytes of memory required. With the new algorithm, we reduced that to one CPU hour and three gigabytes of memory.”
Because of the optimizations in both processing time and memory usage, the algorithm has a significantly lower environmental impact than existing implementations. The team estimates a more than 300-fold reduction in carbon footprint, based on Green Algorithms.
While the improved computational efficiency is an achievement in and of itself, it also enables wider use of Faith’s phylogenetic diversity (Faith’s PD) in metagenomic sequencing, where it was previously under-explored due to technical limitations. The research group explored this possibility by reanalyzing stool samples from the FINRISK study and found that Faith’s PD provided greater statistical power than observed features.
Furthermore, the underlying implementation is not tied to any particular molecular technology. The team envisions potential applications in conservation prioritization, nutrition, and metabolomics research.
SFPhD was released under the BSD license in the unifrac package on GitHub. Additionally, the benchmarking code is available in a separate package on GitHub.
Additional co-authors include Kalen Cantrell, Shi Huang, Daniel McDonald, Antonio Gonzalez, Imran McGrath, Daniel Hakim, Mohit Jain, Austin D. Swafford, Yoshiki Vázquez-Baeza, and Rob Knight, all at UC San Diego; Qiyun Zhu at Arizona State University; Kristen Beck and Ho-Cheol Kim at the IBM Almaden Research Center; Aki S. Havulinna, Teemu Niiranen, and Veikko Salomaa at the Finnish Institute for Health and Welfare; Guillaume Méric at the Baker Heart and Diabetes Institute; Leo Lahti at the University of Turku; Michael Inouye at the University of Cambridge; and Laxmi Parida at the IBM Thomas J. Watson Research Center.
The Center for Microbiome Innovation is proud to include Mohit Jain as a faculty member, as well as Austin D. Swafford and Rob Knight on its leadership team.
Figure 4(B): Web of Life (WoL) phylogenetic tree for the FINRISK data set, with branches colored by the log of the likelihood ratio of older adults compared to younger adults among the descendants of each branch. The inner circle is colored by the log of the likelihood ratio of older adults compared to younger adults at the tips of the tree. The outer circle is colored by the phylum of the taxon represented by each tree tip. Red ellipses mark two clades enriched for samples from older individuals.