The proliferation of sequencing technologies and an ever-increasing focus on ‘omic studies over the past several years have spawned massive quantities of data and many tools to process and analyze the results. As these tools have evolved, so has the need for even greater scale and the ability to visualize complex data sets in more efficient ways.
Born of these needs for advancement, a team of researchers with the Center for Microbiome Innovation (CMI) at the University of California San Diego (UC San Diego), led by Kalen Cantrell and Marcus W. Fedarko, unveiled EMPress, a new interactive visualization tool, in an article published in mSystems on March 16, 2021. The paper, entitled “EMPress Enables Tree-Guided, Interactive, and Exploratory Analyses of Multi-omic Data Sets,” showcases the tool’s functionality, versatility, and scalability, and explores some of the potential applications in visualization and data analysis.
“Although there are many excellent existing tools for visualizing trees, we wanted to create a tool that could be used within a web browser and could scale to really large trees,” Fedarko remarked when discussing the development process. “Another motivation was that there were many custom features that no existing tool, to our knowledge, supported – for example, the ability to interactively link samples in an ordination with a phylogenetic tree, showing the sequences contained within each sample, or going the opposite way, the samples containing a given sequence.”
Aside from interactively integrating ordinations with trees, EMPress also introduces the ability to animate a tree’s changes over time. While other tools have included a temporal dimension to trees, the ability to animate the tree alongside an ordination is a novel feature that the team believes will enhance the exploratory analysis of ‘omic data sets.
What began as an undergraduate project at UC San Diego’s Early Research Scholars Program has since grown into a collaboration that spans both academia and industry. The team at CMI has worked closely with IBM Research’s Artificial Intelligence for Healthy Living Center throughout the process, as well as researchers at the Simons Foundation, Cornell University, and Justus Liebig University.
Developing a tool that was not only web-based but capable of storing, processing, and visualizing such large phylogenetic trees presented numerous challenges. As the scale of microbiome and other ‘omic studies has continued to grow, so too has the technical complexity required to accommodate and analyze the large, complex data sets those studies produce.
“We resolved some of these issues by creating our own custom WebGL, shaders which are capable of visualizing millions of points, and implementing high-performance data structures such as the balanced parentheses tree that helped drastically reduce the memory footprint of EMPress,” according to Cantrell. As the code has been refined and optimized through several phases of development, it has enabled the analysis of larger and larger data sets, including a tree with over 750,000 nodes.
While the article in mSystems represents the tool’s formal unveiling, it has already seen use in several studies and applications. Researchers at the CMI utilized EMPress in their recently published analysis of the composition and microbial load of the human oral microbiome. Additionally, epidemiologists with UC San Diego’s Return to Learn program have been using the tool to help manage the university’s response to the COVID-19 pandemic. These applications have provided critical feedback and facilitated improvements to continue streamlining the analyses for which EMPress is designed.
As development continues, there are several planned features, including the ability to explore time-series data, potential anomaly detection, implementing support for annotating various types of data, and further improving performance to allow for larger data sets. The tool’s source code is available on GitHub, and the team is eager to see how their peers expand upon their work.
“Just as EMPress builds on the many existing tree visualization tools in the literature (iTOL, PHYLOViZ, Anvi’o, SigTree, FigTree, ggtree, TopiaryExplorer, MicroReact, Archaeopteryx, etc.), we’re sure that other tools will pop up soon enough that build on the concepts used in EMPress — we look forward to seeing how the landscape of these tools changes in the future,” Fedarko added.
Additional co-authors include Gibraan Rahman, Daniel McDonald, Yimeng Yang, Thant Zaw, Antonio Gonzalez, Mehrbod Estaki, Qiyun Zhu, Erfan Sayyari, George Armstrong, Anupriya Tripathi, Julia M. Gauglitz, Clarisse Marotz, Nathaniel L. Matteson, Cameron Martino, Se Jin Song, Austin D. Swafford, Pieter C. Dorrestein, Kristian G. Andersen, Yoshiki Vázquez-Baeza, and Rob Knight, all at UC San Diego; Stefan Janssen at Justus Liebig University; Niina Haiminen and Laxmi Parida at the IBM Thomas J. Watson Research Center; Kristen L. Beck and Ho-Cheol Kim at the IBM Almaden Research Center; James T. Morton at the Simons Foundation; Jon G. Sanders at Cornell University; and Anna Paola Carrieri at IBM Research Europe.
The Center for Microbiome Innovation is proud to include Pieter C. Dorrestein as a faculty member, as well as Se Jin Song, Austin D. Swafford, Yoshiki Vázquez-Baeza, and Rob Knight on its leadership team.
This piece was written by CMI’s contributing editor Cassidy Symons