Imagine trying to assemble a puzzle with an ever-increasing number of pieces that vary in clarity and detail. Just as you finish one small section, you step back to find that new pieces have spilled onto the table, shifting the context and reshaping the task ahead.
Evolutionary microbiologists face precisely this predicament. Over the past five years, whole genome sequencing has unleashed a torrent of new data in the field. Public repositories have grown from approximately 650,000 microbial genomes in 2020 to more than 1.5 million today. While the availability of raw data is a boon, it has also presented a stumbling block.
To use those genomes, researchers need to relate them to each other in evolutionary structures known as phylogenies or phylogenetic trees. Algorithms to decipher these relationships have struggled under the weight of growing genomic data. Existing methods construct entire phylogenetic trees in one go and can’t update results piece by piece. Because the pool of available data is constantly growing, this approach creates a painful bottleneck for downstream microbiome analysis.
A team of researchers led by Metin Balaban and Siavash Mirarab at the University of California San Diego (UC San Diego) sought to develop a more scalable method to infer phylogenetic trees. Their work was published in Nature Biotechnology on July 27, 2023, in an article entitled “Generation of accurate, expandable phylogenomic trees with uDance.”
“We wanted to build a framework that can allow us to build large phylogenies from genome sequences and, more importantly, grow it as more genomes are discovered by the scientific community,” Balaban, the study’s lead author, explained while discussing their motivations.
Rather than attempting to tackle an increasingly large problem as a whole, they settled on a divide-and-conquer approach. By breaking the process into smaller, more manageable components, they achieved levels of scalability and efficiency that outstripped existing methods of constructing phylogenies. Using their new method, they were able to construct a microbial tree of life with nearly 200,000 genomes, an order-of-magnitude improvement over the commonly used Web of Life.
Scalability alone is a step forward for the field, but uDance introduces two other critical benefits. First, the resulting phylogenies contain fewer errors and irregularities than those produced by existing methods.
“Through simulations and extensive analysis, we demonstrated that building the phylogenies step by step, starting with a small core and then adding additional genomes over time, not only improved scalability but also increased the overall accuracy of the results,” noted Balaban, who undertook this research while he was a PhD candidate in Bioinformatics and Systems Biology at UC San Diego.
Additionally, the algorithm allows existing phylogenetic trees to be updated incrementally, rather than constructed de novo. The ability to periodically update and expand reference phylogenies promises to accelerate downstream analyses in a variety of fields. “Such a growing tree will enable microbiologists to quickly analyze their samples, especially environmental samples, with respect to an up-to-date view of the microbial diversity known to science,” commented Mirarab, who is a professor in the Department of Electrical and Computer Engineering at UC San Diego.
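To make the difference between updating a tree and rebuilding it concrete, here is a toy sketch in Python. It is not uDance's algorithm. It only illustrates the general idea of adding a genome to an existing tree without reconstructing it. The `Node` class, the `place` and `newick` functions, the four-letter "genomes," and the nearest-neighbor rule based on Hamming distance are all assumptions made up for this illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Toy tree node (hypothetical): leaves carry a name and a short aligned sequence."""
    name: Optional[str] = None
    seq: Optional[str] = None
    children: list = field(default_factory=list)

    def leaves(self):
        """Return all leaf nodes below (or at) this node, left to right."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def place(root: Node, name: str, seq: str) -> Node:
    """Attach a new leaf as sister to its closest existing leaf.

    Only the closest leaf is modified; the rest of the tree is left as it was.
    This nearest-leaf rule is an illustrative stand-in, not a real placement method.
    """
    closest = min(root.leaves(), key=lambda leaf: hamming(leaf.seq, seq))
    # Turn the closest leaf into an internal node holding the old and new leaves.
    old_leaf = Node(closest.name, closest.seq)
    closest.name, closest.seq = None, None
    closest.children = [old_leaf, Node(name, seq)]
    return root


def newick(node: Node) -> str:
    """Render the tree topology as a Newick-style string (no branch lengths)."""
    if not node.children:
        return node.name
    return "(" + ",".join(newick(child) for child in node.children) + ")"


# An existing three-genome tree, then one newly "discovered" genome added to it.
tree = Node(children=[
    Node(children=[Node("A", "AAAA"), Node("B", "AAAT")]),
    Node("C", "GGGG"),
])
before = newick(tree)
after = newick(place(tree, "D", "GGGT"))
print(before, "->", after)
```

In this sketch the new genome D lands next to C, its closest relative, while the A and B branch is never revisited. A real method has to do far more, including choosing placements with a proper evolutionary model and refining the affected parts of the tree, but the saving comes from the same source: the work scales with what is new rather than with everything already known.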
The method has already been put to use by researchers in the Knight Lab at UC San Diego in the construction of the newly released Greengenes2 reference database.
“uDance is a critical tool in growing reference databases to microbial scales of diversity,” according to Daniel McDonald, PhD, co-author on this study and lead author on the Greengenes2 companion paper. “By updating an existing phylogeny, uDance dramatically reduces the computational burden to integrate new genomic information.”
While uDance’s benefits to the field are already evident, Mirarab is eager to refine and expand the algorithm’s capabilities. “We’d like to extend uDance to enable analyses of multi-copy genes,” he said while considering future applications. “Additionally, we have ideas for how some steps in the method can be replaced with machine learning. Perhaps these will improve accuracy or reduce the resources or time necessary to infer phylogenies.”
The team has made uDance and all data used in the study available via GitHub.
Additional co-authors include Yueyu Jiang and Rob Knight at UC San Diego, and Qiyun Zhu at Arizona State University.
The UC San Diego Center for Microbiome Innovation is proud to include Siavash Mirarab as a faculty member and Rob Knight on its leadership team.
A phylogeny constructed by uDance representing over 200k genomes, collapsed to the class or phylum level depending on clade size. The taxonomy shown is from the Genome Taxonomy Database, and the numbers in parentheses indicate how many genomes each clade represents.