New analysis implicates previously incomplete human reference genome as source of bias and identifies ethical risks within metagenomic data.
As interest in health and wellness grows, scientists are devoting more time and resources to microbially derived insights and their effects on quality of life. Researchers have now shown that the limitations of human reference genomes, together with incomplete removal of human DNA from the metagenomic datasets used to study the microbiome, introduce bias and create data-privacy risks for the people whose sequencing data are included.
Recent findings point to a promising way to address systematic biases that were previously hidden within host-microbiome metagenomic analysis: comprehensively removing human DNA using recently released complete human reference genomes. The results come from a collaborative study led by researchers at the Center for Microbiome Innovation (CMI) at the University of California San Diego (UC San Diego).
Their work was published in Nature Communications on January 18, 2025, in an article entitled “Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data.” This publication introduces new analysis techniques designed to specifically improve the ability to algorithmically remove human DNA from metagenomic data. This innovation revealed both sex biases in legacy analyses and the concerning risks of exposing personally identifiable information previously hidden within metagenomic data.
Combining UC San Diego’s sophisticated data filtration capabilities and expertise in microbiome metagenomic assessment, this study discovered detrimental effects from improper host filtration when exploring sequencing data from a cohort of metastatic human tumor tissues. These data were originally processed and published before the release of an updated human reference genome, which added a complete human Y chromosome, and have since been independently analyzed for microbiomes. Without the inclusion of the Y chromosome in the human reference genome, microbiome analysis may incorrectly assign portions of the Y chromosome to a microbial signature from the sample.
“Previous studies have shown the types of microbes in or on our bodies, and associated those microbes with disease states such as atopic dermatitis or Alzheimer’s,” noted lead author Caitlin Guccione, a graduate student in Bioinformatics and Systems Biology at UC San Diego. “Host filtration workflows exclusively use a single human reference, which fails to capture the diversity of human genomes and cannot remove population-specific variation. Portions of the human genome that are incomplete in these references, such as the Y chromosome, can permit flow-through of human reads from those regions to microbial mapping steps, leading to the mismapping of taxa during classification and artifactual data distributions (e.g., false sex differences in the low biomass microbial profiles).”
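The flow-through effect described above can be illustrated with a minimal sketch. It is not the study's workflow: it assumes a toy host filter that discards any read sharing at least half of its k-mers with the reference, whereas production pipelines rely on read aligners. All sequences, contig names, and the `min_fraction` threshold are invented for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(reference, k):
    """Collect every k-mer found in any contig of a reference (name -> sequence)."""
    index = set()
    for contig in reference.values():
        index |= kmers(contig, k)
    return index


def host_filter(reads, index, k, min_fraction=0.5):
    """Keep reads that look non-host: fewer than min_fraction of k-mers hit the index."""
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        hits = sum(1 for km in read_kmers if km in index)
        if not read_kmers or hits / len(read_kmers) < min_fraction:
            kept.append(read)
    return kept


K = 8  # toy k-mer length (assumption)

# Hypothetical toy sequences; real chromosomes are millions of bases long.
chr1 = "ACGTACGGATCCTTAGGCATCGAAGT"
chr_y = "TTGACCAGTGGCAATCTGAGGTCCAA"
microbe = "GGGCCCAAATTTGCGCGTATATGC"

autosomal_read = chr1[2:18]   # human read from a region present in both references
y_read = chr_y[4:20]          # human read from the Y chromosome
microbial_read = microbe[0:16]
reads = [autosomal_read, y_read, microbial_read]

incomplete_index = build_index({"chr1": chr1}, K)               # no Y chromosome
complete_index = build_index({"chr1": chr1, "chrY": chr_y}, K)  # Y chromosome included

kept_incomplete = host_filter(reads, incomplete_index, K)
kept_complete = host_filter(reads, complete_index, K)

print("kept with incomplete reference:", kept_incomplete)
print("kept with complete reference:  ", kept_complete)
```

With the incomplete reference, the Y-derived read passes the filter alongside the microbial read and would move on to microbial classification. With the complete reference, only the microbial read remains.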
Two notable findings emerged from the analysis: First, the team found that the inclusion of the Y chromosome in the human reference genome reclassified several microbial signatures that had been misassigned in previous studies. When observed DNA fragments are correctly filtered and mapped to microbial taxa, the quantitative diversity metrics more accurately reflect the per-sample microbial diversity. Second, improper host filtration of metagenomic samples can leak sensitive genomic information. In a recent study, Tomofuji et al. re-identified patients from human reads that leaked through fecal metagenomic data, matching them to blood-derived genotype data from the same individuals. Inclusion of complete reference genomes that incorporate the Y chromosome and application of the advanced host filtration workflows demonstrate a state-of-the-art capability in preserving subject privacy.
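To make the first finding concrete, the sketch below shows how leaked human reads that are misassigned to a spurious taxon can distort a diversity metric. It uses the Shannon index as an example metric, which is an assumption rather than a statement about which metrics the study reported. The taxon names and read counts are hypothetical.

```python
import math


def shannon(counts):
    """Shannon diversity (natural log) of a mapping taxon -> read count."""
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log(c / total)
        for c in counts.values()
        if c > 0
    )


# Hypothetical low-biomass sample after complete host filtration.
clean_profile = {"Taxon A": 50, "Taxon B": 50}

# Same sample when leaked human reads are mislabeled as a third taxon.
contaminated_profile = {"Taxon A": 50, "Taxon B": 50, "Spurious taxon": 50}

clean_diversity = shannon(clean_profile)
contaminated_diversity = shannon(contaminated_profile)

print(f"clean: {clean_diversity:.3f}")
print(f"contaminated: {contaminated_diversity:.3f}")
```

In this toy case the spurious taxon raises the index from ln 2 to ln 3. If the leaked reads come from the Y chromosome, the inflation would appear only in samples from male subjects, which is one way an artifactual sex difference can arise.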
“The study’s findings improve our ability to understand microbiome diversity while limiting privacy issues, and we view them as opening up a whole new phase of research on samples with low microbial biomass,” commented corresponding author Rob Knight, the CMI Faculty Director and Professor of Pediatrics, Bioengineering, Computer Science & Engineering and Data Science at UC San Diego. “The method is a major advance from both a technical and ethical perspective.”
Additional co-authors include Lucas Patel, Daniel McDonald, Antonio Gonzalez, Gregory D. Sepich-Poore, Yang Chen, Amanda Hazel Dilmore, Neil Damle, George Hightower, Teruaki Nakatsuji, Richard L. Gallo, and Kit Curtius, all at UC San Diego; Yoshihiko Tomofuji, Kyuto Sonehara, and Yukinori Okada at the University of Tokyo; Mohsen Zakeri and Ben Langmead at Johns Hopkins University; and Sergio E. Baranzini at University of California San Francisco.
CMI is proud to include Richard L. Gallo as a faculty member and Rob Knight on its leadership team.
About the UC San Diego Center for Microbiome Innovation: UC San Diego is a world leader in microbiome research, biomedical engineering, quantitative measurements and modeling, cellular and chemical imaging, drug discovery, “omics” sciences, and much more. CMI leverages the university’s strengths to draw interdisciplinary teams of researchers together and push the boundaries of the human understanding of microbiomes: the distinct constellations of bacteria, viruses, and other microorganisms that live within and around humans, other species, and the environment.
The CMI partners with top companies in a variety of industries from around the world, creating a bridge with UC San Diego faculty, researchers, graduate students, and postdoctoral researchers to collaborate on projects that are transforming the way the microbiome is studied. Through these collaborations, we develop the talent and technology the industry needs for the future of microbiome research.
This work was supported by: AGA Research Foundation (AGA Research Scholar Award AGA2022-13-05) and NIH grant R01 CA270235 to K.C. The study was supported in part by the NIDDK-funded San Diego Digestive Diseases Research Center (P30 DK120515) to K.C. Additionally, this work was supported by NIH grants (R01 CA241728, P30 CA023100, NIH/NIGMS T32GM007198, NIH Pioneer DP1AT010885), the National Cancer Institute (NCI U24CA248454), and CDC award 75D301-22-C-14717 to R.K. The study was supported in part by R21HG013433 to B.L. This study was supported in part by JSPS KAKENHI (22H00476), AMED (JP24km0405217, JP24ek0109594, JP24ek0410113, JP24kk0305022, JP243fa627002, JP243fa627010, JP243fa627011, JP24zf0127008, JP24tm0524002, JP24wm0625504, JP24gm1810011), and JST Moonshot R&D (JPMJMS2021, JPMJMS2024) to Y.O., with additional support from Takeda Science Foundation, Ono Pharmaceutical Foundation for Oncology, Immunology, and Neurology, Bioinformatics Initiative of Osaka University Graduate School of Medicine, Institute for Open and Transdisciplinary Research Initiatives, Center for Infectious Disease Education and Research (CiDER), and Center for Advanced Modality and DDS (CAMaD), Osaka University. This project was enabled in part by the Alzheimer’s Gut Microbiome Project (AGMP), supported by the National Institute on Aging grants 1U19AG063744 and 3U19AG063744-04S1, awarded to Dr. Kaddurah-Daouk at Duke University in partnership with multiple academic institutions. As such, the investigators within the AGMP who are not listed among this publication’s authors provided analysis-ready data but did not participate in designing the study, conducting the analyses, or writing this manuscript. A listing of AGMP Investigators can be found at: https://alzheimergut.org/meet-the-team/. A complete listing of the AD Metabolomics Consortium (ADMC) investigators can be found at: https://sites.duke.edu/adnimetab/team/.
We thank Cameron Martino for his support and advice throughout this project.
Disclosures: D.M. is a consultant for BiomeSense, Inc., has equity, and receives income. The terms of these arrangements have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. G.H. is the recipient of the Robert A. Winn Diversity in Clinical Trials: Career Development Award, which is partly funded by the Bristol Myers Squibb Foundation. B.L. is the owner of InOrder Labs LLC. K.C. has research grant support from Phathom Pharmaceuticals. R.K. is a scientific advisory board member and consultant for BiomeSense, Inc., has equity, and receives income. He is a scientific advisory board member and has equity in GenCirq. He is a consultant for DayTwo and receives income. He has equity in and acts as a consultant for Cybele. He is a co-founder of Biota, Inc., and has equity. He is a co-founder of Micronoma, has equity, and is a scientific advisory board member. The terms of these arrangements have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. The remaining authors declare no competing interests.