Official study (English):
Introducing the Y-chromosomal Ancestral-like Reference Sequence—Improving the Capture of Human Evolutionary Information
Territory/Diaspora:
World
Year: 2025
Abstract: Reference sequences are essential for reproducible genetic analyses but are often chosen without regard to evolutionary relevance within the analyzed species. The human Y chromosome is widely used in evolutionary studies, yet current references represent evolutionarily young sequences, which can cause misleading variant calling. To address this issue, we constructed a Y-chromosomal ancestral-like reference sequence to improve the detection of evolutionarily informative variants on the Y chromosome. The Y-chromosomal ancestral-like reference sequence was constructed by applying a weighted maximum parsimony approach to human and primate Y chromosome sequences. To benchmark the performance of the Y-chromosomal ancestral-like reference sequence, 40 Y chromosome short-read sequences from diverse haplogroups were aligned to the Y-chromosomal ancestral-like reference sequence and to existing references (GRCh37, GRCh38, and T2T-CHM13). Overall, the Y-chromosomal ancestral-like reference sequence yielded the highest and most consistent number of SNPs per sample (mean = 1,400; SD = 77), whereas the other references yielded fewer variants on average (mean = 866 to 968) and showed greater variability across samples (SD = 457 to 531), depending on each sample's phylogenetic distance from the reference. Additionally, alignments to the Y-chromosomal ancestral-like reference sequence yielded only SNPs with evolutionarily derived alleles, whereas alignments to the other references yielded, on average, 46% SNPs with ancestral alleles. This study demonstrates how the existing reference sequences fail to capture the full range of evolutionary information on the Y chromosome. The Y-chromosomal ancestral-like reference sequence improves the capture of evolutionary information on the Y chromosome, making it a valuable resource for various evolutionary applications, such as TMRCA estimations and phylogenetic analyses.
Finally, alongside the Y-chromosomal ancestral-like reference sequence, we provide a publicly available tool, polaryzer, to annotate variants as ancestral or derived in pre-aligned Y chromosome data.
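To illustrate what annotating variants as ancestral or derived means, here is a minimal Python sketch. It is not polaryzer's interface or algorithm: the `Snp` class, the `polarize` function, the positions, and the alleles are all hypothetical and chosen only for illustration. It assumes the ancestral allele at each position is already known, for example from an ancestral-like reference.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Snp:
    """A called SNP (hypothetical structure, for illustration only)."""
    pos: int  # 1-based position on the Y chromosome
    ref: str  # allele in the reference the reads were aligned to
    alt: str  # alternative allele called in the sample


def polarize(snp: Snp, ancestral: Dict[int, str]) -> str:
    """Label the allele the sample carries as ancestral or derived."""
    anc = ancestral.get(snp.pos)
    if anc is None:
        return "unknown"      # no ancestral state available at this position
    if snp.alt == anc:
        return "ancestral"    # the reference itself carries the derived allele
    if snp.ref == anc:
        return "derived"      # the sample differs from the ancestral state
    return "unknown"          # neither allele matches the ancestral state


# Made-up ancestral alleles and SNP calls.
ancestral_alleles = {100: "A", 200: "G"}
calls: List[Snp] = [
    Snp(pos=100, ref="A", alt="C"),
    Snp(pos=200, ref="T", alt="G"),
    Snp(pos=400, ref="C", alt="T"),
]
labels = [polarize(s, ancestral_alleles) for s in calls]
print(labels)
```

The second call shows the situation the abstract describes for evolutionarily young references: the reference carries the derived allele, so the sample's difference from it is reported as a SNP even though the sample has the ancestral allele. When the reference itself is ancestral-like, `ref` matches the ancestral allele wherever that allele is known, so called SNPs fall into the "derived" branch.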