The Midpoint Effect in G25 and PCA

When you look at a PCA plot and see Kazan Tatars sitting far below Russians on the East Asian axis, or Somalis clustering midway between Yoruba and Lebanese, something systematic is happening. This isn't noise, and it isn't a flaw in the data. It is a fundamental mathematical property of admixture itself: the midpoint effect.

Understanding this effect is essential for anyone who wants to interpret G25 results honestly, whether as a reader comparing population charts, a hobbyist running Vahaduo models, or a researcher evaluating calculator outputs. Three population groups illustrate the principle with exceptional clarity: the Tatars of Eurasia, the Anatolian Turks, and the populations of the Horn of Africa.

1. The Principle: How Admixture Creates Midpoints in PCA Space
2. Case Study 1, The Tatars: A West-East Eurasian Gradient
3. Case Study 2, Anatolian Turks: West Eurasian with a Turkic Thread
4. Case Study 3, Horn of Africa: Between the Sahel and Arabia
5. Consequences for Vahaduo Admixture Modeling
6. G25 Coordinates: Key Populations

1. The Principle: How Admixture Creates Midpoints in PCA Space

Principal Component Analysis (PCA) works by reducing thousands of genetic variant frequencies into a small number of summary dimensions. Each dimension is a linear combination of allele frequencies across the genome. Because of this linearity, the position of any population in PCA space is, to a close approximation, the weighted average of the positions of its ancestral source populations.

Put simply: if a population is 60% descended from Group A and 40% descended from Group B, it will plot approximately at the point 0.60 × A + 0.40 × B. The admixed population sits between its sources, at a position determined entirely by the mixing proportions, not by geography, language, or cultural affiliation.

Figure 1. The midpoint principle: admixed populations always plot between their source groups on any PCA dimension. The exact position depends only on the mixing proportions.
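The weighted-average rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of any G25 tool: the function name `mix` and the three-dimensional coordinates for Group A and Group B are made up for the example and are not real G25 values.

```python
def mix(sources, weights):
    """Weighted average of source coordinates, one value per PCA dimension."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixing proportions must sum to 1")
    dims = len(sources[0])
    return [sum(w * s[d] for s, w in zip(sources, weights)) for d in range(dims)]

# Hypothetical coordinates, for illustration only (not real G25 values)
group_a = [0.10, 0.12, -0.01]
group_b = [0.02, -0.20, 0.03]

# 60% Group A + 40% Group B, as in the example in the text
admixed = mix([group_a, group_b], [0.60, 0.40])
print([round(x, 3) for x in admixed])
```

On every dimension the result falls between the two source values, which is the midpoint effect in its simplest form.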

This is why the midpoint effect has nothing to do with geography. A Kazan Tatar living in the city of Kazan, deep in European Russia, will plot far below ethnic Russians on the East Asian principal component axis, not because of where they live, but because approximately 25 to 35% of their ancestry traces back to Central Asian / Mongolian-related populations, with the remainder largely Eastern European. Their genetic signal is a weighted average of those two deeply distinct ancestral streams.

Common Misconception

"A population that clusters near Europeans on PCA must primarily be European, or must have European-like ancestry."

What the Data Shows

A population's PCA position reflects the centroid of all its ancestry components. Two populations can cluster in similar PCA space via completely different combinations of ancestries. Only the admixture endpoints reveal the true picture.

2. Case Study 1, The Tatars: A West-East Eurasian Gradient

"Tatar" is not a single genetic population; it is a family of historically Turkic-speaking groups who settled across Eurasia at different times and in different environments, mixing locally in each region. As a result, the various Tatar populations form one of the clearest illustrations of the midpoint effect available in modern population data.

At one end of the spectrum, the Mishar Tatars of the Volga-Ural region show high proportions of Eastern European ancestry and plot close to Russians and other Eastern Europeans on PCA, yet still displaced perceptibly downward toward the East Asian axis. The Kazan Tatars, historically the most prominent Tatar group, sit slightly further along that axis. The Lipka Tatars of Belarus and Lithuania, descendants of Tatar settlers who assimilated deeply into Eastern European populations, cluster even closer to Eastern Europeans. At the opposite extreme, the Siberian Tatars of the Tomsk and Ishtyak regions show dramatically higher Central Asian and East Asian admixture and plot substantially toward the Kazakh-Kyrgyz cluster.

Figure 2. G25 PCA plot (PC1 × PC2) illustrating the Tatar midpoint gradient. Tatar populations form a continuous line between the Eastern European cluster and the Kazakh/Central Asian cluster. The exact position of each Tatar group reflects its local admixture history, not its geographic location within Russia.

The gradient is striking and systematic. Mishar and Kazan Tatars cluster closest to Eastern Europeans on the PC2 axis, yet already show clear displacement toward the Kazakh direction. Moving down the chain to Siberian Tatars, the Central Asian / East Asian signal grows dominant. Yet the Mishar, Kazan, and Siberian Tatars all live within Russian territory, so geography tells you nothing about where each group plots. The midpoint position tells you everything.

The Kazan Tatar Midpoint: Verified by the Numbers

Kazan Tatars plot at approximately PC1 = 0.109, PC2 = 0.011 in the G25 scaled coordinate system. A simple weighted interpolation of ~70% Russian (PC1 ≈ 0.131, PC2 ≈ 0.118) and ~30% Kazakh (PC1 ≈ 0.068, PC2 ≈ −0.219) yields PC1 ≈ 0.112, PC2 ≈ 0.017, a close match on both axes. Their position is not accidental: it follows directly from their ancestral proportions.
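The interpolation can be reproduced directly from the coordinates quoted above. The sketch below uses only those PC1/PC2 values and the ~30% Kazakh share from the text; the helper name `interpolate` is chosen for this example. Because it checks two of the 25 G25 dimensions, it is a partial consistency check rather than a full model.

```python
import math

# Scaled G25 coordinates (PC1, PC2) as quoted in the text
russian = (0.131, 0.118)
kazakh = (0.068, -0.219)
kazan_tatar = (0.109, 0.011)

def interpolate(a, b, share_b):
    """Point that is (1 - share_b) of a plus share_b of b, per dimension."""
    return tuple((1 - share_b) * x + share_b * y for x, y in zip(a, b))

# ~70% Russian + ~30% Kazakh, the proportions used in the text
predicted = interpolate(russian, kazakh, 0.30)

# Euclidean distance between the predicted and observed positions
residual = math.dist(predicted, kazan_tatar)

print("predicted:", tuple(round(v, 3) for v in predicted))
print("observed: ", kazan_tatar)
print("residual distance:", round(residual, 4))
```

The residual is small compared with the distance between the Russian and Kazakh reference points, which is what a two-source midpoint would be expected to show.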

~30%: Central Asian / East Asian ancestry in Kazan Tatars

~65%: Eastern European / West Eurasian ancestry in Kazan Tatars

5 groups: genetically distinct Tatar populations on a single continuous gradient

3. Case Study 2, Anatolian Turks: West Eurasian with a Turkic Thread

Modern Turks are the product of one of history's most studied cases of language replacement without demographic replacement. When Turkic-speaking nomads from Central Asia entered Anatolia in the 11th century CE, they brought their language, culture, and religion, but not, it turns out, the majority of the region's gene pool. Ancient DNA from Byzantine-era Anatolia shows that the pre-Turkic population was heavily West Eurasian (Anatolian Neolithic / CHG / Iranian-related ancestry), and most modern Turks continue to reflect this ancestry overwhelmingly.

Nevertheless, the Central Asian / Turkic component, though modest, is unmistakably present, and it is precisely this admixture that creates the midpoint effect. Anatolian Turks cluster below the Greek and Lebanese populations on the East Asian principal component axis, displaced toward the Kazakh direction in proportion to the degree of Central Asian admixture each regional subgroup carries.

Figure 3. G25 PCA showing Anatolian Turkish populations displaced below the Greek / Lebanese / Saudi cluster toward the Kazakh direction, reflecting their Central Asian Turkic admixture. Trabzon and Erzurum Turks (Northeast Anatolia) cluster closest to Europeans; Yörük and Southeastern Turks show more South/East Eurasian pull.

Several patterns emerge from this Turkish PCA landscape. First, the internal variation within Turkey is substantial. Turks from the Northeast (Trabzon, Erzurum, Rize), descended from Pontian Greek and Armenian populations who converted and assimilated during the Ottoman period, cluster very close to Greeks and other Europeans, with minimal Central Asian signal. Turks from Western Anatolia (Istanbul, Izmir, Manisa) occupy an intermediate position, reflecting the historically cosmopolitan demographic mixing of those regions. The Yörük nomadic populations show somewhat more Central Asian ancestry, consistent with their closer cultural and genealogical connection to the original Turkic migrations.

Misconception

"Turks are ~80 to 90% West Eurasian with only ~10 to 15% Turkic ancestry." This figure is often cited from models using modern Central Asian populations as proxies.

What Medieval Proxies Show

When modeled against medieval sources (Byzantine Anatolian + Medieval Turkic Nomad), the Turkic component rises to ~28 to 46%. The discrepancy reveals the midpoint effect acting on the source population itself.

The Turkic Source Is Itself a Midpoint: A Double Midpoint Effect

This is where the Turkish case becomes a particularly rich illustration of the midpoint effect. The Medieval Turkic Nomads of Western Altai were not a pure East Asian population. Ancient DNA shows they were themselves admixed: roughly half West Eurasian (Iranian / Steppe MLBA) and half East Asian / Siberian in origin. They were, in genetic terms, already a midpoint population before they ever reached Anatolia.

This creates a double midpoint structure in modern Anatolian Turks:

Figure 3b. The double midpoint structure of Anatolian Turks. The Medieval Turkic migrants were already an admixed West + East Eurasian population before reaching Anatolia. Modern Anatolian Turks are therefore a "midpoint of a midpoint", which is why the apparent Turkic % varies dramatically depending on the proxy chosen.

Anatolian Turks: Medieval-Source Model (Akbari et al. 2026)

Using medieval aDNA proxies (Pre-Turkic Anatolian, Medieval Turkic Nomad, Medieval Caucasian, Medieval Iranian) produces a substantially different and more historically accurate picture than using modern Central Asian proxies.

Model sources (chart legend): Pre-Turkic Anatolian, Medieval Turkic Nomad, Medieval Caucasian, Medieval Iranian.

Medieval Turkic Nomad share by population:

Trabzon Turks: 0% Turkic

Istanbul (Anatolian profile): 21.4% Turkic

Istanbul (Balkan/Rumelian profile): ~4% Turkic*

Ankara Turks: 27.8% Turkic

Mersin Yörük: 38.4% Turkic

Manisa Yörük: 42.2% Turkic

Note on Istanbul: the city is not genetically homogeneous. It draws individuals from across Turkey, and historically from the entire Ottoman world. G25 data shows two clearly distinct sub-profiles: an Anatolian Turk profile (more Turkic Nomad, more Iranian/Caucasian input) and a Balkan Turk (Rumelian) profile (more Greek/Balkan European, minimal Turkic Nomad). The Istanbul (Anatolian profile) figure reflects the Anatolian Turk sub-profile only (n=8 in the Akbari model). A Rumelian Istanbul individual would look markedly different.

Source: Akbari et al. 2026 calculator, medieval aDNA proxies. Trabzon Turks show 0% Medieval Turkic; their displacement on PCA reflects the high Medieval Caucasian component (46%), consistent with post-Byzantine Pontian Greek and Laz assimilation. *The Istanbul Balkan/Rumelian estimate is derived from G25 PCA interpolation (Balkan_Turk_(Istanbul) clusters near Greeks with PC2 ≈ 0.098 vs. Anatolian Turk Istanbul PC2 ≈ 0.087), not from the Akbari model directly; exact values would require individual-level sampling of Rumelian-origin Istanbul residents.

Figure 3c. Istanbul's internal heterogeneity on G25 PCA. The city contains at minimum two genetically distinguishable sub-populations: Rumelian/Balkan-origin individuals (clustering near Greeks, minimal Medieval Turkic) and Anatolian-origin individuals (displaced further toward the Kazakh axis). An "average Istanbul" data point blurs this meaningful distinction.

Two Ways to Read the Same Data

The chart below places the two representations side by side for each population. The left bar shows the 4-source medieval model (historical detail); the right bar reduces everything to a simple West Eurasian vs East Eurasian split, computed by treating each Medieval Turkic Nomad percent as ~50% East Eurasian + ~50% West Eurasian (consistent with ancient DNA evidence for that source population).

Figure 3d. Side-by-side comparison of the medieval 4-source model (left) and the simplified West vs East Eurasian decomposition (right). The East Eurasian signal in each population is derived by treating the Medieval Turkic Nomad component as ~50% East Eurasian, consistent with ancient DNA from Western Altai medieval Turkic individuals. Istanbul is shown as two distinct sub-profiles. Note the stark contrast between the two Istanbul profiles: the Rumelian profile carries barely any East Eurasian signal (~2%), while the Anatolian profile reaches ~11%.
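The simplified West vs East Eurasian split is plain arithmetic, sketched below. It takes the Medieval Turkic Nomad percentages listed for each population and applies the text's ~50% assumption. It also assumes, as the simplified split does, that the other three sources count entirely as West Eurasian. The constant and dictionary names are chosen for this example.

```python
# The text treats the Medieval Turkic Nomad source as ~50% East Eurasian
EAST_SHARE_OF_TURKIC_SOURCE = 0.5

# Medieval Turkic Nomad percentages as listed in the text
turkic_percent = {
    "Trabzon Turks": 0.0,
    "Istanbul (Balkan/Rumelian profile)": 4.0,  # approximate, PCA-derived
    "Istanbul (Anatolian profile)": 21.4,
    "Ankara Turks": 27.8,
    "Mersin Yörük": 38.4,
    "Manisa Yörük": 42.2,
}

# East Eurasian share = Turkic share x 0.5; everything else is treated as West Eurasian
east_percent = {
    name: round(pct * EAST_SHARE_OF_TURKIC_SOURCE, 1)
    for name, pct in turkic_percent.items()
}

for name, east in east_percent.items():
    west = round(100 - east, 1)
    print(f"{name}: ~{east}% East Eurasian, ~{west}% West Eurasian")
```

The two Istanbul profiles come out at roughly 2% and 11% East Eurasian, matching the contrast noted in the caption.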

Why the numbers differ so much from earlier estimates: modern populations such as Kazakhs or Uzbeks, when used as "Central Asian Turkic" proxies, do not have the same West/East Eurasian balance as the medieval Turkic migrants, who were themselves roughly half West Eurasian. A proxy that is more East Asian than the true source needs a smaller share to explain the same eastward displacement, so a model built on such a proxy overestimates the West Eurasian component and underestimates the Turkic one. The medieval proxies from Akbari et al. (2026), which directly sample the actual migration-era populations, give a substantially higher and historically more credible Turkic estimate.

4. Case Study 3, Horn of Africa: Between the Sahel and Arabia

The populations of the Horn of Africa, including Somalis, Ethiopians (Amhara, Oromo), and Afar, represent one of the most visually compelling examples of the midpoint effect in global population genetics. On a PCA plot, they sit in an otherwise largely empty stretch of genetic space, precisely bridging the vast distance between Sub-Saharan Africans and Middle Eastern / Arabian populations.

This position is not a coincidence or an artifact. The populations of the Horn carry a well-documented admixture of two deep ancestry streams: an ancient sub-Saharan African component related to East African hunter-gatherers and early pastoralists, and a substantial West Eurasian / Levantine component that entered the region from Arabia via multiple waves over the past several thousand years (including the so-called Cushitic and pre-Cushitic migrations as well as more recent flow). The proportions of these two streams vary systematically across the populations of the region, producing a clear gradient.

Figure 4. Horn of Africa populations as a midpoint between Sub-Saharan Africans and Near Easterners/Arabians on G25 PCA (PC1 × PC2). Somalis and Oromo cluster at approximately the 50/50 midpoint; Amhara and Afar are displaced slightly toward the West Eurasian end. None of these populations cluster near their closest geographic neighbors; their position is determined by genetic admixture, not location.

The midpoint verification here is just as precise as in the Tatar case. Somalis (PC1 ≈ −0.297, PC2 ≈ 0.094) plot at approximately the weighted mean of Yoruba (PC1 ≈ −0.630, PC2 ≈ 0.063) and Saudi Arabia (PC1 ≈ 0.053, PC2 ≈ 0.140), consistent with roughly 50 to 55% sub-Saharan African and 45 to 50% Levantine/West Eurasian ancestry. Amhara and Afar populations show slightly higher West Eurasian proportions (~55 to 60%), consistent with a longer or more intense history of population movement from the Arabian Peninsula.
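The same check can be run in reverse: instead of assuming the proportions, estimate them from the coordinates. The sketch below projects the Somali position onto the Yoruba-Saudi line using the PC1/PC2 values quoted above. The helper name `share_of_b` is made up for this example, and a two-dimensional projection is only a rough stand-in for a full 25-dimension model.

```python
# Scaled G25 coordinates (PC1, PC2) as quoted in the text
yoruba = (-0.630, 0.063)
saudi = (0.053, 0.140)
somali = (-0.297, 0.094)

def share_of_b(a, b, target):
    """Least-squares position of target along the a -> b line, clipped to [0, 1]."""
    ab = [y - x for x, y in zip(a, b)]       # direction from source a to source b
    at = [t - x for x, t in zip(a, target)]  # offset of the target from source a
    t = sum(p * q for p, q in zip(ab, at)) / sum(p * p for p in ab)
    return min(1.0, max(0.0, t))

west_eurasian_share = share_of_b(yoruba, saudi, somali)
print(f"West Eurasian-related share: {west_eurasian_share:.1%}")
print(f"Sub-Saharan African share:   {1 - west_eurasian_share:.1%}")
```

The estimate lands a little under the halfway point, in line with the 45 to 50% range given in the text.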

Key insight: If you did not know where Somalis lived and saw only their PCA position, you might guess they were an admixed population from somewhere in the Arabian Sea basin, halfway between the African continent and the Levant. You would be right, genetically speaking. The midpoint effect is geography-blind: it only reflects what went into the genome.

Horn of Africa Populations: Approximate Ancestry Proportions

Chart legend: Sub-Saharan African (E. African related); West Eurasian / Levantine / Arabian

Oromo: ~44% Levantine

Somali (Somalia): ~48% Levantine

Amhara: ~59% Levantine

Afar: ~61% Levantine

Figures derived from PCA interpolation between Yoruba and Saudi reference clusters on G25 PC1/PC2. Actual admixture models with specific sources will yield different breakdowns depending on the source populations chosen.

5. Consequences for Vahaduo Admixture Modeling

The midpoint effect has direct, practical consequences for anyone running admixture models on Vahaduo or similar NNLS-based tools. The core issue can be stated simply: if you use an admixed population as a source proxy, you import its midpoint position into your model, and this systematically distorts the results.
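A toy example makes the distortion concrete. The sketch below is not Vahaduo's actual algorithm: it is a two-source least-squares fit on PC1/PC2 only, using the Russian, Kazakh, and Kazan Tatar coordinates quoted earlier in this article. The target is hypothetical, constructed as exactly 90% Russian-like plus 10% Kazakh-like, and the function name `two_source_fit` is chosen for this example.

```python
# Scaled G25 coordinates (PC1, PC2) as quoted earlier in the article
russian = (0.131, 0.118)
kazakh = (0.068, -0.219)
kazan_tatar = (0.109, 0.011)  # itself admixed between the two above

def two_source_fit(a, b, target):
    """Best-fitting share of source b when the target is modeled as a + b only."""
    ab = [y - x for x, y in zip(a, b)]
    at = [t - x for x, t in zip(a, target)]
    t = sum(p * q for p, q in zip(ab, at)) / sum(p * p for p in ab)
    return min(1.0, max(0.0, t))

# Hypothetical target: exactly 90% Russian-like + 10% Kazakh-like
target = tuple(0.9 * r + 0.1 * k for r, k in zip(russian, kazakh))

share_kazakh = two_source_fit(russian, kazakh, target)      # unadmixed eastern source
share_tatar = two_source_fit(russian, kazan_tatar, target)  # admixed proxy instead

print(f"Modeled with Kazakh as eastern source:      {share_kazakh:.1%}")
print(f"Modeled with Kazan Tatar as eastern source: {share_tatar:.1%}")
```

With the admixed proxy, the fitted share comes out at roughly three times the true 10%, because each unit of the proxy carries only part of the eastern signal. The percentage a model reports is always relative to the sources it was given.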

The Proxy Problem

Suppose you are modeling a modern Turkish individual and you include Kazakh Tatars as your "Central Asian" source. Kazakh Tatars are themselves admixed (West Eurasian + East Asian), so using them as a proxy for "steppe/Central Asian ancestry" actually introduces a blended signal. The model will try to reach
