Sunday 31 July 2022

FamilyTreeDNA "Discover" and TMRCA estimates for I-Y33765

In previous articles I have discussed Time to Most Recent Common Ancestor (TMRCA) estimates for the phylogenetically important mutations in our I-Y33765 clade and how we can calculate these time intervals using the mutation rate of SNP and STR Y-DNA genetic markers. Being able to calculate approximate TMRCA is obviously a great benefit when we are researching the supposed historical context of events involved in our genealogy. 

One characteristic of the calculations we have used for obtaining TMRCA intervals is that they have relied upon either SNP-only or STR-only methods. It has been reported (Balanovsky, 2017) that while experimentally obtained SNP marker mutation rates largely overlap and produce usable TMRCA estimates for both genealogy and evolutionary studies, STR markers are more useful when constructing the fine-scale phylogenies that are typically associated with genealogical pedigrees. However, both types of marker have their own inherent problems when they are used as the basis for TMRCA estimations.

First, when using SNP markers to estimate periods within the genealogical timescale it is important to take care to select only the very high confidence mutations. Unfortunately these may not always be available or easy to identify with confidence.  Because genealogical timescales are relatively short, the omission or addition of just a single SNP can significantly alter estimates. Next, when using STR markers the major difficulty is presented by convergent and/or multiple mutations which can introduce error that again can be hard to identify with confidence, especially when a small number of individuals are being compared.  
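To make the sensitivity to a single SNP concrete, here is a minimal sketch of the simplest kind of SNP-only point estimate, in which the TMRCA is just the SNP count multiplied by an average number of years per SNP. The value of YEARS_PER_SNP and the function names are placeholders I have chosen for illustration only; they are not a measured rate for our clade or for any particular test.

```python
# Illustrative placeholder only: NOT a measured mutation rate.
YEARS_PER_SNP = 100.0


def snp_tmrca_years(snp_count: int, years_per_snp: float = YEARS_PER_SNP) -> float:
    """Point estimate of TMRCA: SNP count times an assumed average years per SNP."""
    return snp_count * years_per_snp


def relative_change_from_one_snp(snp_count: int) -> float:
    """Fractional change in the estimate if one more SNP is counted."""
    base = snp_tmrca_years(snp_count)
    return (snp_tmrca_years(snp_count + 1) - base) / base


# Compare a short (genealogical) branch with a long (evolutionary) one.
for n in (3, 30):
    change_pct = relative_change_from_one_snp(n) * 100
    print(f"{n} SNPs: {snp_tmrca_years(n):.0f} years, one extra SNP shifts this by {change_pct:.1f}%")
```

With only three SNPs on a branch, one extra or missing SNP moves the estimate by a third, whereas on a thirty-SNP branch the same error is only about three percent. That is why SNP selection matters so much more on genealogical timescales.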

For these reasons, the confidence intervals for our estimates are wide, and there is a consequent need for an alternative method of estimating TMRCA that would give improved accuracy.  In 2021, a paper was published in Genes by Iain McDonald (McDonald, 2021) in which he presented a novel mathematical approach that combines probability calculations based on SNP and STR markers with other probabilities derived from historical data and from ancient DNA to achieve more precise and accurate TMRCA intervals for genealogy.

Earlier this month (July 2022) FamilyTreeDNA announced the release of an online feature they call "Discover" that provides "information about the haplogroup from your Y-DNA test".  The application is available to FTDNA customers and to others who simply need to register with that company. The tool allows users to input a Y-DNA haplogroup designation from which the application generates report pages which give a summary of relevant information, including geographic frequency, notable related individuals, migration routes and ancient DNA examples.  In addition, a section called "scientific details" gives options that list the base variants associated with that haplogroup, show its position within the Y-DNA haplotree and, most importantly for our interests, provide an estimation of the TMRCA at various confidence levels.  In presenting this latter feature (see Figure 1) the page rubric explains that it "is calculated based on SNP and STR test results from many present-day DNA testers" and "the state-of-the-art FamilyTreeDNA algorithm for inferring age estimates for the Y-DNA Haplotree. [was] Developed together with Iain McDonald."   

It seems to me this description and the helpful credit provide an indirect but clear reference to the combined probability model Iain McDonald describes in detail in his paper mentioned above.  As a result I think it worth briefly mentioning the significant advantages shown in McDonald's work now that we can all make use of the user-friendly version of his algorithm as it is provided by the FTDNA application.

In his paper McDonald describes the mathematical basis for a method which merges "the Y-SNP and Y-STR molecular clocks, and takes into account other available evidence (e.g., ancient DNA, proven paper genealogies, relatedness through autosomal DNA, etc)."  He demonstrates his revised algorithm using four examples.  For three of them he generates data that illustrate DNA ancestry in colonial America, in historical Scotland and Ireland, and in medieval or prehistoric Europe; for the fourth he uses real data from royal Stuart lineages.  With each of these example data sets he illustrates how his combined method improves the precision and accuracy of the TMRCAs compared to either STR-only or SNP-only methods.

McDonald writes that "the most significant improvements in the precision of the TMRCAs come from the ability to combine both STR and SNP mutations into a single calculation", and he notes that, in the future, better-defined STR and SNP mutation rates offer the greatest likelihood of gaining further benefits from his combined method over either STR-only or SNP-only TMRCA calculations.
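The general idea of combining two molecular clocks into a single calculation can be sketched in a few lines. What follows is a deliberately simplified toy, not McDonald's algorithm and not the FamilyTreeDNA implementation: it treats SNP and STR mutation counts as independent Poisson processes, assumes a flat prior over a grid of candidate ages, and ignores the convergent and multi-step STR mutations discussed earlier. All rates and observed counts are hypothetical values chosen only for illustration.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of observing k mutations when `mean` are expected."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def posterior(likelihoods):
    """Multiply independent likelihood curves point by point, then normalise (flat prior)."""
    combined = [math.prod(values) for values in zip(*likelihoods)]
    total = sum(combined)
    return [c / total for c in combined]


def central_interval(years_grid, probs, level=0.95):
    """Return (low, high) years enclosing the central `level` of the probability."""
    tail = (1.0 - level) / 2.0
    cumulative = 0.0
    low, high = None, years_grid[-1]
    for year, p in zip(years_grid, probs):
        cumulative += p
        if low is None and cumulative >= tail:
            low = year
        if cumulative >= 1.0 - tail:
            high = year
            break
    return low, high


# Hypothetical inputs, for illustration only.
SNP_PER_YEAR = 1 / 100   # assumed: one SNP per 100 years on average
STR_PER_YEAR = 1 / 50    # assumed: one STR mutation per 50 years across the panel
OBSERVED_SNPS = 4
OBSERVED_STRS = 8

grid = list(range(10, 2001, 10))  # candidate TMRCA values in years
snp_like = [poisson_pmf(OBSERVED_SNPS, SNP_PER_YEAR * t) for t in grid]
str_like = [poisson_pmf(OBSERVED_STRS, STR_PER_YEAR * t) for t in grid]

snp_only = central_interval(grid, posterior([snp_like]))
combined = central_interval(grid, posterior([snp_like, str_like]))
print("SNP-only 95% interval:", snp_only)
print("Combined 95% interval:", combined)
```

Under these made-up inputs the combined interval comes out narrower than the SNP-only one, simply because two independent sources of evidence constrain the age more tightly than one. The published method goes well beyond this, for example by adding constraints from ancient DNA and paper genealogies.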

Figure 1: FamilyTreeDNA Discover presentation of TMRCA probability for I-Y33765
 
So it seems these advantages and possibilities are very encouraging, especially now that we all have easy access to an online tool that produces TMRCA periods based on this McDonald combined probability model.  
 
Now let us turn to how these developments may help our I-Y33765 research. When we compare the previously calculated SNP-only TMRCA estimates for the haplogroups within our clade with those produced by the FamilyTreeDNA "Discover" app (Table 1), we can see that it has provided some useful improvement.

Table 1: Comparison of TMRCA estimates for haplogroups within the I-Y33765 clade

At present within our clade we have only two haplogroups, I-Y33761 and I-BY198548, for which the dates are known from documentary sources. When we compare the "Discover" TMRCA estimates for both of these haplogroups with those obtained using our normal SNP-only methodologies (Table 1), there are improvements in the precision and accuracy achieved for both.  It seems to me that comparison against documented dates is the most obvious practical way to judge the new FTDNA algorithm, and based on this result I consider the new methodology definitely helpful.

In addition, the 95% confidence interval for the "Discover" estimates is significantly narrower than that of the YFull SNP-only method.  Lastly, the mean dates given by the "Discover" application are broadly similar to those we have obtained using the clade-specific SNP mutation rate that we calculated directly from the nineteen-generation Nils Swensson (1631-1713) pedigree. Because of these several positive indicators I have updated our draft I-Y33765 chart (Figure 2) using the TMRCA dates highlighted in Table 1.  As you can see, the majority of these are taken from the FTDNA "Discover" application.


 Figure 2: I-Y33765 draft chart, July 2022 


References

Balanovsky, O. (2017) Toward a consensus on SNP and STR mutation rates on the human Y-chromosome, Human Genetics, 136, 575-590

McDonald, I. (2021) Improved models of coalescence ages of Y-DNA haplogroups, Genes, 12, 862
