Many criminal investigations, including “cold cases,” do not have a suspect but do have DNA evidence. In these cases, a genetic profile can be obtained from the forensic specimens at the crime scene and electronically compared to profiles listed in criminal DNA databases. If the genetic profile of a forensic specimen matches the profile of someone in the database, depending on other kinds of evidence, that individual may become the prime suspect in what was heretofore a suspect-less crime.
Searching DNA databases to identify potential suspects has been a critical part of criminal investigations ever since the FBI reported its first “cold hit” in July 1999, linking six sexual assault cases in Washington, D.C., with three sexual assault cases in Jacksonville, Florida. The match of the genetic profiles from the evidence samples with an individual in the national criminal database ultimately led to the identification and conviction of Leon Dundas.
How the statistical significance of a match obtained through a database search is presented to the jury should, in my view, be straightforward but, given the adversarial nature of our criminal justice system, remains contentious. One view is that if the evidence profile matches that of a suspect identified by the database search, then the estimated population frequency of that particular genetic profile (equivalent to the Random Match Probability in a non-database search case) is still the relevant statistic to be presented to the jury. The Random Match Probability (RMP) is an estimate of the probability that a randomly chosen individual in a given population would also match the evidence profile. The RMP is estimated as the population frequency of the specific genetic profile, which is calculated by multiplying the estimated genotype frequencies at each individual genetic marker (the “Product Rule”), on the assumption that the markers are inherited independently.
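The Product Rule can be illustrated with a minimal sketch. The genotype frequencies below are hypothetical values chosen only for illustration, not real population data, and the function name is an assumption of this example. The sketch also assumes the markers are independent.

```python
from math import prod


def random_match_probability(genotype_frequencies):
    """Estimate the RMP by multiplying per-marker genotype frequencies.

    Assumes the markers are independent, which is what the Product Rule requires.
    """
    return prod(genotype_frequencies)


# Hypothetical genotype frequencies at five markers (illustrative only).
example_frequencies = [0.10, 0.05, 0.20, 0.08, 0.12]

rmp = random_match_probability(example_frequencies)
print(f"Estimated RMP: {rmp:.2e} (about 1 in {1 / rmp:,.0f})")
```

With only five markers the estimate is already on the order of one in a hundred thousand, which suggests why profiles built from many more markers yield the extremely small RMPs quoted in court.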
An alternative view, often invoked by the defense, is that the RMP should be multiplied by the size of the database. For example, if the RMP is 1/100 million and the database that was searched contains 1 million profiles, this perspective argues that the number 1/100 is the one that should be presented to the jury. This calculation, however, represents the probability of getting a “hit” (match) somewhere in the database and not the probability of a coincidental match between the evidence and the suspect (1/100 million), which is the more relevant metric for interpreting the probative significance of a DNA match. Although these arguments may seem arcane, the estimates that result from these different statistical metrics could be the difference between conviction and acquittal.
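The arithmetic behind the two numbers in this example can be sketched as follows. The function names are hypothetical, and the sketch assumes that every profile in the database is an independent draw from the population and that the true source is not in the database.

```python
from math import expm1, log1p


def expected_coincidental_hits(rmp, database_size):
    """RMP multiplied by database size: the expected number of chance matches."""
    return rmp * database_size


def prob_at_least_one_hit(rmp, database_size):
    """Probability that one or more unrelated profiles match by chance.

    Computes 1 - (1 - rmp) ** database_size in a numerically stable way.
    """
    return -expm1(database_size * log1p(-rmp))


rmp = 1 / 100_000_000       # RMP from the example in the text
database_size = 1_000_000   # number of profiles searched

print(f"RMP: {rmp:.1e}")
print(f"RMP x database size: {expected_coincidental_hits(rmp, database_size):.4f}")
print(f"P(at least one chance hit): {prob_at_least_one_hit(rmp, database_size):.6f}")
```

When the RMP is very small, the product of the RMP and the database size is nearly identical to the probability of at least one chance hit, which is why the 1/100 figure is best read as a statement about the search as a whole and not about any one matching individual.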
There are many different kinds of DNA databases. Ethnically defined population databases are used to calculate genotype frequencies and, thus, to estimate RMPs but are not useful for searching. The first DNA searches were of databases of convicted felons. In some jurisdictions, databases of arrestees have also been established and searched. These searches have recently been expanded to include “partial matches,” potentially implicating relatives of the individuals in the database. This strategy, known as “familial searching,” has been very effective but contentious, with discussions typically focused on the “trade-offs” between civil liberties and law enforcement. In some jurisdictions, the “trade-off” has been between two different controversial criminal database programs. In Maryland, for example, an arrestee database (albeit one specifying arraignment) was allowed but familial searching was outlawed. Familial searching has been critiqued as turning relatives of people in the database into “suspects.” A more accurate description is that these partial matches revealed by familial searching identify “persons of interest” and that they provide potential leads for investigation.
Recently, searching for partial matches in the investigation of suspect-less crimes has expanded from criminal databases to genealogy databases, as applied in the Golden State Killer case in 2018. These databases consist of genetic profiles from people seeking information about their ancestry or trying to find relatives. Genetic genealogy involves constructing a large family tree going back several generations based on the individuals identified in the database search and on genealogical records. Identifying several different individuals in the database whose profiles share a region of DNA with the evidence profile allows a family tree to be constructed. The shorter the shared region between two individuals or between the evidence and someone in the database, the more distant the relationship. This is because genetic recombination, the shuffling of DNA regions that occurs in each generation, reduces the length of shared DNA segments over time. So, in the construction of a family tree, the length of the shared region indicates how far back in time you have to go to locate the common ancestor. Tracing the descendants in this family tree who were in the area when the crime was committed identifies a set of potential suspects.
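The relationship between segment length and genealogical distance can be sketched with a highly simplified model. It assumes that a shared segment separated by m meioses (generational transmissions) has an expected length of about 100/m centimorgans, a common textbook approximation. Real analyses must also account for variation in recombination rates, detection thresholds, and chance, none of which is modeled here. The function names are hypothetical.

```python
def meioses_between_nth_cousins(n):
    """Meioses on the path linking two nth cousins through a shared ancestor.

    Each cousin is n + 1 generations below the common ancestor, so the
    path up one side and down the other has 2 * (n + 1) meioses.
    """
    return 2 * (n + 1)


def expected_segment_length_cm(meioses):
    """Approximate mean shared-segment length in centimorgans (simplified model)."""
    return 100.0 / meioses


for n in range(1, 5):
    m = meioses_between_nth_cousins(n)
    length = expected_segment_length_cm(m)
    print(f"Cousin degree {n}: {m} meioses, expected segment about {length:.1f} cM")
```

Under this sketch, each additional generation back to the common ancestor adds two meioses and shortens the expected shared segment, which is the pattern genealogists exploit when placing matches in a family tree.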
The DNA technologies used in investigative genetic genealogy (IGG) are different from those typically used in analyzing the evidence samples or the criminal database samples, which are based on around 25 short tandem repeat markers (STRs). The genotyping technology used to generate profiles in genealogy databases is based on analyzing hundreds of thousands of single nucleotide polymorphisms (SNPs). With the recent implementation of Next Generation Sequencing technology to sequence the whole genome, even more informative searching for shared DNA regions can be accomplished. (Next Generation Sequencing of the whole genome is so powerful that it can now distinguish identical, or monozygotic, twins!)
IGG has completely upended the trade-offs and guidelines proposed for familial searching as well as many of the arguments. Many of the rationales justifying familial searching of criminal databases, such as the recidivism rate and the presumed relinquishing by convicts of certain rights, do not apply to genealogical databases. Nor do the concerns about racial disparities in criminal databases apply to these non-criminal databases. In general, it’s very hard to draw lines in the sand when the sands are shifting so rapidly and the technology is evolving so quickly. And it is particularly difficult when dramatic successes in identifying the perpetrators of truly heinous unsolved crimes are lauded in the media, making celebrities of the forensic scientists who carried out the complex genealogical analyses that finally led to the arrest of the Golden State Killer and, shortly thereafter, of many others.
It’s still possible and desirable to set some guidelines for IGG, a complex and expensive procedure. It should be restricted to serious crimes. The profiles in the database should be restricted to those individuals who have consented to have their personal genomic data searched for law enforcement purposes. With the appropriate guidelines, the promise of DNA database searching to solve suspect-less crimes can truly transform our criminal justice system.