Volume 19 Number 6 November - December 2006



Statisticians Not Wanted
by Keith Devlin

What statistics should be presented in court so that a jury can fairly evaluate the significance of DNA profile evidence when the defendant was first identified by a “cold hit” search? That is to say, when a person is made a suspect because of a match between DNA found at a crime scene and his or her DNA on file in an existing DNA database, what does that information alone tell us about that person's guilt or innocence? And who is best qualified to make that decision, statisticians or the courts?

These were precisely the questions that led me (a professional mathematician) and a number of other scientists, statisticians, mathematicians and scholars to co-sign an amicus letter filed on July 24, 2006 with the California Supreme Court in the case of People v. Michael Johnson, No. S144821.

We were worried by previous court rulings in this and other DNA cold hit cases in which courts seemed to be taking the position that it was their job to decide which statistics were pertinent to the case, without seeking advice from the relevant scientific community, in this case statisticians. Our letter argued for one thing and one thing only: that in cold hit cases, the court should proceed as it does in other kinds of cases where scientific evidence is involved, namely to seek the testimony of the appropriate scientific experts. To our amazement and dismay, the court denied the petition of which our letter was a part.

Behind our letter was the fact that it is not at all obvious what exactly is the appropriate statistic to use in a cold hit case. The figure favored by the FBI and many prosecutors is the so-called “random match probability” (RMP), computed by multiplying together the probabilities of random matches on each of the DNA loci tested. For example, assuming that the probability that the profiles of two randomly chosen, unrelated individuals match on a single, specific DNA locus is one-in-ten (a fairly reasonable assumption), the RMP that two randomly chosen, unrelated individuals have profiles that match on, say, all the 13 loci stored in the FBI's CODIS database is (1/10)^13, or one in ten trillion.
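The arithmetic behind that figure can be sketched in a few lines of Python. This is only an illustration: the function name `random_match_probability` is hypothetical, the one-in-ten per-locus figure is the assumption stated in the text, and the product is meaningful only if the loci really are statistically independent.

```python
from fractions import Fraction
from math import prod

def random_match_probability(per_locus_probs):
    """Multiply per-locus match probabilities together.

    Assumes the loci are independent; the product rule does not
    hold otherwise.
    """
    return prod(per_locus_probs, start=Fraction(1))

# Assumption from the text: a 1-in-10 match probability at each of 13 loci.
per_locus = [Fraction(1, 10)] * 13
rmp = random_match_probability(per_locus)

print(f"RMP = 1 in {rmp.denominator:,}")  # prints "RMP = 1 in 10,000,000,000,000"
```

Exact fractions are used rather than floating point so that the result is exactly one in ten trillion, with no rounding.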

Like many other scientists, I find this computation troubling, since it depends crucially upon the assumed independence (in the sense of probability theory) of the 13 loci, an assumption for which there appears to be no empirical evidence, and there is even some evidence to the contrary. But that was not the point of our amicus brief. Rather, our focus was on which statistic should be presented in court to enable the jury to weigh the DNA match evidence presented following a cold hit identification. The FBI, along with the prosecution in the Johnson case, claims that the RMP is still the right one to present, and that the fact that the defendant was first identified by a DNA database trawl makes no difference. I, along with most other mathematicians and statisticians who have considered the matter, have argued otherwise.

To illustrate why the RMP is an inappropriate and misleading statistic in a cold hit case, consider the following analogy. A typical state lottery will have the odds of winning a major jackpot around 1 in 35,000,000. To any single individual, therefore, buying a ticket is clearly a waste of time, since those odds are effectively nil. But suppose that each week, at least 35,000,000 people actually do buy a ticket. (This is a realistic example.) Then every one to three weeks, on average, someone will win. The news reporters will go out and interview that lucky person. What is special about that person? Absolutely nothing. The only thing you can say about that individual is that he or she is the one who had the winning numbers. You can draw absolutely no other conclusion. The 1 in 35,000,000 odds tells you nothing about any other feature of that person. The fact that there is a winner reflects the fact that 35,000,000 people bought a ticket — and nothing else. (To put it another way, the 1 in 35,000,000 figure tells you a lot about the lottery — in that sense it is a “relevant” statistic, to use the language the FBI regularly uses in cold hit cases — but, absent other evidence, it tells you nothing about the winner.)
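The lottery numbers can be checked with a short calculation. The sketch below rests on assumptions that are not in the text: exactly 35,000,000 tickets are sold for each weekly draw, and each ticket is an independent random pick. The function names are hypothetical.

```python
import math

def expected_winners(tickets: int, p_win: float) -> float:
    """Mean number of winning tickets in one draw."""
    return tickets * p_win

def prob_at_least_one_winner(tickets: int, p_win: float) -> float:
    """P(at least one winner) = 1 - (1 - p)^N, computed in a numerically stable way."""
    return -math.expm1(tickets * math.log1p(-p_win))

TICKETS = 35_000_000      # tickets sold per weekly draw (figure from the text)
P_WIN = 1 / 35_000_000    # chance that a single ticket wins (figure from the text)

mean_winners = expected_winners(TICKETS, P_WIN)
p_some_winner = prob_at_least_one_winner(TICKETS, P_WIN)
mean_wait_weeks = 1 / p_some_winner  # mean of a geometric waiting time

print(f"expected winners per draw: {mean_winners:.2f}")
print(f"chance of at least one winner: {p_some_winner:.3f}")
print(f"average wait for a winner: {mean_wait_weeks:.2f} weeks")
```

Under these assumptions a given draw produces a winner with probability of roughly 0.63, so a winner turns up about every 1.6 weeks on average. That is in line with the "one to three weeks" in the text, even though each individual ticket is almost certain to lose.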

Compare this to a reporter who hears about a person with a reputation of being unusually lucky, goes along with them as they buy their ticket, and sits alongside them as they watch the lottery result announced on TV. Lo and behold, the person wins. What would you conclude? Most likely, that there has been a swindle. With odds of 1 in 35,000,000, it's impossible to conclude anything else in this situation.

In the first case, the long odds alone tell you nothing about the winning person, other than that they won. In the second case, the long odds tell you a lot. To my mind, a cold hit measured by RMP is like the first case. All it tells you is that there is a DNA profile match. It does not, in and of itself, tell you anything else, and certainly not that the person is guilty of the crime.

On the other hand, if an individual is identified as a crime suspect by means other than a DNA match, then a subsequent DNA match is like the second case. It tells you a lot. Indeed, assuming the initial identification had a rational, relevant basis (like a reputation for being lucky in the lottery case), the long RMP odds against a match could be taken as conclusive. But as with the lottery example, in order for the long odds to have any weight, the initial identification has to be made before the DNA comparison is run (or at least be demonstrably independent of it). Do the DNA comparison first, and those impressive-sounding long odds may be totally meaningless. They simply reflect the size of the relevant population, just as in the lottery case.

But if the RMP is not the right figure to use, what is? In 1989, the FBI urged the National Research Council to carry out a study of the matter. The NRC issued its report in 1992. Titled DNA Technology in Forensic Science, the report is often referred to as “NRC I”. The report's main recommendation regarding the cold hit process is given on page 124:

The distinction between finding a match between an evidence sample and a suspect sample and finding a match between an evidence sample and one of many entries in a DNA profile databank is important. The chance of finding a match in the second case is considerably higher. ... The initial match should be used as probable cause to obtain a blood sample from the suspect, but only the statistical frequency associated with the additional loci should be presented at trial (to prevent the selection bias that is inherent in searching a databank).

In part because of the controversy the NRC I report generated among scientists regarding the methodology proposed, and in part because courts were observed to misinterpret or misapply some of the statements in the report, in 1993, the NRC carried out a follow-up study. That study, The Evaluation of Forensic DNA Evidence (often referred to as “NRC II”), was published in 1996. NRC II's main recommendation regarding cold hit probabilities is:

Recommendation 5.1. When the suspect is found by a search of DNA databases, the random-match probability should be multiplied by N, the number of persons in the database.

The statistic NRC II recommends using is generally referred to as the database match probability, or DMP. NRC II's reasoning is essentially the same logic as I presented for my analogy with the state lottery.
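To see what Recommendation 5.1 does to the numbers, here is a minimal sketch that multiplies an RMP by a database size N. The inputs are illustrative only: the one-in-ten-trillion RMP comes from the 13-locus example earlier in this article, and N = 3,000,000 is the approximate CODIS size mentioned later in it. The function name `database_match_probability` is hypothetical.

```python
from fractions import Fraction

def database_match_probability(rmp: Fraction, database_size: int) -> Fraction:
    """DMP as described in NRC II Recommendation 5.1: the RMP multiplied by N."""
    return rmp * database_size

rmp = Fraction(1, 10**13)   # illustrative RMP from the 13-locus example
n_profiles = 3_000_000      # approximate database size cited in this article

dmp = database_match_probability(rmp, n_profiles)

print(f"RMP: 1 in {rmp.denominator:,}")
print(f"DMP: about 1 in {float(1 / dmp):,.0f}")
```

With these inputs the figure moves from one in ten trillion to roughly one in 3.3 million: still very long odds, but less extreme by a factor equal to the size of the database.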

Since two reports by committees of acknowledged experts in DNA profiling technology and statistical analysis came out strongly against the admissibility of the RMP, one might have imagined that would be the end of the matter, and that judges in a cold hit trial would rule in favor of admitting either the RMP for loci not used in the initial identification (a la NRC I) or else the DMP (a la NRC II), but not the RMP calculated on the full match.

However, not all statisticians agreed with the conclusions of the second NRC report. Most notably, Dr. Peter Donnelly, Professor of Statistical Science at the University of Oxford, took a view diametrically opposed to that of NRC II. According to Donnelly,

I disagree fundamentally with the position of NRC II. Where they argue that the DNA evidence becomes less incriminating as the size of the database increases, I (and others) have argued that in fact the DNA evidence becomes stronger... The effect of the DNA evidence after a database search is two-fold: (i) the individual on trial has a profile which matches that of the crime sample, and (ii) every other person in the database has been eliminated as a possible perpetrator because their DNA profile differs from that of the crime sample. It is the second effect, of ruling out others, which makes the DNA evidence stronger after a database search...

Donnelly advocated using a different statistic, which, while generally close in value to the RMP, results from a very different calculation. Donnelly's proposed calculation was considered by NRC II and expressly rejected. With the experts disagreeing in such a fundamental way, it is scarcely a wonder that judges have become confused as to what number or numbers should be presented as evidence in court.

Two things the statisticians do agree upon, however, are that a DNA profile match following a cold hit search is most definitely not the same as one obtained after a suspect has been identified by other means, and (hence) that the calculation that should be performed after a cold hit search should not be the same as the one carried out in other circumstances.

In the Johnson case, however, the Court of Appeal of the State of California Fifth Appellate District, in an opinion issued on May 25, 2006, declared:

In our view, the means by which a particular person comes to be suspected of a crime — the reason law enforcement's investigation focuses on him — is irrelevant to the issue to be decided at trial, i.e., that person's guilt or innocence.

The court continued a short while later:

...the fact that here, the genetic profile from the evidence sample (the perpetrator's profile) matched the profile of someone in a database of criminal offenders, does not affect the strength of the evidence against appellant... The fact appellant was first identified as a possible suspect based on a database search simply does not matter.


By totally misunderstanding the one issue on which all the experts agree, the court could hardly have gotten things more wrong. The court is so wrong on that issue that its ruling must surely be overturned in due course. But what of the disagreement between the experts regarding which reliability statistic should be used in a cold hit case?

In my view, it is unwise in the extreme (and, as far as I know, not permitted in court) to admit evidence that rests on disputed science, or to base convictions on it. Thus, until the statistics community reaches consensus regarding the appropriate scientific procedure to use, the safest approach would seem to be to adopt a procedure that is free from controversy.

One possibility would be to follow NRC I, taking advantage of much-improved DNA testing technology, and extend the match process to more than 13 loci. Such a move would more than compensate for the increase in the accidental match probability, however it is calculated, which results from a cold hit search. Another option would be to follow the logic of NRC II, and use the DMP (and only the DMP) in court. Since the DMP is generally more favorable to the defendant than is the RMP, this procedure would err on the side of avoiding false convictions. However, with the current magnitude of the DNA databases (around 3,000,000 entries in CODIS), the figure quoted in court would still be astronomical. Neither approach necessarily makes matters any easier for the defendant in a cold hit case. But at least the court would not be acting in ignorance of, let alone counter to, scientifically established fact.
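As a rough gauge of how many extra loci the first option might require, the sketch below finds the smallest number of additional loci whose combined match probability offsets a factor of N. It is one reading of "compensate", not a calculation from either NRC report, and it rests on this article's illustrative assumptions: a one-in-ten match probability per locus, independent loci, and N = 3,000,000. The function name `extra_loci_to_offset` is hypothetical.

```python
from fractions import Fraction

def extra_loci_to_offset(database_size: int, per_locus_prob: Fraction) -> int:
    """Smallest k such that database_size * per_locus_prob**k <= 1.

    Assumes every additional locus is independent of the others and has
    the same match probability.
    """
    if not 0 < per_locus_prob < 1:
        raise ValueError("per_locus_prob must be strictly between 0 and 1")
    k = 0
    factor = Fraction(database_size)
    while factor > 1:
        factor *= per_locus_prob
        k += 1
    return k

extra = extra_loci_to_offset(3_000_000, Fraction(1, 10))
print(f"additional loci needed: {extra}")  # prints "additional loci needed: 7"
```

Under these assumptions, seven loci beyond the original 13 would cancel the factor of three million. Real per-locus probabilities vary, so the true number could differ.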

Keith Devlin is the Executive Director of the Center for the Study of Language and Information at Stanford University. This article was in part adapted from the author's September 2006 column in MAA Online, the monthly online newsletter of the Mathematical Association of America.


 
