The Lindbergh Kidnapping Hoax
Lindbergh's Questionable Identification of Hauptmann
Hearing Voices: Speaker Identification in Court
by Lawrence M. Solan and Peter M. Tiersma
University of California, Hastings College of Law
Hastings Law Journal
January, 2003
Hearing Voices: Speaker Identification in Court
by Lawrence M. Solan* and Peter M. Tiersma**+
Introduction - The Hauptmann Trial
Part I - examines the foundation required for a voice identification to be admissible.
Part II - empirical research regarding the reliability of voice identification
Foreign languages, accents, etc
Part III - role of experts in speaker identification cases
Part IV - suggestions for improving the legal system's handling of voice identification
* Professor of Law and Director, Center for the Study of Law, Language and Cognition, Brooklyn Law School; Ph.D., University of Massachusetts (Linguistics) (1978); J.D., Harvard Law School (1982).
** Professor of Law and Joseph Scott Fellow, Loyola Law School, Los Angeles; Ph.D., University of California, San Diego (Linguistics) (1980); J.D., University of California, Berkeley (1986).
+ We would like to thank Margaret Berger, Ton Broeders, Laurie Levinson and Michael Risinger for their valuable comments, and Mary Ann Buckley, Andrew Lipton, Stacy Rotner and Brendan Ryan of Brooklyn Law School, and Irene Farinas of Loyola Law School for research assistance.
On April 3, 1936, Bruno Richard Hauptmann was executed in Trenton, New Jersey, for kidnapping and murdering the Lindbergh baby. The most dramatic moment in the trial came when Charles Lindbergh identified Hauptmann's voice as that of his baby's kidnapper. Three years earlier, still hoping to get his son back alive, Lindbergh had accompanied Dr. John Condon to St. Raymond's cemetery in the Bronx. Condon had gone there to deliver ransom money to the kidnapper, while Lindbergh waited some seventy to one hundred yards away in the car. n1 Out of the darkness came the words, "Hey, doktor! Over here, over here," pronounced with a foreign accent. n2 Twenty-nine months after the encounter in the cemetery, in September 1934, Lindbergh told a Bronx grand jury that "it would be very difficult to sit here and say that I could pick a man by that voice." n3 Nonetheless, the district attorney asked Lindbergh later that day: "Would you like to see the man who kidnapped your son?" n4 The next morning, while Lindbergh sat in the back of the D.A.'s office among a group of detectives, Hauptmann was brought in and [*374] asked to repeat the words, "Hey, doctor. Here, doctor, over here." n5 Lindbergh told the prosecutor that he recognized Hauptmann's voice as that of the kidnapper. n6
At trial, Lindbergh recounted the events at the cemetery. n7 He then testified:
Q. Whose voice was it, Colonel, that you heard in the vicinity of St. Raymond's Cemetery that night, saying, "Hey, Doctor?"
A. That was Hauptmann's voice.
Q. You heard it again the second time where?
A. At District Attorney Foley's office in New York, in the Bronx. n8
Lindbergh's lawyer commented: "The minute Lindbergh 'pointed his finger' at Hauptmann, the trial was over. 'Jesus Christ' himself said he was convinced this was the man who killed his son. Who was anybody to doubt him or deny him justice?" n9
Sixty-five years later, Hauptmann's conviction and execution remain controversial. Was he truly guilty of the kidnapping, or was he wrongly convicted, perhaps in part because of the anti-German sentiment then prevalent in the United States? We will not attempt to answer the ultimate question of Hauptmann's guilt or innocence. But we will address others that the Lindbergh case raised: Are people really able to remember a voice that they have only heard once? Are three syllables enough of a sample? Isn't twenty-nine months a long time? Does the stress of the situation make memory better or worse? What effect does a foreign accent have on our ability to identify voices?
Hauptmann's attorney was in no position to present expert evidence on any of these questions. There was no relevant expertise at that time. But the case did trigger experimentation by psychologists into the question of how good we are at identifying voices. n10 Since then we have learned a great deal more. Nonetheless, the law has virtually stood still since Hauptmann's execution. Generally, people who believe they recognize a voice are simply [*375] allowed to take the stand and say so. As we will see, the American legal system is all too often wrong in its assumptions about people's ability to recognize voices. Other countries fare no better. In 1995, a Canadian appellate court exonerated Guy Paul Morin, whose conviction three years earlier of raping and murdering a young girl was largely based on an inaccurate identification of his voice. n11 Post-conviction DNA testing excluded him as the perpetrator. n12
[NOTES 1-12 pertaining to the Hauptmann Trial. See other footnotes below]
n1. Ludovic Kennedy, The Airman and the Carpenter: The Lindbergh Kidnapping and the Framing of Richard Hauptmann 266-67 (1985).
n2. John F. Condon, Jafsie Tells All! Revealing the Inside Story of the Lindbergh-Hauptmann Case 149 (1936).
n3. Jim Fisher, The Lindbergh Case 248 (1987).
n4. Id.
n5. Id. at 249.
n6. Id. at 249-50.
n7. In his testimony, Lindbergh left out the words, "over here." Transcript at 109, State v. Hauptmann, 180 A. 809 (N.J. 1935) (No. 99). In fact, there is dispute over what the kidnapper actually said. Lindbergh had apparently told the grand jury that the words were, "Hey, doc." Kennedy, supra note 1, at 209. Others quote the kidnapper as having said, "Hey, Doctor! Hey, Doctor, over here." George Waller, Kidnap: The Story of the Lindbergh Case 75 (1961). We thank Ronelle Delmont for providing us with a compact disk containing the transcript.
n8. Transcript at 113-14, Hauptmann (No. 99).
n9. A. Scott Berg, Lindbergh 315 (1998).
n10. This is literally true. See Frances McGehee, The Reliability of the Identification of the Human Voice, 17 J. Gen. Psychol. 249 (1937), discussed infra note 169.
n11. See Regina v. Morin, 37 C.R. (4th) 395 (Ontario Ct. App. 1995). The voice identification aspect of the case is discussed in A. Daniel Yarmey, A. Linda Yarmey, Meagan J. Yarmey & Lisa Parliament, Commonsense Beliefs and the Identification of Familiar Voices, 15 Applied Cognitive Psychol. 283 (2001).
n12. Morin, 37 C.R. (4th) 395.
Many of the problems that people have identifying speakers from their voices are similar to those that people have as eyewitnesses. The amount of exposure, the nature of the identification process, and the number of exposures all matter in determining how likely a witness is to be correct. n13 Yet while the reliability of eyewitness identification has been a focal point in the news, the scholarly literature, and the courts, n14 the unreliability of earwitness identification has gone virtually unnoticed in the case law and legal literature. The reluctance of the legal system to deal with this problem stems from a confluence of ignorance, rigid adherence to historical positions that are no longer tenable, and some interesting judicial missteps concerning the accuracy of "voiceprints" that have made courts unreceptive to voice identification research.
Part I of this Article examines the foundation required for a voice identification to be admissible. The Supreme Court has held that a party offering eyewitness identification made under suggestive circumstances must demonstrate indicia of reliability to comply with due process standards. n15 While courts sometimes apply these same standards to earwitness identification cases, they often fail to give earwitness testimony even that much scrutiny, basing their decisions on the rules of evidence governing authentication, which require very [*376] little foundation. n16 We will argue that some of the cases require so little evidence of reliability that they are unconstitutional even under current standards. We will also point out serious risks of misidentification that call for some adjustment in the legal rules.
In Part II we summarize some of the empirical research regarding the reliability of voice identification, research that the legal system has by and large ignored. This omission is significant. For one thing, if reliability is the basis for constitutional analysis of suggestive identifications, knowing what makes an identification more or less reliable is a prerequisite for intelligent inquiry. Moreover, in establishing such a low threshold for admissibility the system leaves judgments of reliability to the jury. But unless information is presented to jurors through other actors in the legal system, whether through cross-examination, the introduction of expert testimony, or informative jury instructions, jurors will have no reasonable basis for making this judgment.
We discuss a number of factors that have an impact on reliability. Some people are good at identifying voices - others are terrible at it. Memory for voices stays with us for a few weeks, but then degrades if not reinforced. Longer exposures produce better recollection - up to a point. Repeated exposure helps a great deal. People have trouble recognizing voices that they earlier heard with a different tone of voice. People between the ages of sixteen and forty are best at recognizing voices. People are better at remembering voices that they heard live rather than on tape or over the telephone. People are generally bad at recognizing the voices of those speaking foreign languages and are not very good at recognizing the voices of those speaking their own language with a foreign accent.
Part III considers the possible role of experts in speaker identification cases. The legal system's excessive confidence in voice identification technology in the 1960s and 1970s has made the system especially suspicious of experts even several decades later. Courts had placed increasing hope on the ability of machines, called sound spectrographs, to create "voiceprints" unique to each speaker. Earlier, voiceprint analysis had been rejected by courts as unreliable. With the adoption of the Federal Rules of Evidence in 1975 came a softening in judicial resistance. At about the same time, certain prominent phoneticians who had opposed the use of sound spectrography in courts began to support it as the technology improved. Some, but not all, courts began to permit voice comparisons by spectrographic experts. Then, in 1979, the National Academy of Sciences issued an influential report, arguing that there is insufficient evidence that spectrographic analysis is reliable enough [*377] for forensic application. n17 The result was abandonment of spectrographic analysis in court proceedings in most jurisdictions, even those whose courts had approved it just a few years earlier. Nonetheless, a few courts, relying on a superficial understanding of this history, continue to allow it. The Supreme Court of Alaska has only recently permitted its use for the first time. n18
Unfortunately, this focus on spectrographic analysis has deflected serious inquiry into how good ordinary people are at identifying speakers from their voices and whether experts are any better than untrained people. A great deal has been written on these issues, especially in the past ten years, but almost none of it has made its way into American legal discourse, which associates questions of speaker identification with the voiceprint issue. To be fair, much of the basic research has been conducted outside the United States and is not readily accessible to those who do not know where to look. But it is now time for the system to consider this information and to put it to appropriate use.
As for what experts can do, there is still debate. But experts are at least somewhat better than lay people at comparing voice samples aurally, which may be especially significant in cases requiring comparison of a voice on a tape recording to that of a witness or the defendant. Although courts have been divided on the matter since the Supreme Court's 1993 Daubert n19 decision, we do not recommend that the system accept the testimony of those who hold themselves out as voiceprint experts. However, when used in limited circumstances by experts in phonetics, spectrographic information can sometimes augment aural voice analysis in ways that can be helpful to the legal system. Moreover, experts can bring the weaknesses of lay identification to the attention of judges and juries. Courts have long recognized that identification of a person by voice alone presents "grave dangers of prejudice to the suspect." n20
Finally, in Part IV, the Article concludes with a number of suggestions for improving the legal system's handling of voice identification. Among our recommendations are the adoption of non-suggestive identification procedures, the use of reliability criteria in tune with the scientific research, the admission of expert testimony to bring to the jury's attention aspects of an identification that reduce its reliability, and informative jury instructions.
[*378]
I. Legal Standards for Identifying Speakers
Lindbergh's identification of Hauptmann could not have been more suggestive. Having just sworn to a grand jury that he would not likely be able to identify the kidnapper's voice, in a one-on-one encounter Lindbergh was presented with an individual who the D.A. said was the person who killed Lindbergh's son. That person was Hauptmann. Lindbergh indeed identified Hauptmann's voice and testified at trial that he had done so. Since the Hauptmann trial, the basic evidentiary principle has remained the same: Someone who heard a speaker's voice can take the witness stand and identify that voice, subject to the opposing party's right to cross-examine.
However, during the past three decades, the Supreme Court has held that due process considerations require inquiry into the reliability of an identification when it is found to be suggestive. In Section A, we look at the application of these requirements in earwitness identification cases. In Section B, we examine a set of cases in which the voice being identified is on tape. In those cases, courts apply rules governing authentication and require only minimal familiarity with a voice for an identification to stand. As we will see, courts sometimes pay too little attention to suggestiveness and reliability issues in tape cases and occasionally even apply the minimal standards of tape cases to live identifications.
A. Due Process Requirements
(1) The Biggers Criteria
The seminal case in both eyewitness and earwitness identification is Neil v. Biggers, decided by the Supreme Court in 1972. n21 That case involved a crime victim's identification of the defendant at a "showup" - a procedure in which the police march the suspect in front of the victim and ask for identification, without the safeguard of requiring the victim to choose the defendant from among a group of people in a lineup. Approximately seven months after the crime occurred, officers showed the defendant to a rape victim at a police station, where she had an opportunity to look him over and hear him utter the words "shut up or I'll kill you." n22 Based on his appearance and voice, she testified at trial that she had "no doubt" that the defendant was her assailant. n23
The Court concentrated on the eyewitness aspect of the identification, and established a framework for evaluating claims that [*379] a defendant's right to due process was violated by a suggestive identification procedure. n24 The focus should be on the risk of a false identification:
It is the likelihood of misidentification which violates a defendant's right to due process... . Suggestive confrontations are disapproved because they increase the likelihood of misidentification, and unnecessarily suggestive ones are condemned for the further reason that the increased chance of misidentification is gratuitous. But as Stovall makes clear, the admission of evidence of a showup without more does not violate due process. n25
The Court then articulated criteria for evaluating the likelihood of misidentification:
We turn, then, to the central question, whether under the "totality of the circumstances" the identification was reliable even though the confrontation procedure was suggestive. As indicated by our cases, the factors to be considered in evaluating the likelihood of misidentification include the opportunity of the witness to view the criminal at the time of the crime, the witness' degree of attention, the accuracy of the witness' prior description of the criminal, the level of certainty demonstrated by the witness at the confrontation, and the length of time between the crime and the confrontation. n26
Using these criteria, the Court held that the rape victim's identification of the defendant as her assailant was good enough to pass muster. n27 It noted that the defendant's appearance was consistent with a description she had given police shortly after the crime occurred. n28 Moreover, she had previously been shown several other suspects and had failed to single out any one of them as the rapist. n29 The Court was also impressed with her level of confidence in the identification. n30
In a subsequent case, Manson v. Brathwaite, the Court elaborated on the decision in Biggers. n31 It held that "the corrupting effect" of suggestive procedures must be balanced against indicia of reliability, which is "the linchpin in determining the admissibility of identification testimony." n32 If the identification is reliable, then it [*380] should be allowed notwithstanding improperly suggestive procedures. n33 This creates a two-step analysis in cases of this sort. First, a court should ask whether the identification was suggestive, and second, if it was suggestive, whether it was nonetheless reliable under the criteria set forth in Biggers.
Although Biggers and Manson both concentrated on eyewitness identification, it is worth bearing in mind that the victim in Biggers was exposed not only to the defendant's appearance, but also to his voice. There is no reason for courts to limit the holdings of these cases to eyewitness identification, and they do not. As we will see below, some courts have applied Biggers and Manson to voice identification, and no court has said that the two-step analysis should not apply when auditory identification is in issue. Thus, Biggers and Manson set the constitutional standard for admitting voice identification evidence in earwitness cases.
(2) Due Process Analysis in Voice Identification Cases
The initial question in a Biggers/Manson analysis is whether voice identification procedures were overly suggestive. If the procedure is held not to have been suggestive, courts generally do not apply the due process analysis and must decide whether to admit the identification under ordinary rules of evidence. The threshold for admission of a voice identification under these rules is very low. n34
The surest way to guarantee that voice identifications are not excessively suggestive is to use an appropriately constituted voice lineup. Consider the description by the Supreme Judicial Court of Massachusetts of a permissible procedure using a five-voice lineup:
After consulting with the office of the district attorney, the police used a voice identification procedure that adequately protected the defendant's rights. There was no one-on-one confrontation between the victim and the defendant. The victim could not see the participants during the procedure, nor could they see her. The defendant selected the order in which he would read. The participants read the same innocuous passage from a fifth-grade reader. Defense counsel attended the procedure and, although consulted, never objected to it. In addition, we have viewed a videotape of the voice identification procedure, and conclude that the procedure was not impermissibly suggestive. The defendant's voice did not stand out because of his age, nor did any other aspect of the procedure direct undue attention to the defendant's voice. Hence, we conclude that the judge properly denied the defendant's motion to suppress the voice identification. n35
[*381] We do not endorse the Massachusetts procedure as flawless. For example, there probably should be more than five voices in a lineup. n36 But the court was clearly making a reasonable effort to ensure that fair procedures would be used.
In contrast, one-on-one voice identifications are almost inherently suggestive. For example, in Yeatman v. Inland Property Management, Inc., a federal district court rejected an identification when "only one tape containing only one female voice was played." n37 Moreover, the witness "already knew the critical need to give an affirmative answer to the question that she was being asked. And no opportunity was given to [the opposing parties] to participate in or to monitor the procedure." n38 The court likened the identification process to a "card trick" where "a magician forces on the person chosen from the audience the card that the magician intends the person to select, and then the magician purports to 'divine' the card that the person has chosen." n39 Similarly, State v. Johnson, a New Jersey case, involved a woman who was raped by a man whose face she could not see, but whose voice she heard. n40 She was asked to come to the police station, and through an open door she heard the voice of a suspect talking to a detective. n41 After some initial hesitation, she identified the suspect as her assailant. n42 On appeal, the court noted that the constitutional safeguards that apply to visual identifications "are equally applicable to identification of a voice through auditory senses," particularly because the risks of misidentification "are even more apparent where the identification is by voice alone." n43 It concluded that the identification procedure was sufficiently suggestive to require a Biggers analysis of reliability. n44
In a Florida case, a man forced a woman off the road with his van. n45 She never got a good look at his face but heard him say, "lady, I'm going to rape you and kill you." n46 Thirty-two days later she was presented with a short tape recording of a detective interviewing a suspect, who mainly said nothing beyond invoking his right to silence. n47 She identified him as her assailant. n48 Later she attended a [*382] court session in which the same suspect was arguing that his bond should be reduced, and she again identified his voice as that of her assailant. n49 Because the victim was presented with only one possible voice in each situation, the appellate court held the procedure overly suggestive. n50
Likewise, a Massachusetts court held that requiring a suspect to utter the words of the perpetrator in open court, followed by the victim's identifying the voice as that of the defendant, was impermissibly suggestive. n51 In addition, a Pennsylvania court found it excessively suggestive when a rape victim identified her assailant based on hearing his taped telephone confession. n52 Courts have also found it improper to have a rape victim overhear a suspect being interviewed at a police station and then ask her if the man was her assailant. n53 Also improper was allowing a witness to see the defendant during the voice identification process, casting doubt as to whether it was really the defendant's voice that the witness was identifying. n54
Other courts have accepted some questionable identification procedures. One Connecticut case held that a lineup consisting of just two voices was not overly suggestive. n55 The same result was reached in a Louisiana case when a defendant's voice was one of three in a voice lineup. n56
The second part of the two-step Biggers/Manson process is to apply the Biggers criteria to determine whether a suggestive voice identification was nonetheless sufficiently reliable.
For example, in Commonwealth v. Marini, the Massachusetts court, having found identification procedures unduly suggestive, went on to find the identification unreliable when nine months had elapsed between the crime and the identification. n57
[*383] Most of the time, however, courts find adequate reliability. United States v. Duran n58 is typical. There, the Ninth Circuit affirmed a conviction for bank robbery. n59 Key evidence consisted of the tellers' identification of the defendant's voice at trial. n60 In response to the defendant's argument that the identification was excessively suggestive, the Court applied the Biggers factors, which the Ninth Circuit had adopted for in-court eyewitness identifications. n61 The Court concluded:
Again, both tellers had ample opportunity to listen to Duran's voice during the robbery. Duran ordered the tellers to raise their hands and demanded money. He ordered [a] teller to escort him into the vault and to open it up. Inside the vault, Duran continued to holler at [the teller], demanding the keys to the vault, telling her to hurry, and asking where all the money was. He ordered [the teller] back to the teller area and demanded the keys to the remaining cash drawers. As Duran left, he threatened everyone in the bank: "don't move or we'll kill you." Both tellers were likely very attentive during the robbery given Duran's weapon and threats, as evidenced by their accurate descriptions of Duran and his distinctive voice and the fact that neither teller equivocated in her identification of Duran's voice. Moreover, the in-court identifications occurred just three months after the bank robbery. n62
This analysis contains some questionable assumptions. For one, three months may be a long time to remember a voice. For another, the court's conclusion about the degree of attention is only that the tellers were "likely attentive." n63 The opinion demonstrates a judicial recognition that reliability is an issue but does not provide much analysis of what makes an identification reliable or unreliable. n64
A more convincing case for reliability was made in United States v. Degaglia. n65 Although the Seventh Circuit did not explicitly apply the Biggers factors, it took seriously the fact that the agent identifying the defendant's voice testified that he had heard it on several occasions for periods of up to one and one-half hours and that the defendant had a very distinctive voice, which the agent described as "high pitched, raspy, and nasal." n66 Here, our intuitions are that the identification is more likely to be reliable. Courts, in fact, frequently [*384] hold identifications to be reliable when the witness testifies to having heard the voice on multiple occasions. n67 We will see that repeated exposure to a voice really does have significant impact on accuracy in experimental studies. n68 The research suggests that the inverse is also true: People do far worse identifying a voice that they have heard only once. n69
In summary, when an identification has occurred under suggestive circumstances, courts require some indication that the identification was nonetheless reliable as a condition for admissibility. Non-suggestive identifications, in contrast, are not subjected to scrutiny of their reliability. That issue is left to the trier of fact. In Part II, we will explore factors that affect the reliability of voice identification.
B. Voices on Tape
(1) Rule 901 and the Minimalist Approach
When the voice being identified is on tape, courts often do not engage in the Biggers/Manson analysis. They do not consider how suggestive the identification was and do not analyze the indicia of reliability even when the identification was suggestive. Rather, applying Rule 901 of the Federal Rules of Evidence or a similar rule, they permit a witness, often a police officer, to identify the voice and leave the question of reliability to the jury. n70
Rule 901 states:
Requirement of Authentication or Identification
(a) General provision. The requirement of authentication or identification as a condition precedent to admissibility is satisfied by evidence sufficient to support a finding that the matter in question is what its proponent claims. [*385]
(b) Illustrations. By way of illustration only, and not by way of limitation, the following are examples of authentication or identification conforming with the requirements of this rule:
(5) Voice identification. Identification of a voice, whether heard firsthand or through mechanical or electronic transmission or recording, by opinion based upon hearing the voice at any time under circumstances connecting it with the alleged speaker. n71
The advisory committee notes accompanying the rule make it clear that experts should generally not be part of the process: "Since aural voice identification is not a subject of expert testimony, the requisite familiarity may be acquired either before or after the particular speaking which is the subject of the identification, in this respect resembling visual identification of a person rather than identification of handwriting." n72
The rule requires only "evidence sufficient to support a finding" that the tape is what it purports to be, and just about anything is sufficient. n73 The following statement by the United States Court of Appeals for the Ninth Circuit is typical:
Rule 901(b)(5) establishes a low threshold for voice identifications offered to determine the admissibility of recorded conversations. So long as the identifying witness is "minimally familiar" with the voice he identifies, Rule 901(b) is satisfied. The record reveals that Speziale was present in Anchorage at Plunk's initial post-arrest interview. The familiarity that he gained through that exposure was sufficient under Rule 901(b)(5) to support his identification of Plunk's voice on the tape recorded conversations being offered into evidence. n74
The court took it as a given that the witness had gained sufficient familiarity with the suspect's voice to identify it, despite his obviously very limited exposure. In many, if not most cases, this "minimally familiar" approach of Rule 901 does not appear to create significant risk of misidentification. For one thing, the identification often does have substantial indicia of reliability. For example, sometimes the person identifying the voice actually participated in the tape-recorded conversation. n75 In other cases, the identifying witness almost certainly [*386] was familiar enough with the voice to identify it correctly. Experience and research n76 support the intuitions of judges that people typically can identify the voice of a close relative n77 or that someone who has heard a voice fifty or sixty times is likely to recognize it if he hears it on a tape. n78
Moreover, the circumstances under which the tape was made usually provide ample evidence of reliability to reduce concerns about possible due process violations. When an officer investigating the defendant's conduct has recorded a wiretapped conversation between two people, and one of the speakers used the wiretapped phone at the defendant's residence, responded to being called by the defendant's name, and said the kinds of things the defendant often said, the defendant was most likely one of the speakers. n79 Thus, when the circumstantial evidence is robust, the risk of error is rather low, even if the identifying witness's only experience with the voice was when he heard the defendant speak six months earlier at the defendant's arrest. n80
Perhaps the most compelling circumstance is the existence of the tape itself. Of course, it is possible to misidentify a voice on a tape. But the tape limits the range of plausible identifications, and its very existence provides the defendant with the opportunity to dispute the identification through witnesses who testify to the contrary. Furthermore, if a police officer identifies the speaker based on minimal exposure to his voice, the identification obviates the need to find and subpoena neighbors, former teachers, or others to do the same thing. If it does not add significantly to the likelihood of misidentification (an issue to which we return below), then the pro forma authentication is efficient and not unfair.
In fact, courts have long permitted voice identification solely by circumstantial evidence when the voice is on the telephone at a number that the caller contacted. In one 1924 case decided by the [*387] First Circuit, three defendants appealed their mail fraud convictions. n81 Part of the evidence involved telephone confirmations of dishonest stock sales. n82 Regarding the identification of the voices on the telephone as belonging to the defendants, the court held:
The gist of the problem is whether there was sufficient identification of the persons in the Boston office calling the witnesses, or later responding to the calls of the witnesses, so as to justify the court in admitting the evidence. Plainly, recognition of the voice is not the only means of identification. Circumstantial evidence may be as persuasive as testimony that the voice is recognized. n83
Since the adoption of the Federal Rules of Evidence in 1975, n84 federal courts have continued to permit identification of voices on the telephone by circumstantial evidence alone. n85 Occasionally, courts do not even require that a taped voice be authenticated by someone with knowledge of the voice, relying on telephone identification cases, and not on Rule 901(b)(5). n86
Professors Risinger, Denbeaux, and Saks point to similar issues in the law governing the identification of documents. n87 While the most compelling reason to consider a document as authentic often involves the circumstances in which it was discovered (e.g., in the defendant's desk drawer), the system also requires some formal identification of the defendant as the document's author. n88 The acceptance of handwriting experts grew out of the need to provide an identifying witness when there was no other witness available. n89 To the extent that courts treat the personal identification as a formality, relying principally on the circumstantial evidence, the issues raised with respect to speaker identification mimic those that arise in the authentication of documents. n90
[*388] When defendants raise objections to intuitively unreliable identifications of voices on tape, courts typically admit the evidence anyway, leaving it to the defense to attack its validity at trial. In essence, once a tape is admitted, its contents are considered conditionally relevant, n91 depending on whether the jury ultimately decides that it really is the defendant's voice on the tape. Below is an excerpt from a recent Sixth Circuit case, United States v. Knox. n92 The officer identifying the defendant's voice had heard it only once some three years before his testimony at trial:
Special Agent Collins's testimony that he recognized Sam's voice as that of Knox based on a conversation some three years earlier was viewed with some skepticism by the district court. There is no question, however, that this testimony is adequate to authenticate Sam's voice as that of Knox for purposes of admissibility in conformance with Fed. R. Evid. 901. All that is required under the rule is that the identifier, Collins, have heard the voice of the alleged speaker, Knox, "at any time." If the district court meant to set a stricter standard - about which point we are, concededly, uncertain - it abused its discretion. Of course, it will certainly be open to the defendant to argue to the jury that Collins is simply wrong, and that it is improbable that Collins could remember in 1996 a voice he heard in 1993. This is a question of the weight of the evidence, however, not of its admissibility. n93
Even on its own terms, the Sixth Circuit's opinion reveals the need for greater linguistic sophistication in voice identification cases. The court invites defendants to argue that the witness identifying the voice "is simply wrong, and that it is improbable" that he could remember a voice that he had heard three years earlier. n94 But what makes it improbable? If people typically remember voices that well, then it is not improbable. If they don't, then it is improbable. This is, in other words, an issue that can and should be informed by scientific research. As an initial matter, it will be up to defense attorneys to educate themselves sufficiently to raise these issues, whether in pretrial motions or at trial.
Likewise, the Fifth Circuit affirmed a conviction that rested in part on the testimony of an FBI agent who identified the voice on a tape as the defendant's. n95 It did not matter that the FBI "used regular agents to make the identifications rather than using the voice identification experts it has on staff," or that "the prosecution [*389] admitted it had misidentified some of the parties on these phone calls." n96
Applying Rule 901, the Tenth Circuit has also required very little evidence to support voice identifications, n97 and the Second Circuit has specifically rejected what it called a "rigid standard" regarding the admissibility of tape-recorded evidence, noting that authentication merely makes the tapes admissible, leaving the issue of reliability to the jury. n98 Similarly, an appellate court in New York state was recently satisfied when an officer identified a voice on a tape based on "a lengthy conversation he had with defendant on the day of the arrest" some fifteen months earlier. n99
The Seventh Circuit seems to apply a somewhat stronger standard of reliability. For a tape recording to be admissible, the government must establish by clear and convincing evidence that it is a "true, accurate, and authentic recording of the conversation, at a given time, between the parties involved." n100 Yet what appears to be a higher burden of proof largely dissipates in the great deference that appellate courts give to the trial judge's decision. Such a decision is not overturned on appeal absent "extraordinary circumstances." n101
Consider, for example, the Seventh Circuit's holding that a two-hour conversation that a police officer had in English with a defendant some two years earlier was enough to permit the officer to identify a voice on a tape as being that of the defendant. n102 Although reciting the "clear and convincing" standard, n103 the court held that "questions concerning the length of Officer Johnson's previous contact with Vega or the time between this contact and the identification, simply go to the weight the jury accords this evidence, not to its admissibility." n104 It made no difference that the voice on the tape was speaking Spanish. Thus, even jurisdictions that appear to have a somewhat higher standard turn out, in reality, to apply the rule similarly.
[*390]
(2) Some Problems with the Minimalist Approach of Rule 901
The Supreme Court has held that the key issue for purposes of due process analysis is the likelihood of misidentification. n105 In many cases involving tapes, the identification procedure is suggestive, and the witness's exposure to the voice is so minimal and, at times, so long ago, that the identification cannot seriously be considered reliable. n106 In these cases, only the circumstances taken as a whole can provide the indicia of reliability necessary to meet due process considerations. When those indicia are absent, courts should take a closer look, notwithstanding Rule 901's minimal standards.
Most opinions do not describe the procedure by which the witness identified the defendant's voice. Yet it is only reasonable to infer from these cases that the procedure is typically highly suggestive. Much of the time, it appears that the prosecutor contacts an officer who heard the defendant's voice, perhaps during an earlier arrest, and asks him to identify it on a tape. n107 If it were otherwise, the many opinions that reject challenges to authentications would highlight additional facts that show the identification to be non-suggestive. Moreover, if courts do not require more, why should the prosecutor risk being unable to authenticate the tape by making the identification procedure harder for the police officer than the law requires?
Most courts have not applied Biggers/Manson criteria to the identification of voices on tape. The few that have done so have invariably found no due process violation, even when the suggestiveness of the identification rings out. Consider United States v. Zambrana, where a government agent identified the defendant's voice as being on a tape recording and the court admitted it under Rule 901. n108 The evidence of suggestiveness was particularly strong. As the agent listened to a tape, he had a transcript that listed the defendant's name in the margin. n109 Nonetheless, the court held that [*391] this was not overly suggestive because the agent then went on to identify the defendant's voice on additional tapes without a transcript. n110 Clearly the damage had been done. An initially suggestive identification can taint later ones, and that was precisely the risk in this case. n111
Similarly, in United States v. Degaglia, the defendant was recorded speaking on the telephone with a government informant. n112 Later, he was arrested by an agent, Olson, who spoke with him for around ninety minutes at that time and also on a couple of subsequent occasions. n113 At trial, Olson identified the defendant's voice as being on the tape. n114 The court rejected the defendant's contention that the identification was overly suggestive, even though Olson knew he was being called to court to identify the defendant's voice on the tape. n115 If Degaglia wanted to claim that it was not his voice on the tape, he would have to do so at trial. n116
The procedure in Degaglia was very much like a showup. Rather than being presented with a number of candidates from which to choose, the identifier was just being asked to answer "yes" or "no" to a single proposed candidate. Almost all courts, as we observed in the previous section, would regard this procedure as impermissibly suggestive in the earwitness context and would require a Biggers analysis of reliability. In contrast, courts that look at suggestiveness in the context of authentication of tapes by the police have found the procedures to be adequate. n117
At the same time, it appears that the objections that defendants raise to these identifications are almost always formal ones. The defendant claims that the identification did not meet legal standards [*392] but provides no reason to believe that the identification is not accurate, by producing alibi witnesses or witnesses who disagree with the identification, for example. The absence of such evidence tends to support the notion that Rule 901's minimal standards regarding the identification of speakers on tape recordings do not generally result in a serious risk of misidentification.
Nonetheless, there are several situations in which we believe the system to be far too casual. First, if a defendant comes forward with facts that raise concerns of a mistaken identification, then courts should apply Biggers and evaluate the potential for misidentification before admitting a suggestive identification of a taped voice based on minimal exposure to it. Otherwise, we agree with most courts that such analysis is generally not necessary as a prerequisite for admission of a tape recorded voice into evidence. This balance requires greater indicia of reliability when lax standards for admissibility create a demonstrable risk of error. The suggested procedure, we believe, is both loyal to the streamlined authentication process envisioned in Rule 901 and respectful of the due process concerns that pervade this area of law. On the one hand, courts will not be tempted to say that suggestive identifications are not suggestive, creating precedents that might affect later decisions when due process analysis really matters. On the other hand, when evidence is produced that casts doubt on a suggestive identification, the mechanical nature of Rule 901 will not override due process concerns.
Second, courts should be careful not to admit a suggestive identification of a recorded voice when the circumstantial evidence is not strong. Consider a recorded bomb threat made from a public telephone in a major city. Suppose that police arrest a malcontent whom they suspect to have made the call. Should we allow one of the arresting officers, who heard the suspect say a few words while being booked, to testify that the voice on the tape belongs to the defendant? Not only might the call have been made by any one of millions of people, but telephones transmit only a limited range of acoustic information, which is even further degraded by a tape recorder.
Third, when there is no tape recording, the Biggers/Manson analysis should clearly apply. A recent Fifth Circuit case applied only the minimalist approach of Rule 901 in a case where a police officer identified the defendant's voice as the one he had heard in a telephone surveillance. n118 No tape was made. The facts do not show whether the identification was suggestive enough to trigger application of the Biggers criteria, but the issue should have been [*393] raised and discussed. Courts should not give the impression that due process considerations are suspended when it is the police who make the identification. As noted earlier, the Second Circuit has defended Rule 901's minimalist approach to identifying voices on tape precisely because the existence of the tape reduces the likelihood of misidentification. n119 When no tape exists, the rationale disappears.
In deciding whether an identification meets due process standards, courts must determine the reliability of lay voice identification. Moreover, once the evidence is admitted, jurors must determine how reliable the identification was. As we will see, however, people are not very good at identifying voices, especially under certain circumstances that have been the subject of considerable study. And despite the tendency of many courts to leave the issue to the jury, people do not always have good intuitions about how much to rely on the accuracy of earwitness identifications. We turn to the question of reliability below.
II. Voice Recognition Research and the Reliability of Identifications
People make mistakes identifying voices even under the best of circumstances. Guy Paul Morin's DNA exoneration in Ontario is a startling recent reminder. n120 One of the earliest known cases of speaker misidentification is the trial of William Hulet, who was accused of having executed King Charles I. n121 Once the monarchy was restored under Charles II, one of its first orders of business was to prosecute for treason those involved in the regicide. The evidence against Hulet consisted almost entirely of rumor and innuendo, much of which would be excluded as hearsay today. Especially probative was testimony by Richard Gittens, who not only was a witness to the execution, but also belonged to the same regiment as Hulet did at the time. Gittens testified that he had heard the executioner, whose face was obscured, beg the king's forgiveness and that he knew that it was Hulet "by his speech." n122 Cross-examined later by Hulet himself, who asked him how Gittens knew that he (Hulet) had been on the scaffold at the time, Gittens replied, "By your voice." n123 After deliberating for more than the usual time, the jury returned to declare Hulet guilty of high treason, n124 the punishment for which was normally a quite gruesome death. This case might seem to be of little more than [*394] passing historical interest but for a footnote inserted by the editors. They report that the actual perpetrator was the ordinary hangman, who later confessed, and that the court, "being sensible of the injury done to [Hulet], procured his reprieve." n125
Rule 901(b) assumes that people are basically accurate in identifying voices and realistic in their assessments of how likely it is that an identification is correct. In this section, we examine some research that suggests these assumptions are not entirely correct. First, we look at research that addresses how good people generally are at identifying voices. We will see that people are very accurate when it comes to recognizing voices they know well but are much less so with unfamiliar ones. We will also see that issues such as the number of exposures to the voice, the delay in making the identification, the skill of the identifier, and the presence of stress or disguise, all play roles in determining whether an identification is likely to be accurate.
We then ask whether we are realistic in our estimates of how accurate identifiers are likely to be. We will see that there is little or no relationship between confidence and accuracy, but jurors are likely to take the identifier's level of confidence very seriously. We will also see that people tend to overestimate the ability of others to identify voices.
A. Factors Affecting the Reliability of Voice Identification
Researchers have uncovered a number of factors that make voice identification easier or harder. Much of this work has been conducted in Europe, Canada, and Australia and has thus been less accessible to the American legal community. n126 These researchers have found that familiarity with a voice, knowing in advance that one will later have to identify a voice, length of exposure, the language being spoken, foreign accents, length of the delay in performing the identification, and other factors play significant roles in people's ability to identify voices. Most of these factors are completely absent from any discussion in the case law. In contrast, a witness's confidence in the accuracy of the identification, which courts sometimes consider relevant, does not correlate substantially with correctness of identification. In this section, we look at empirical data that teases out many of these factors. [*395]
(1) Familiarity
Just about everybody would assume that people are better at identifying familiar voices than unfamiliar ones. The assumption is largely correct, yet questions remain. How much difference does familiarity make? Does it matter much how familiar the voice is? What is the rate of error despite familiarity?
Some of these issues have recently been studied by Daniel Yarmey and a group of his colleagues. n127 In one study, sixty-eight people agreed to participate as "speakers." n128 Each recorded a sixty-four-word passage, and then two minutes of spontaneous speech. n129 The speakers were asked for the names of friends and associates who might participate in a subsequent voice identification study. n130 The speakers also identified themselves as belonging to one of the following categories with respect to each such friend or associate:
A high familiar speaker: "An immediate family member or best friend." n131
A moderate familiar speaker: "[A] co-worker, team-mate, club-mate, or general friend." n132
A low familiar speaker: "[A] casual acquaintance, such as next door neighbor or associate, who would be expected to have talked with the listener for only a few minutes on occasion in any week over the last year." n133
The speakers were asked not to discuss the experiment with any of the people that they named. n134
For each listener, the experimenters were able to find at least one speaker who was a high familiar speaker, one who was a moderate familiar speaker, and one who was a low familiar speaker. n135 The listeners were then presented with passages from four different voices: three that varied with degree of familiarity, and also an entirely unfamiliar voice. n136 Listeners were asked to say who the speaker was, if they could, as soon as they recognized the voice. n137 They then listened to the rest of the passage and were permitted to change their minds if they thought they had initially made a [*396] mistake. n138 This experimental format should produce more correct responses than one in which subjects do not have a chance to change their minds. n139
Some of the results of this study n140 are summarized in Table 1:
Table 1
Accuracy (percent) for Identifying Voices of Varying Familiarity n141
[Table omitted; see original.]
These results are striking in several ways. First, as expected, familiarity does matter. We are pretty good at recognizing the voices of people we know well (89% correct), not as good at identifying the voices of people we know casually (66% correct), and even worse at acknowledging that we don't know a voice at all (61% correct).
In addition, many of the errors are false alarms: identifiers say they recognize a voice as belonging to a particular speaker but are wrong. Because listeners had the choice of stating that they did not recognize a voice, one would have expected the total of "don't know" answers to increase as familiarity lessens. That was not the case, however. Instead, the false alarm rate went up as familiarity went down. Moreover, false alarms account for a substantial percentage of the errors for the unfamiliar voices; no less than 36% of subjects claimed to recognize a voice they had, in fact, never heard before.
The Yarmey group's study is not unique in this finding. Earlier work by Harry Hollien and his colleagues had reached a similar conclusion. n142 Hollien's team presented subjects with recordings of familiar and unfamiliar voices and then immediately tested them by asking them whether a series of voices matched the one they had [*397] heard. n143 The study found that when a normal tone of voice was used in the recording (as opposed to a stressed or disguised voice), n144 subjects identified familiar voices with 98% accuracy whereas accuracy dropped with unfamiliar voices to only around 40%, even with almost no lapse in time between the initial exposure and the identification. n145 Contrary to what most people would expect, fewer than half of the subjects were able to identify a previously unfamiliar voice they had heard only a brief time before.
These results confirm our intuitions that people are generally good at recognizing familiar voices. Yet they show remarkably high rates of error in identifying unfamiliar voices. The assumption made by many judges, that someone familiar with a voice can correctly identify it, thus appears to be partially correct. It is true that someone who is highly familiar with a voice can correctly select it from a limited range of alternatives. However, the presumption made by many courts that a policeman who briefly hears a voice once can later identify it on tape seems quite questionable and becomes more questionable as the number of potential target voices increases. The high rate of mistaken identification of unfamiliar voices, which parallels findings regarding eyewitness identification, is especially troubling because of its potential to lead to false convictions.
(2) Amount of Exposure
From the earliest days of voice identification research, experimenters have asked how much exposure to a previously unfamiliar voice is sufficient. n146 Whether we are concerned about a rape victim identifying the voice of her attacker, or a police officer identifying the voice of the defendant as the one on the tape, the legal system routinely deals with situations in which the witness identifying a voice had only brief exposure to it.
In another set of experiments performed by Professor Yarmey, subjects participated in a telephone conversation with the experimenter. n147 The length of the conversation was either short (average 3.2 minutes), medium (average 4.3 minutes), or long (average 7.8 minutes). n148 Subjects then received a second phone call, [*398] and were asked if they could identify the voice they heard in the first call out of a lineup of six voices presented in the second call. n149 Half the subjects heard a lineup that did not contain the first voice at all. n150 The other half heard a lineup that did contain the first voice. n151 Some of the subjects received this second call immediately after the first one (immediate test), some received it two hours later (two-hour delay), and others received it two or three days later (two/three-day delay). n152
The results, once again, are not intuitively evident. First, the interval between the two calls did not produce statistically significant results, suggesting that vocal memory does not decay for two or three days, given adequate exposure. n153 Second, the length of the original exposure did matter. n154 For subjects receiving a lineup that actually contained the target voice, 24% who had a short original conversation identified it, while 48% who had a long conversation identified it. n155 Third, the rate of false alarms went up among those receiving a lineup containing the target voice when the exposure to the voice was longer (14% versus 35%), and was even higher (48%, 51%, and 44%) for all three lengths of exposure when the target was not present. n156 Consistent with what we saw in the previous section, people asked to participate in a voice identification procedure seem predisposed to identifying someone, even if that means making a mistake. n157
Other researchers have found that the number of initial exposures to a voice (not just the length) is of critical importance. Deffenbacher and his team report a study in which one group of listeners heard a sixty-second passage to which they were told to pay close attention. n158 When asked to identify that voice two weeks later out of a voice lineup containing nine voices, they were correct 29% of the time, made false alarms 14% of the time, and the rest of the time did not know. n159 A second group of subjects heard the same sixty seconds of speech, but it was divided into fifteen- to twenty-second segments and presented to them over the course of three consecutive [*399] days. n160 Their hit rate was a perfect 100%. n161 The authors concluded that a witness who hears sixty seconds of speech on one occasion is less likely to recognize the suspect's voice later than is someone who hears fifteen- to twenty-second segments on three or four separate occasions. n162
Interestingly, when subjects heard a passage only half as long, even the three-day distribution did not rescue them from poor performance. Apparently, exposure to thirty seconds is not enough to support recollection two weeks later, whether the passage is heard in its entirety or in separate segments. n163
The research thus shows that both amount and frequency of exposure are significant in identifying a previously unfamiliar voice. Consistent with our intuitions, a longer initial exposure will lead to a more reliable identification later. Less intuitively obvious are the findings that exposure of half a minute is generally too little and that frequency of exposure is also relevant. As we have seen, the law takes little account of such results. Hauptmann was convicted of murdering the Lindbergh baby based largely on Lindbergh's being exposed to the speech of the perpetrator for perhaps two or three seconds. Even today, courts sometimes allow an identification based on a very brief exposure, including in one case the single word "yes." n164
To date, there is not much research on the amount of exposure it takes to recognize familiar voices. We have all had the experience of making mistakes in recognizing familiar voices, especially on the telephone. Peter Ladefoged, an eminent phonetician who has studied the voice identification issue, and whom one would expect to be quite good at the task, has admitted that he could not even identify his own mother's voice saying "hello." n165 In fact, he also did not recognize her voice when the input was a full sentence. n166 A recent experiment by Australian researchers confirms Ladefoged's experience as typical. Based on the word "hello" alone, subjects were able to identify highly familiar voices a mere 47% to 60% of the time. n167 Increasing the length of the utterance to eight syllables resulted in 70% to 100% accuracy. n168 This area is one that is ripe for additional research.
[*400]
(3) Delay
We all know that memory deteriorates over time, but research shows that it doesn't happen linearly. It seems that we remember voices quite well for some period of time, perhaps as long as a few weeks, and then our memories fade significantly.
A simple, but elegant, experiment was published in 1937 by Frances McGehee. n169 The experiment, inspired by the Hauptmann trial, aimed to determine how well people can identify unfamiliar voices after extended periods of time. n170 In the study, students listened to a person reading a fifty-six-word passage from behind a screen. n171 The students were asked at various subsequent times whether they recognized any of five voices presented to them at the testing session. n172 The results are presented in Table 2:
Table 2
Effect of Delay on Accuracy of Identification n173

[Table omitted; see original.]
In a follow-up study conducted under somewhat different conditions, McGehee found that performance deteriorated to 48% after a two-week delay, but stayed more or less steady after that. n174
[*401] Similarly, Deffenbacher and his team found significant decay in recollection after two weeks, especially when the listener had only a single exposure to the voice. n175 They concluded that "if the initial memory strength of the voice trace is weakly enough established, then, voice identification accuracy will not be very impressive even at delay intervals briefer than those possible in forensic situations." n176
While the numbers differ somewhat from one study to another, perhaps depending on the amount and frequency of the initial exposure, the overall picture is fairly clear. In identifying unfamiliar voices, we perform much better if asked to do so immediately after hearing the voice. If there is a delay beyond that, our memories seem to remain fairly stable for a few weeks, after which performance drops off significantly. Moreover, at least after a rather brief initial period, the amount of exposure to the voice interacts with the length of the delay.
Once again, we have seen little indication that courts making evidentiary rulings take these findings into account, or that jurors do so in evaluating evidence that has been admitted. The twenty-nine-month delay in the Hauptmann case might still be acceptable today in some courts. This is not to say that courts do not consider the issue of time lapse; in fact, they almost always mention it. But they greatly underestimate the extent to which memory for voices decays over time. Consider the New York case in which a court allowed a policeman to identify a voice on tape after a time lapse of fifteen months, n177 or the Knox case, where the delay was around three years. n178 We are not claiming that it is impossible to remember a voice for that period of time, but we do believe that the legal system should take this cognitive frailty into account far more than it does.
(4) Individual Variation
Some people seem to be born musicians. They can hear a tune once and sing it exactly on key. Others are virtually tone deaf. Do we vary similarly in our ability to identify voices? The research is clear: Some people are quite good at identifying speakers from their voices, and other people are terrible at it. n179 This should not be surprising. We know from both personal experience and from experimental testing that people differ enormously in their abilities to [*402] recognize faces. n180 Why should voice recognition be different? The legal system does not recognize such differences in skill. The rules of evidence certainly do not, and we have never seen a published opinion in which this issue was raised.
Experimenters have investigated the extent of individual variation in identifying voices and have tried to determine whether it is possible to predict ability to recognize voices from other cognitive skills. In one recent study, Olaf Köster and some colleagues gave thirty subjects (twenty-two non-expert subjects and eight expert subjects) a test in auditory speech sensitivity. n181 The test required subjects to imitate the relative pitch of two sounds, the voice onset of a syllable (voicing is turned on earlier in a syllable beginning with d than with one that begins with t), rhythm, nonsense syllables, and other linguistically-relevant sounds. n182 The maximum possible score on this test was 108. n183 Scores varied from a low of fifty-five to a perfect score of 108. n184
These same subjects were also given a test of their speaker recognition ability. n185 First, they listened to a five-minute sample of a male speaking German (the entire experiment was carried out in Germany in German). n186 After a five-minute break, subjects were presented with eighteen samples of speech from each of six different male speakers with similar voices, for a total of 108 samples. n187 One of the six was the target voice, to which they had just been exposed. The only task was to indicate, for each of these 108 speech samples, whether it was uttered by the target. n188 In a perfect performance, the subject would identify all eighteen of the target's samples as the target's ("hits"), and identify none of the other ninety samples as the target's ("false alarms"). n189
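The scoring just described can be sketched in a few lines of Python. This is only an illustration of how hits and false alarms are counted in a design with 18 target samples and 90 non-target samples; the function name `score_listener` and both example listeners are hypothetical and are not taken from the study.

```python
def score_listener(said_target, is_target):
    """Return (hit_rate, false_alarm_rate) for one listener.

    said_target[i] is True if the listener judged sample i to be the target;
    is_target[i] is True if sample i really came from the target speaker.
    """
    pairs = list(zip(said_target, is_target))
    hits = sum(1 for said, truth in pairs if said and truth)
    false_alarms = sum(1 for said, truth in pairs if said and not truth)
    n_target = sum(1 for truth in is_target if truth)
    n_other = len(is_target) - n_target
    return hits / n_target, false_alarms / n_other


# Design from the text: 18 target samples, 90 samples from five other speakers.
is_target = [True] * 18 + [False] * 90

# Hypothetical listener A: a perfect performance.
perfect = score_listener(is_target, is_target)

# Hypothetical listener B: answers "target" on 9 target samples and on
# 9 non-target samples, i.e. as many false alarms as hits.
answers_b = [True] * 9 + [False] * 9 + [True] * 9 + [False] * 81
confused = score_listener(answers_b, is_target)

print(perfect)   # (1.0, 0.0)
print(confused)  # (0.5, 0.1)
```

Reporting the two rates separately matters because a listener who says "target" to everything would score 100% hits while being useless as an identifier.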
The results showed great variation in ability. While eight subjects got perfect scores, one subject made just about as many false alarms as hits. n190 Importantly, there were statistical correlations indicating that people who perform better on the auditory speech [*403] sensitivity test are likely to be better at speaker identification. n191 Yet much of the correlation comes from the fact that poor performers on the speech sensitivity test are typically not very good at speaker identification. This suggests that in future work it will be easier to predict poor identifiers than it will be to predict good identifiers. What is clear right now is that in some cases the legal system should permit defendants to inform the jury that some people are good at identifying speakers and some are bad at it, even under the optimal experimental conditions of this study. n192
When it comes to admitting tape-recorded evidence, judges sometimes seem to assume that law enforcement officers will be particularly good at this task, but little evidence supports this assumption. Interestingly, experimental evidence suggests that police officers are no better at eyewitness identification than lay witnesses. n193 Thus, the law's main assumption regarding variation in voice identification abilities is at best unproven and quite possibly wrong.
(5) Emotional State and Tone of Voice
Many crimes requiring voice identification as part of their solution happen suddenly. This is especially true of violent crimes such as rape, burglary, and robbery. The victim or other witness did not see the perpetrator but is later asked if she can identify his voice. One question we might ask is whether the stress of these experiences heightens one's perceptiveness, making it easier to identify a voice later, or whether stress has the opposite effect. Research on this issue concerning eyewitness identification shows that stress makes us worse at identifying faces, despite our intuitions to the contrary. n194 Does the same hold true for the identification of people by their voices?
In an interesting study, Saslove and Yarmey had 120 experimental subjects engage in what they were told was an experiment on clairvoyance. n195 While an experimenter was conversing with a subject, an angry, hostile voice was heard from a tape recorder in the next room for about twelve seconds. n196 The experimental [*404] subjects were subsequently n197 asked to pick the voice out of a voice lineup of five speakers. n198 All five speakers uttered the same words as the original angry voice. n199 For half of the subjects, the target voice used the same hostile tone. n200 For the other half, she used a calm voice. n201 In addition, half the subjects were told in advance that they would be asked to identify a voice, while the other half were uninformed. n202 Thus, there were thirty subjects in each of four conditions.
The results, summarized in Table 3, are dramatic:
Table 3
Number of Subjects Out of Thirty Correctly Identifying Speaker n203
[see org] Since there were only five voices, one would expect six of the thirty subjects responding to this condition to identify the target voice even if everyone were guessing. In the hardest condition, where subjects were uninformed and where the target tone of voice was different from the original, subjects performed worse than chance. This study suggests that voice identification based on short exposure under stressful conditions is likely to be inaccurate, although that issue remains controversial. n204
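The chance baseline mentioned above is simple arithmetic: thirty subjects each guessing among five voices yields six correct identifications on average. The Python sketch below reproduces that figure and, assuming independent guessing, shows how one might ask how surprising a below-chance count would be. The value `hypothetical_correct` is purely illustrative and is not a figure from Table 3.

```python
from math import comb


def expected_by_chance(n_subjects, n_voices):
    """Expected number of correct picks if every subject guesses at random."""
    return n_subjects / n_voices


def prob_at_most(k, n, p):
    """Binomial probability of k or fewer successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n_subjects, n_voices = 30, 5
print(expected_by_chance(n_subjects, n_voices))  # 6.0

# Illustrative only: probability that pure guessing yields this few correct.
hypothetical_correct = 3
print(prob_at_most(hypothetical_correct, n_subjects, 1 / n_voices))
```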
The Saslove and Yarmey experiment also suggests that certain voice qualities vary with the emotional state of the speaker. Research shows this to be the case. For instance, a voice's fundamental frequency, which relates largely to pitch, increases when we speak under stress. n205 Unfortunately, such changes are not always [*405] predictable, which means that to the extent that emotional states lead to changes in voice quality, they complicate the process of voice identification. n206 Note that in the Saslove and Yarmey experiment, the rate of correct identifications was relatively good when subjects were able to compare an angry target voice with other angry voices. In contrast, their identification levels were quite low when they were later asked to compare the originally angry voice with calm voices.
Because perpetrators of a crime are likely to be excited or angry, and the victims under stress, voice identification in these circumstances may be difficult. Yet that is precisely the condition, as Saslove and Yarmey state, that "might be considered most similar to the legal setting." n207 One case where the suspect's tone of voice made a difference was State v. Johnson, where a man was very calm and soft-spoken while raping a woman. n208 When later confronted with his voice through an open door at the police station, where he was speaking in an angry and abusive tone, she could not positively identify him. n209 When he calmed down and spoke more normally, however, she claimed to recognize his voice immediately. n210
Although the emotional state and tone of voice of the speaker are important in predicting the reliability of an identification, courts do not take these factors into consideration as such. Part of the reason may be that the Biggers factors were formulated primarily to deal with eyewitness identification and therefore do not take into account some of the specific factors relating to voice identification. However, the Biggers criteria do include the degree of attention that the witness was paying, which relates to the emotional state of the witness. Courts seem to assume that stress increases the witness's attention and thus the reliability of the identification. The extent to which a witness was paying attention certainly is relevant, but the evidence so far indicates that stress in itself undermines reliability.
(6) The Problem of Disguise
Even more troublesome for voice identification are attempts to disguise one's voice or imitate the voice of someone else. The easiest way to disguise a voice is to whisper. Many of the acoustic features [*406] that permit us to identify a speaker (like voicing) are absent when people whisper. Thus, the distinction between voiced consonants (like z) and voiceless ones (like s) largely disappears when a person whispers. As a result, the words zap and sap are difficult to distinguish when whispered.
Yarmey and his colleagues, in the same set of studies discussed earlier in connection with familiarity, had speakers record a speech sample in a whisper. n211 Recall that the experiments compared people's ability to identify voices based on their familiarity with the speaker. When the passage was whispered, highly familiar voices were identified correctly 77% of the time (versus 89% in a normal tone), moderately familiar voices 35% (versus 75%), voices with low familiarity 22% (versus 66%), and unfamiliar voices were acknowledged as such 20% (versus 61%). n212 False alarm rates were also significantly higher. n213 In short, a speaker who wishes to mask his voice by whispering has a good chance of succeeding - especially if he is not a very close friend or family member of the hearer, and even then he might succeed. Perhaps more disturbing is that independent panelists, when asked how often listeners were likely to be correct in identifying whispered voices, wildly overestimated their capacity to do so, guessing 91% for highly familiar voices, down to 74% for unfamiliar ones (versus actual success rates of 77% and 20%). n214 If jurors have similar misconceptions about this skill, it is not good news for defendants accused of having whispered an incriminating or illegal statement.
Studies have reached similar conclusions regarding other types of phonetic disguises. The Hollien group instructed speakers to mask their voices however they wished. n215 Experimental subjects were able to identify disguised familiar voices 79% of the time but could do no better than 20.7% with disguised unfamiliar voices. n216 Tactics used in some criminal contexts have similar effects. For example, Brazilian kidnappers have been reported to place a pencil between their front teeth, under the tongue, to disguise their ransom demands. This leads to complex phonetic changes in speech that make the speaker significantly more difficult to identify. n217
[*407] Imitation is an especially pernicious form of disguise. People who are good at imitating the voices of others have the power to cast suspicion on the innocent. How good are people at detecting imitators? A study conducted in Sweden examined how well people could identify in a voice lineup the voice of Carl Bildt, the former Prime Minister of Sweden, who was well known to subjects. n218 In one set of conditions, Bildt's voice was present in the lineup along with that of a good political impersonator, who imitated Bildt's voice. n219 Encouragingly, subjects almost always knew the real Bildt. n220 But when Bildt's voice was not among the choices, almost all subjects mistook the impersonator's voice for Bildt's. n221 These results suggest that a good imitation can fool people, especially when the actual voice is not present for comparison.
The emerging field of forensic phonetics is making progress in characterizing various ways in which people can mask their voices but still has not produced a systematic approach to the problem. n222 Researchers have begun to determine what features of a speaker's normal voice are likely to remain intact even when he tries to disguise it. n223 Yet disguise remains a problem, both for lay and expert identification of voices.
(7) Foreign Languages, Accents, and Other Linguistic Variables
Research has shown that eyewitnesses are generally better at identifying someone of the same race. n224 Are people similarly better at identifying speakers of their own language? In one experiment, Köster and Schiller investigated how well native speakers of Spanish and Chinese can identify a German speaker by his voice. n225 The experimenters contrasted subjects who knew German as a second language with those who knew no German at all. n226
[*408] Subjects were presented with a five-minute sample of a native speaker of German speaking in that language. n227 They were then asked to identify the speaker from a voice lineup consisting of six native speakers of German. n228 The results show that Spanish speakers who know some German performed significantly better than Spanish speakers who do not know German. n229 The same result held for the Chinese speakers. n230 These findings suggest that results of voice lineups involving speech samples in a language that the witness does not understand should, as the authors note, "be handled with caution." n231
A study by Charles P. Thompson came to a similar conclusion, while providing some further details. n232 The voices were produced by Spanish-English bilingual speakers. n233 Some samples were in Spanish, others in English, and yet others in English with a Spanish accent. n234 Each sample was presented, along with five others of the same type, to monolingual English speakers who had been familiarized with the target voice one week before. n235 Results confirmed that English speakers are much worse at identifying someone speaking Spanish. n236 When confronted with speakers using English with a Spanish accent, the accented voices were recognized better than Spanish voices, but worse than English ones. n237
A variation on this experiment was conducted by Goggin and her colleagues. n238 This research team employed a similar methodology but used listeners who were bilingual in English and Spanish. n239 As opposed to monolingual English speakers, the English-Spanish bilinguals did not differ significantly in their ability to identify targets speaking English, Spanish, or English with a Spanish accent. n240
The fact that it is significantly more difficult to identify someone speaking an unknown language reveals that the term voice identification is actually a misnomer. If the task were simply to [*409] identify a voice, it would logically make no difference at all whether we understand the language. In fact, it does matter. The reason is that the ultimate task is to identify the speaker. The quality of the speaker's voice may be an important clue in this endeavor, but it is not the only one. We also use other linguistic variables that depend on our ability to understand what is said and how it is said. Hollien's team, for example, lists a number of speech characteristics that listeners use to identify the speaker, including dialect, unusual use of linguistic stress or affect, idiosyncratic language patterns, speech impediments, and idiosyncratic pronunciations. n241
We have the most nonacoustic (or "non-voice") information at our disposal when we hear someone speak our native language, less when the speaker is not a native speaker of our language, and none at all when the language we hear strikes us as merely a babble of meaningless sounds. What if subjects are confronted with an actual babble of meaningless sounds? Niels Schiller and colleagues presented what was essentially a long series of mamamamama to native English and German speakers, along with some English speakers who had studied German. n242 Correct identification of the speaker was low for all three groups, especially when compared to the success rate of a native speaker of a language who is asked to identify speech produced by other native speakers of that language. n243
The effect of general speech characteristics on voice identification has not been extensively studied. n244 Fortunately, some courts seem to have an intuitive notion that foreign accents may present a problem and have sometimes required expert testimony if a witness is to identify a voice by its accent. n245 Others, as we saw above, seem to consider it largely irrelevant that a speaker was speaking Spanish on one occasion and English on another. n246
[*410]
(8) Other Factors
This discussion of factors bearing on our ability to recognize voices is by no means exhaustive. For example, age plays a role in the reliability of voice identification. Adults between twenty-one and forty are better voice identifiers than are adults over forty. n247 Moreover, people's voices change as they get older. n248 In fact, that issue arose in a case decided by the United States Court of Appeals for the First Circuit. n249 In that case, the defendant's voice was recorded in 1971, but it was not until 1975 that an agent met the defendant and compared the voice on the tape to the voice he had heard. n250 The defendant argued that this four-year delay made the identification improper. The court disagreed, applied Rule 901, and stated that the delay should go to the weight that the jury gives to the evidence. n251
Nonetheless, we have attempted to touch on the most legally-relevant research on how good people are at identifying voices. The legal system relies almost entirely on its own notions of common sense and intuition and has never systematically taken this knowledge into account. Moreover, to the extent that the law purports to require some investigation into the reliability of an identification, the research reported here suggests that the Biggers criteria fall seriously short of the mark.
B. Witness Confidence
Among the criteria for predicting reliability that the Supreme Court endorsed in Biggers is "the level of certainty demonstrated by the witness at the confrontation." n252 Research shows that jurors, like judges, take statements of confidence seriously. Unfortunately, research also indicates that there is at best a limited relationship between the probability of accuracy and the degree of confidence that the witness has in the identification. If people react positively to the confidence of the identifier and confidence fails to predict accuracy, [*411] then we might expect people to overestimate the likelihood that an identification will be accurate. That is just what seems to happen.
Several researchers have studied the relationship between accuracy and confidence in connection with speaker identification. n253 For the most part, the research indicates little positive correlation. Deffenbacher et al., who set out to study the significance of the factors suggested in Biggers, conclude from their studies that the Supreme Court was probably wrong. "The safest generalization to make is that earwitness as well as eyewitness confidence are not very reliable indices of identification accuracy." n254 Yarmey's review of the literature led him to reach the same bottom line. n255
Moreover, jurors are likely to be swayed by confidence levels. A recent study by Amy Bradfield and Gary Wells shows that people pay a great deal of attention to how confident a witness is in his identification in deciding how much weight to give it. n256 This bias can lead to insufficient skepticism on the part of jurors whose job it is to assess the reliability of a witness's identification.
Finally, people seem to have an inflated sense of how good we are, as human beings, at identifying voices. Recall the study by the Yarmey team demonstrating that people are much more successful at identifying highly familiar voices than they are at identifying voices of moderate or low familiarity, or identifying unfamiliar voices. n257 In a related study discussed in the same article, another set of experimental subjects was asked to estimate how good listeners would be at identifying voices from each of the four levels of familiarity. n258 For every level of familiarity, people assume that identifiers will be more accurate than is the case. n259 This gap between perception and reality suggests that jurors may be predisposed to give too much weight to identification by voice.
These results have serious ramifications. Prosecutors must prove their cases beyond a reasonable doubt. Although jury instructions do [*412] not use numerical certainty thresholds, most people within the system, when asked, say that proof beyond a reasonable doubt requires about a 90% level of certainty. n260 Based on this experimental data, people appear to assume that under most circumstances voice identification is correct about 90% of the time; but in reality, it is significantly less reliable, especially when we are not very familiar with the voice being identified. The legal system's failure to correct this overestimation may result in some jurors wrongly concluding that the government has met its burden of proof.
Taken together, these facts tell a disturbing story. People rely on an identifier's level of confidence in judging how accurate the identification is likely to be. But that level of confidence correlates only slightly with the likelihood of accuracy. The result is that people tend to place too much credence in an identification. Again, this situation cries out for judicial safeguards. We present some possible solutions to these problems at the end of this Article.
III. Expert Voice Identification
When the issue is the admission of a tape-recorded voice, someone must determine whether the defendant's voice matches the voice on a tape. We have seen that in the typical case a witness, usually a police officer, is called to identify the voice as that of the defendant. The question we ask here is whether training in voice identification, or phonetics in general, is helpful in this task. If so, it calls into question the statement in the Advisory Committee notes to Rule 901 that "aural voice identification is not a subject of expert testimony." n261
We then turn to the question of sound spectrography, or voiceprints, a largely mechanical method that is claimed to be able to distinguish voices. If such a method is reliable, it could prove quite useful, not only in laying a foundation to admit tape-recorded evidence, but also to help prove the ultimate issue of the identity of the speaker. As we will see, voiceprints are more reliable than is sometimes suggested in the literature, but questions still remain, especially when recordings are made in settings that are less than ideal.
A final role that experts might perform is to inform courts and juries about the ability of people to make lay voice identifications. For example, it might have been helpful to Hauptmann's defense to have had an expert testify about some of the problems with [*413] Lindbergh's identification of Hauptmann's voice. Whether conveyed by an "educational expert" or by jury instructions, the type of learning discussed in this Article should make its way to the jury when a case rests in part on a questionable identification of a speaker.
A. Aural Identification by Experts
Consider a case in which the issue is whether the voice on a tape recording is the defendant's voice. The question arises in many drug and organized crime cases. As we have seen, some people are better at making this comparison than are others. n262 Here, we ask whether linguistic experts specializing in phonetics are typically better than lay people at aural identification (i.e., identification by ear as opposed to using machines). The answer seems to be yes.
Hollien and Schwartz tested people's abilities to identify voices by comparing contemporary samples with non-contemporary ones. n263 All samples were on tape. n264 There were three groups of subjects: people with no background in phonetics, experienced phoneticians, and students with some background in phonetics. n265 The results are presented below:
Table 4
Effect of Expertise on Voice Identification Skill n266
[see org] From these results, it appears that training in phonetics increases performance on identification tasks.
In another study, Schiller and Köster had six male native speakers of German record a passage that lasted about one minute. n267 For each speaker, three pieces were spliced out of the original tape, and each of those three pieces was recorded six separate times, making a total of 108 speech samples (6 x 3 x 6 = 108). n268 Twenty-seven subjects participated in the experiment. n269 Seventeen were [*414] college students, and ten were experts in phonetics. n270 For each subject, one of the original six speakers was designated as the target. n271 The subject listened five times to the target reading the entire passage. n272 Then, after a five-minute break, subjects were presented with all 108 short segments, and instructed to indicate whether the segment was uttered by the target. n273
Both expert and lay subjects did very well on this test, which was designed around ideal conditions. n274 Still, the experts performed significantly better (98% hits, 1% false alarms) than the lay subjects (92% hits, 2% false alarms). n275 The differences may not look dramatic, but the reduced error rate would certainly be important in a trial setting. n276
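To see why a few percentage points can matter, the reported rates can be converted into expected error counts. The sketch below assumes that the hit rate applies to the 18 samples from the target speaker (3 pieces x 6 recordings) and the false alarm rate to the 90 samples from the other five speakers; that split follows from the design described above, but the per-listener calculation itself is our illustration and not a figure reported in the study.

```python
N_TARGET = 18  # samples from the target speaker (3 pieces x 6 recordings)
N_OTHER = 90   # samples from the five non-target speakers


def expected_errors(hit_rate, false_alarm_rate):
    """Return (expected misses, expected false alarms) for one listener."""
    misses = N_TARGET * (1 - hit_rate)
    false_alarms = N_OTHER * false_alarm_rate
    return misses, false_alarms


expert = expected_errors(0.98, 0.01)  # rates reported for the experts
lay = expected_errors(0.92, 0.02)     # rates reported for the lay subjects

print(f"expert: {expert[0]:.2f} misses, {expert[1]:.2f} false alarms")
print(f"lay:    {lay[0]:.2f} misses, {lay[1]:.2f} false alarms")
# expert: 0.36 misses, 0.90 false alarms
# lay:    1.44 misses, 1.80 false alarms
```

On these assumptions a lay listener makes roughly two and a half times as many errors as an expert over the same 108 samples, and twice as many of the false alarms that, in a trial, would point to the wrong person.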
A second study by Köster et al. resulted in a similar finding. n277 Experts in speech or singing and non-experts participated in a voice identification task. n278 While half of the non-experts (eleven of twenty-two) were poor identifiers, only one quarter of the experts (two of eight) were poor identifiers. n279 Although the number of subjects is low, the study suggests that the use of experts may improve the accuracy of voice identification under certain forensic circumstances. n280 Significantly, these studies highlight the fact that phonetics is an independent field of scientific research, which takes seriously the need to investigate its own strengths and limitations. This internal scrutiny distinguishes it from other areas of forensic identification, such as handwriting analysis and microscopic hair [*415] analysis, which have received substantial criticism in the legal literature. n281
These results do not mean that the system should require expert identification of voices on tapes. But they do suggest that the courts should be receptive to such experts in cases where voice identification is critical, especially when the admission of the tape into evidence is based on little exposure to the voice. n282 Expert phoneticians may also be appropriate when a police officer or other witness becomes familiar with a voice specifically in order to become eligible as an authenticating witness. n283
Consider United States v. Drones, in which the Fifth Circuit recently reversed the district court's grant of the defendant's petition for habeas corpus based on ineffective assistance of counsel. n284 The case against Drones, who was convicted of various drug crimes in a Texas state court, depended heavily on a police officer's identification of his voice on a tape. n285 His attorney, however, did nothing to challenge the identification. n286 Later, after new counsel was retained, witnesses who knew Drones listened to the tape, and concluded that it was not his voice. n287 In addition, a voice identification expert opined from both aural and spectrographic analysis that the voice on the tape was not that of Drones. n288 The government also had an expert who debunked spectrographic analysis but agreed that the voice on the tape did not sound like that of the defendant. n289
The Fifth Circuit held that the original lawyer did not act irresponsibly in not pursuing the lay testimony, which may have opened up other questions about the defendant's background. n290 It also held that it was not irresponsible of the original lawyer not to pursue spectrographic comparison of Drones's voice with that on the tape, since courts have not been receptive to such analysis in recent years. n291 But the court never said why the defendant's lawyer should not have had an expert study both the tape and exemplars of the defendant's voice and offer his opinion based on aural analysis. n292 At [*416] the very least, an expert could have pointed out to the jury ways in which particular sounds differed from one sample to the other and left it to them to agree or disagree. The case is especially compelling because the government's own expert shared the defense expert's opinion n293 and because the standards for admitting the original identification are so relaxed.
In fact, courts have permitted expert phoneticians to present opinion testimony on speaker identification when the voice is on tape. n294 Their expertise is parallel to that of other experts who are permitted to assist the jury on questions of identification. For example, experts are permitted to interpret surveillance photos to point out similarities between the facial features of the defendant and the individual in the surveillance photo. n295 In one case, it was held to be reversible error not to permit the defendant to offer such testimony to the jury. n296 While these experts are not always permitted to offer opinions as to identity, they are routinely allowed to share with the jury detailed observations regarding facial shape and measurement.
At the very least, phoneticians should be permitted to point out similarities and differences between the defendant's voice and that of the person on a tape in order to make them salient to the jury. To the extent that such analysis can be enhanced by comparison of the relevant features of spectrograms, we see no reason why the experts should not be permitted to use that information as well. For example, to enhance her testimony based on aural comparison of two voices, a phonetician may want to show a jury how one speaker's [a] sound routinely appears in one area of the spectrogram, while the [a] sound on the tape that is in evidence appears elsewhere. Most phoneticians use both types of information. n297 This use of acoustic information is quite different from that used by so-called voiceprint experts, whose claims have been a matter of controversy for several decades, an issue to which we now turn.
B. Spectrographic Evidence
DNA evidence has become an important forensic tool, both for law enforcement agencies and for those who have been wrongly [*417] accused. A technology able to compare voices with such accuracy would obviously be a welcome addition to the criminal justice system. For a time, at least, there was hope that "voiceprints" could play that role. Voiceprints, or technically "sound spectrograms," are visual representations of the frequencies and amplitude of sounds as represented on a time line. In the forensic setting, spectrographic analysis involves visual comparison of the spectrogram of the questioned voice with one from a known voice, typically the voice of the defendant in a criminal trial. Most of those who conduct this kind of analysis are not phoneticians, but rather police officers and technicians who have been trained for this specific task and typically have limited backgrounds in acoustics or phonetics. The main issue is whether the methodology produces sufficiently reliable results.
The early history of this debate is both legally significant and interesting in its own right. Sound spectrography was developed in the 1940s by Bell Laboratories for teaching deaf people how to speak n298 and was quickly pursued for use in military operations during World War II. n299 Then, in 1962 Lawrence G. Kersta, of Bell Labs, published an article in Nature, making some extravagant claims about the ability to identify speakers by their voiceprints. n300 He likened this technology to fingerprints and asserted that people's voices are also unique and identifiable through visual inspection of their voiceprints. n301 The scientific community reacted skeptically. For example, the Committee on Speech Communication of the Acoustical Society of America had the following reaction:
We conclude that the available results are inadequate to establish the reliability of voice identification by spectrograms... . Procedures exist, as we have suggested, by which the reliability of voice identification methods can be evaluated. We believe that such validation is urgently needed. n302
[*418] Prominent phoneticians, including Dr. Peter Ladefoged of UCLA, went on record as opposing the use of spectrography in the courtroom as inadequately tested. n303
During the late 1960s and early 1970s, American jurisdictions were divided on the issue of spectrographic evidence. Some courts rejected the methodology as not widely enough accepted within the scientific community. n304 Typically, these jurisdictions applied the Frye test, which stated that for expert opinion testimony to be admitted, "the thing from which the deduction is made must be sufficiently established to have gained general acceptance in the particular field in which it belongs." n305 Other appellate courts reached the opposite conclusion, holding that it was within the discretion of the trial judge to admit the testimony because the technology showed indications of being sufficiently reliable. n306
At about the same time, Oscar Tosi, a Michigan State University professor, published studies that purportedly demonstrated great accuracy in the identification of speakers by visual inspection of voiceprints. n307 Tosi's design corrected the worst problems with Kersta's study. While Kersta used a closed set of possible voices and contemporaneous recording under laboratory conditions, Tosi's study compared non-contemporaneous recordings in both fixed and random contexts, which more closely simulated forensic settings. n308 Tosi's results were impressive. There were approximately 6% false alarm errors and approximately 13% false elimination errors. n309 He conjectured that the error rate could be reduced further if the examiner were permitted to answer "I don't know" when he did not reach a high level of certainty. n310
The Tosi study led to modification of Professor Ladefoged's opposition to the use of voiceprints, but his softening was hardly a ringing endorsement. Ladefoged expressed his position in a 1971 [*419] letter to the President's Science Advisor. n311 In the letter, he expressed concern that the Tosi study did not deal with the problem of people in the same community, say a gang of high school dropouts, who might have very similar speaking styles and mutually confusable voices. n312 He also expressed concern about the lack of standards governing voiceprint experts and their work. n313 The United States Court of Appeals for the D.C. Circuit summarized his position:
As Dr. Ladefoged's cited letter to the President's Science Advisor itself indicates, however, his conversion to the voiceprint identification technique has been a limited one. While Dr. Ladefoged stated that new evidence had moved him to "cautiously reconsider the possibility" of the use of spectrogram analysis in criminal trials, he went on to express a number of continuing reservations. He pointed out, for instance, that the Tosi studies did not necessarily indicate that spectrogram analysis would enjoy a comparable success rate when applied to the general populace, and indicated that voiceprint identification of females would probably be more difficult than identification of males. Dr. Ladefoged further identified problems arising from voice mimicry and from the possibility of "confusable voices," and concluded that "we do not at the moment know the probable error rate" of a spectrogram analysis technique applied to the broad populace. Thus, viewed in its entirety, Dr. Ladefoged's letter, as he himself characterized it in his testimony, simply reflects a position "of abatement of skepticism towards voiceprint," not one of complete acceptance. This position, according to his testimony, was shared by the community of scientists. n314
[*420] In an introductory text to phonetics dating from that same period, Ladefoged explained that it was his "best guess" that experts using spectrograms were wrong in about one case out of twenty, which means that it is a useful - but limited - law enforcement tool. n315 Ladefoged went on to characterize as "completely irresponsible" the assertions of witnesses in court that "the voice on the recording is that of the accused and could be that of no other speaker." n316
While the scholarly community gave the Tosi study mixed reviews, n317 it was good enough to convince some courts that voiceprint analysis was sufficiently valid for courtroom purposes. The federal and state law reports contain a number of cases in which voiceprint analysts were permitted to testify over the objection of the opposing party, typically for the prosecution in criminal trials. n318 Other courts continued to reject spectrographic evidence. The standard for admissibility under Frye was acceptance in the scientific community, and the debate under Frye was often, "whose community?" n319 In Tosi's community of supporters, voiceprint analysis was widely accepted. In the broader community of acoustic phoneticians, it was not. This difference explains, at least in part, the divergent court rulings.
Then, in 1975, the Federal Rules of Evidence were adopted. Their standard for admissibility of expert testimony, that the expert "will assist the trier of fact," n320 would seem to leave more opportunity for a court to admit spectrographic analysis through experts. In 1979, however, an influential report by the National Research Council, usually called the "Bolt Report," questioned the ability of voiceprints to produce accurate results under forensic conditions with [*421] sufficiently low rates of error. n321 The report summarized its findings in the introduction:
The Committee concludes that the technical uncertainties concerning the present practice of voice identification are so great as to require that forensic applications be approached with great caution. The Committee takes no position for or against the forensic use of the aural-visual method of voice identification, but recommends that if it is used in testimony, then the limitations of the method should be clearly and thoroughly explained to the fact finder, whether judge or jury. n322
The Committee later explained:
The degree of accuracy, and the corresponding error rates, of aural-visual voice identification vary widely from case to case, depending upon several conditions including the properties of the voices involved, the conditions under which the voice samples were made, the characteristics of the equipment used, the skill of the examiner making the judgments, and the examiner's knowledge about the case. Estimates of error rates now available pertain to only a few of the many combinations of conditions in real-life situations. These estimates do not constitute a generally adequate basis for a judicial or legislative body to use in making judgments concerning the reliability and acceptability of aural-visual voice identification in forensic applications. n323
It is important to note that the Committee did not dispute Dr. Tosi's findings. In fact, Tosi was on the Committee. n324 Rather, the report complained that findings supporting the use of voiceprints were too limited. n325 They failed to consider important real-life variables that would be necessary to draw valid conclusions about forensic use of voiceprints. n326 We make this point not to advocate for the acceptance of voiceprint analysts in the courts (we do not), but to point out that the scientific community has generally been straightforward about the abilities and limitations of voiceprint analysis.
Subsequently, Dr. Ladefoged reached similar conclusions. A Hawaii court quoted him as making the following points in 1985:
Dr. Ladefoged proposes the following safeguards: (1) two plus minutes of each speech sample; (2) a signal to noise ratio where the signal is higher by 20 decibels; (3) a frequency of 3,000 hertz or better; (4) an exemplar in the same words, the same rate, in the same way, spoken naturally and fluently; and (5) a responsible [*422] examiner. Dr. Ladefoged believes there is general acceptance given his safeguards, and he believes there is now more agreement. n327
Rarely will all these safeguards be met, making visual voiceprint analysis of limited evidentiary value. For this reason, some linguists continue to express serious doubts about the reliability of this technology in a forensic setting. Indeed, one phonetician has called it "a fraud being perpetrated upon the American public and the Courts of the United States." n328
Surprisingly enough, throughout the 1980s and early 1990s the published opinions, albeit in smaller numbers overall than before, continued to be split on the issue. Decisions to admit voiceprint evidence were reached during that period by the United States Courts of Appeals for the Sixth and Seventh Circuits, a federal district court in Hawaii, the supreme courts of Ohio, Maine, and Rhode Island, and a lower court in New York. n329 But during roughly the same time, voiceprints were held inadmissible by the high courts of Arizona, Colorado, Indiana, Louisiana, and New Jersey. n330 Clearly, the courts are seriously divided.
This disagreement has not abated, despite significant legal developments over the past decade. In 1993, the Supreme Court clarified the standard for admissibility of expert testimony under Rule 702 of the Federal Rules of Evidence when it decided Daubert v. Merrell Dow Pharmaceuticals, Inc. n331 The issue in Daubert, a products liability case, was whether Bendectin, an anti-nausea drug taken during pregnancy, had caused the plaintiff's birth defects. n332 The epidemiological literature suggested that it did not. n333 The plaintiff in Daubert wanted to call experts who would attack the inferences drawn from the data in the published literature and bring to bear the results of animal studies. n334 The trial court had rejected the experts on the grounds that their work had not been published and therefore failed to meet the standards of scientific reliability that the courts had developed under Frye. n335 It thus granted summary judgment to the [*423] defendant, Merrell Dow. n336 The court of appeals affirmed the trial court's decision. n337
The Supreme Court reversed, holding that the Federal Rules of Evidence had replaced the Frye standard. n338 It interpreted Rule 702 as requiring courts to engage in a "preliminary assessment of whether the reasoning or methodology underlying the testimony is scientifically valid and of whether that reasoning or methodology properly can be applied to the facts in issue." n339 To be "scientifically valid" the proffered evidence need not be uncontroversially accepted in the scientific community. n340 Rather, "the adjective 'scientific' implies a grounding in the methods and procedures of science." n341
The Court did not attempt to state conditions that are both necessary and sufficient for evidence to be "scientifically valid." n342 It did suggest, however, four non-exclusive indicia: whether the theory offered has been tested; whether it has been subjected to peer review and publication; the known rate of error; and whether the theory is generally accepted in the scientific community. n343 In a subsequent case, Kumho Tire Co. v. Carmichael, the Supreme Court made it clear that the Daubert criteria are to be applied to experts purporting to testify based on their experience, as well as to experts purporting to rely on scientific advancements. n344 Since then, Rule 702 has been amended to incorporate these holdings. n345 Undeniably, these principles govern the admissibility of voiceprint testimony. n346
No federal courts have ruled on the admissibility of voiceprints since Daubert. However, one state court, the Supreme Court of Alaska, ruled voiceprint testimony admissible under Daubert in 1999, while the United States Court of Appeals for the Fifth Circuit [*424] expressed a great deal of skepticism about the technology in 2000. It is instructive to compare the two cases.
As the Alaskan court noted in State v. Coon, some published reports support the use of voiceprints in court. n347 Because the philosophy of the Federal Rules, under which Daubert was decided, is to deal with controversy by presenting both sides of an argument (rather than by excluding evidence altogether), the court placed as much or more emphasis on studies sponsored by police officials that advocate for the use of voiceprints as it did on publications from the independent scientific community. n348
Compare Coon to United States v. Drones, n349 the Fifth Circuit habeas corpus case discussed earlier in connection with expert testimony of aural voice comparison. In that case, a defendant convicted of drug offenses in Texas argued that his lawyer had not effectively represented him at trial because he failed to hire a forensic phonetician to compare the voice on a tape to his voice. n350 In support of his motion, Drones enlisted the help of an expert named Steve Cain, the voiceprint expert whose testimony was allowed by the Supreme Court of Alaska in Coon. n351 Cain "reached a finding of 'probable elimination,' meaning that at least 80% of the comparable words in the samples were dissimilar aurally and spectrographically." n352 In response, the government called Bruce Koenig, a former FBI employee, who had been one of the early developers of sound spectrography. n353 Koenig testified that "almost nobody" in the relevant scientific community uses spectrographic voice identification because there is no theoretical basis for the proposition that an individual's voice is truly unique and identifiable. n354
In reversing the lower court's granting of the habeas corpus petition, the appellate court characterized spectrography as "a dwindling science," not widely accepted in the scientific community. n355 It quoted Koenig's testimony to the effect that the number of practitioners of forensic voiceprint analysis had dwindled from about [*425] fifty or sixty to roughly a dozen, as a result of judicial skepticism of the methodology's scientific basis. n356
While the Fifth Circuit mentioned the Alaskan decision in Coon, it left out the most significant irony: The scientific evidence upon which the Alaska court permitted Cain to testify was a 1986 article that Koenig had published in the FBI Crime Lab Digest. n357 In fact, a close reading of Koenig's article suggests that the Alaska court misstated Koenig's position. In the article, Koenig indeed said that the rate of error in the FBI's use of voiceprints was extraordinarily low (0.31% for false identifications and 0.53% for false eliminations). n358 However, he also said that "meaningful decisions were only made in 34.8% of the requested comparisons." n359 Koenig concluded:
Spectrographic voice comparison is a relatively accurate, but not positive technique for comparing an unknown voice sample with known verbatim voice exemplars. Present use of the technique is limited to a relatively small number of examiners who confront legal barriers to acceptance, limitations in accuracy and no universally recognized examiner qualifications and examination criteria. Its forensic future may shift to testimony where the judge and jury are advised of the technique's probable accuracy or to nontestimoni