A Review of the Cocktail Party Effect

Barry Arons
MIT Media Lab
20 Ames Street, E15-353
Cambridge MA 02139
[email protected]

Abstract

The "cocktail party effect"--the ability to focus one's listening attention on a single talker among a cacophony of conversations and background noise--has been recognized for some time. This specialized listening ability may be due to characteristics of the human speech production system, the auditory system, or high-level perceptual and language processing. This paper reviews the literature on what is known about the effect, from the original technical descriptions through current research in the areas of auditory streams and spatial display systems. The underlying goal of the paper is to analyze the components of this effect to uncover relevant attributes of the speech production and perception chain that could be exploited in future speech communication systems. The motivation is to build a system that can simultaneously present multiple streams of speech information such that a user can focus on one stream, yet easily shift attention to the others. A set of speech applications and user interfaces that take advantage of the ability to computationally simulate the cocktail party effect is also considered.

Introduction

"One of the most striking facts about our ears is that we have two of them--and yet we hear one acoustic world; only one voice per speaker." [CT54]

This paper investigates aspects of selective attention in the auditory system--under what conditions can a listener attend to one of several competing messages? Humans are adept at listening to one voice in the midst of other conversations and noise, but not all the mechanisms for this process are completely understood. This attentional ability has been colloquially termed the cocktail party effect [Han89].

The phenomenon can be viewed in many ways. From a listener's point of view, the task is intuitive and simple. From a psychological or physiological perspective, there is a vast and complex array of evidence that has been pieced together to explain the effect--there are many interactions between the signal, the auditory system, and the central nervous system. Acoustically, the problem is akin to separating out a single talker's speech from a spectrogram containing signals from several speakers under noisy conditions. Even an expert spectrogram reader would find this task impossible [Bre90].

Most of the evidence presented here has been obtained from perceptual experiments performed over the last 40-odd years. Unfortunately, such perceptual evidence is often not as quantifiable as, for example, the physical resonances of the vocal tract. The bulk of the ideas and experimental results presented are therefore qualitative, and an "exact" solution to the cocktail party problem cannot be found. While the focus of the paper is on voice signals and speech communication, note that much of the low-level perceptual evidence is based on experiments using simple stimuli, such as clicks, pure tones, or noise.

The Separation of Speech Channels

The cocktail party effect can be analyzed as two related, but different, problems. The primary problem of interest has traditionally been that of recognition: how do humans segregate speech sounds, and is it possible to build a machine to do the task? What cues in the signal are important for separating one voice from other conversations and background noise? Can, and should, a machine use the same cues for the task, or can it use other acoustical evidence that humans are not efficient at detecting?
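The difficulty of the recognition problem can be made concrete by looking at the raw acoustic evidence. The following minimal sketch (not part of the original paper; the file names and parameters are illustrative assumptions) mixes two hypothetical talker recordings and computes the mixture's spectrogram with NumPy and SciPy. The voices simply add, and nothing in the time-frequency representation labels which talker produced which energy.

import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

# Hypothetical mono recordings of two talkers, sampled at the same rate.
rate_a, talker_a = wavfile.read("talker_a.wav")
rate_b, talker_b = wavfile.read("talker_b.wav")
assert rate_a == rate_b

# The acoustic mixture at a single ear or microphone is just the sum.
n = min(len(talker_a), len(talker_b))
mixture = talker_a[:n].astype(np.float64) + talker_b[:n].astype(np.float64)

# Short-time Fourier transform: rows are frequencies, columns are time frames.
freqs, times, spec = stft(mixture, fs=rate_a, nperseg=512)
magnitude = np.abs(spec)

# The mixture spectrogram is roughly the sum of the individual spectrograms;
# the two talkers' energy overlaps freely in time and frequency, which is why
# "reading out" one voice from it is so hard.
print(magnitude.shape)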
The inverse problem is the synthesis of cues that can be used to enhance a listener's ability to separate one voice from another in an interactive speech system. In a user interface it may be desirable to present multiple digitized speech recordings simultaneously, providing browsing capabilities while circumventing the time bottleneck that the serial nature of audio imposes on speech communication [Aro91, SA89]. Synthesis of perceptual cues by a machine for human listeners might allow an application to perceptually nudge the user, making it easier to attend to a particular voice, or suggest that a new voice come into focus.

Early Work

Much of the early work in this area can be traced to problems faced by air traffic controllers in the early 1950s. At that time, controllers received messages from pilots over loudspeakers in the control tower. Hearing the intermixed voices of many pilots over a single loudspeaker made the controller's task very difficult [KS83].

Recognition of Speech With One and Two Ears

In 1953, Cherry reported on objective experiments performed at MIT on the recognition of messages received by one and two ears [Che53]. This appears to be the first technical work that directly addresses what the author termed the "cocktail party problem." Cherry proposed a few factors that may ease the task of designing a "filter" that could separate voices:

- The voices come from different directions
- Lip-reading, gestures, and the like
- Different speaking voices, mean pitches, mean speeds, male vs. female, and so forth
- Different accents
- Transition probabilities (based on subject matter, voice dynamics, syntax . . .)

All factors, except for the last, can be removed by recording two messages from the same talker on magnetic tape. The author stated that "the result is a babel, but nevertheless the messages may be separated." In a Shannonesque analysis, Cherry suggested that humans have a vast memory of transition probabilities that make it easy for us to predict word sequences [SW63].

A series of experiments was performed that involved the "shadowing" of recordings: the subject repeated words after hearing them from a tape recording. The contents of the recordings were often related, and in the same style, such as adjacent paragraphs selected from the same book. Recognition was often in phrases, and the subjects found the task very difficult, even though the recordings could be repeated an unlimited number of times. In no case were any long phrases (more than 2-3 words) incorrectly identified, and the errors made were typically syntactically correct. In a slight variant of the setup, the subject was allowed to make notes with a pencil and paper. This long-term memory aid made the task much easier, and the time required to perform it was shortened--the messages were almost entirely separated by the subject.

In a similar experiment, the spoken phrases were composed of strings of clichés strung together with simple conjunctions and pronouns (footnote-1). These artificially constructed "highly probable" phrases were nearly impossible to separate. Because the transition probabilities between phrases were low, the subject would select phrases equally from the two speech streams.
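Cherry's "Shannonesque" point about transition probabilities can be illustrated with a toy bigram model. The sketch below is not from the paper; the corpus and the resulting probabilities are made up purely for illustration. A listener who has internalized word-to-word transition probabilities can judge which continuation plausibly belongs to the message being tracked, whereas the cliché-chain stimuli made every continuation roughly equally (im)probable.

from collections import defaultdict

def bigram_probs(corpus_words):
    """Estimate P(next_word | word) from a list of words."""
    counts = defaultdict(lambda: defaultdict(int))
    for w1, w2 in zip(corpus_words, corpus_words[1:]):
        counts[w1][w2] += 1
    probs = {}
    for w1, following in counts.items():
        total = sum(following.values())
        probs[w1] = {w2: c / total for w2, c in following.items()}
    return probs

# Hypothetical training text standing in for a listener's language experience.
corpus = "the air traffic controller heard the pilot call the tower".split()
model = bigram_probs(corpus)

# High-probability continuations are easy to predict and to track; when all
# transitions are equally unlikely, as in the cliché experiment, the model
# (like the listener) has no basis for staying with one stream.
print(model["the"])   # roughly equal probabilities for 'air', 'pilot', 'tower'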
Subjects also listened to different spoken messages presented to each ear over headphones. In this configuration there is no directionality; there is simply a dichotic signal. The subjects had no difficulty in listening to the message played to one ear while rejecting sounds in the other ear, and the recognition process could easily be switched to either ear at will. The subject could readily shadow one message while listening, though with a slight delay. Norman states that "the longer the lag, the greater advantage that can be taken of the structure of the language" [Nor76]. Note that the subject's voice is usually monotonic, and they typically have little idea of the content of the message in the attended-to ear.

Virtually nothing can be recalled about the content of the message presented to the other (rejected) ear, except that sounds were occurring:

    This is what might be called the "what-did-you-say" phenomenon. Often when someone to whom you were not "listening" asks you a question, your first reaction is to say, "uh, what did you say?" But then, before the question is repeated, you can dredge it up yourself from memory. When this experiment was actually tried in my laboratory, the results agreed with our intuitions: there is a temporary memory for items to which we are not attending, but as Cherry, James, and Moray point out, no long-term memory. ([Nor76], page 22)

In follow-up experiments, the language of the signal in the rejected ear was switched to German (spoken by an English speaker), but the subjects did not notice the change. Changes from a male to a female speaker were usually identified, and a change to a pure tone was always identified. Reversed speech, such as a tape played backwards (having the same spectrum as the original signal, but no semantic content), was identified as having "something queer about it" by a few listeners, but was thought to be normal speech by others. In summary, the broad statistical properties of the signal in the rejected ear were recognized, but details such as language, individual words, and semantic content went unnoticed.

In an interesting variant of these studies, the same recording was played to both ears with a variable delay between the ears. The experiment proceeded as above, with the subject shadowing one recording. The time delay was slowly decreased until, at the point when the recordings were within 2-6 seconds of each other, the subject would exclaim something like "my other ear is getting the same thing." Nearly all the subjects reported that at some point they had recognized that words or phrases in the rejected ear were the same as those in the attended ear. This result is surprising in light of the previous tests, in which the subjects were unable to identify even a single word in the rejected ear.

By switching one message periodically between the ears, the time interval needed to transfer attention between the ears was determined. For most subjects this interval was about 170 ms. A further study investigated this in more detail, defining the average "word recognition delay" [CT54]. Note that this delay reflects the entire complex hearing process, and is not due solely to the sensory system.
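The dichotic stimuli described above are easy to picture in signal terms. The sketch below is an illustration of the general idea, not the original apparatus; the sample rate, durations, and placeholder signal are assumptions. It builds two stereo stimuli with NumPy: the same message in both ears with an adjustable interaural delay, and a message that alternates between ears at a fixed interval on the order of the measured 170 ms.

import numpy as np

def delayed_copy_stimulus(message, fs, delay_s):
    """Left ear gets the message; right ear gets the same message, delayed."""
    delay = int(round(delay_s * fs))
    left = np.concatenate([message, np.zeros(delay)])
    right = np.concatenate([np.zeros(delay), message])
    return np.stack([left, right], axis=1)   # shape (samples, 2)

def switched_stimulus(message, fs, interval_s):
    """Alternate the message between left and right ears every interval_s."""
    hop = int(round(interval_s * fs))
    stereo = np.zeros((len(message), 2))
    for start in range(0, len(message), hop):
        ear = (start // hop) % 2             # 0 = left, 1 = right
        stereo[start:start + hop, ear] = message[start:start + hop]
    return stereo

fs = 16000                                   # assumed sample rate
message = np.random.randn(fs * 5)            # placeholder for a 5 s recording
dichotic = delayed_copy_stimulus(message, fs, delay_s=4.0)   # within the 2-6 s range
switched = switched_stimulus(message, fs, interval_s=0.17)   # about 170 ms per ear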