A Review of The Cocktail Party Effect, Part 2

Barry Arons
MIT Media Lab
20 Ames Street, E15-353
Cambridge MA 02139
[email protected]

Responding to One of Two Simultaneous Messages

Spieth et al. at the Navy Electronics Laboratory in San Diego performed a series of experiments investigating responses to the presentation of simultaneous messages [SCW54]. The goal of the first set of experiments was to find the conditions under which a communications operator could best recognize and attend to one speech message when it was presented simultaneously with another, irrelevant, message. Communication messages do not provide visual cues to aid in the identification of the sender or the perception of the message. While redundancy within a message is high, competing messages are of similar form, content, and vocabulary.

Several configurations were tried that presented messages over horizontally separated loudspeakers. Three loudspeakers (at -10°, 0°, and +10° azimuth) increased channel identification scores over a single loudspeaker (at 0° azimuth), and a larger separation (-90°, 0°, and +90° azimuth) improved scores further. Variants of this experiment were performed (e.g., with added visual cues, low-pass filtering of the messages, etc.), and increased horizontal separation reliably improved scores.

Filtering the two messages at 1.6 kHz, one high-pass and the other low-pass, improved the operator's ability to answer the correct message and to identify the channel. Note that the filtering did not significantly decrease the intelligibility of the messages. Both the high- and low-pass messages were made easier to attend to, and either could be separated from an unfiltered message. Spieth relates this phenomenon to Cherry's work on transition probabilities: "this suggests the possibility that anything which increases the element-to-element predictability within each of two competing messages and/or decreases the predictability from an element in one stream to a succeeding element in the other stream, will make either stream easier to listen to." Note that this fundamental theme resurfaces throughout many of the studies. The authors propose that further narrowing the frequency bands, and increasing the separation between them, would further improve the ability to listen to either stream. This is, however, limited by the point at which the bandwidth is so narrow, or the frequency so extreme, that the intelligibility of the individual messages is impaired. If two or more separation aids were used at the same time (e.g., filtering and spatial separation), scores were usually improved with respect to a single aid, but the effect was not fully additive. The authors hypothesize that the effects were not additive because of the general ease of the tasks (i.e., it was not difficult to achieve a score of 100%). A band-splitting scheme of the kind Spieth used is sketched below.
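As a concrete illustration of the band-splitting idea, here is a minimal sketch (my own, not from [SCW54]) that divides two competing messages at 1.6 kHz, one low-passed and one high-passed, before mixing them into a single channel. It assumes two mono message signals sampled at a common rate; the Butterworth filters and the filter order are arbitrary choices.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def split_band_mix(message_a, message_b, fs, crossover_hz=1600.0, order=4):
    """Low-pass one message and high-pass the other at the crossover
    frequency (1.6 kHz in [SCW54]), then mix them into one channel.
    Complementary bands reduce spectral overlap between the messages."""
    sos_lo = butter(order, crossover_hz, btype="lowpass", fs=fs, output="sos")
    sos_hi = butter(order, crossover_hz, btype="highpass", fs=fs, output="sos")
    low = sosfilt(sos_lo, message_a)   # message A keeps only the low band
    high = sosfilt(sos_hi, message_b)  # message B keeps only the high band
    return low + high

# Example with stand-in tones (in practice these would be speech signals):
fs = 16000
t = np.arange(fs) / fs
message_a = np.sin(2 * np.pi * 500 * t)    # energy below the crossover
message_b = np.sin(2 * np.pi * 3000 * t)   # energy above the crossover
mixture = split_band_mix(message_a, message_b, fs)
```

Consistent with Spieth's transition-probability account, each filtered message becomes more internally predictable, and less predictable from elements of the other band.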
Responding to Both of Two Simultaneous Messages

A related study by Webster and Thomas investigated responding to both of two overlapping messages [WT54]. As in the previous experiment, more correct identifications for sequential messages were found using six loudspeakers than one. Having a "pulldown" facility (the ability to manually switch the audio from one particular loudspeaker to a headphone or near-field loudspeaker) gave considerably better results. It was also found that the louder of two simultaneous messages was more likely to be heard correctly.

Note, however, that having multiple loudspeakers did not improve results when it was necessary to attend to two competing simultaneous messages. The ability to rapidly shift one's attention (e.g., with multiple loudspeakers) does not help if the information rate is high. Under the worst conditions (two simultaneous messages), only 60% of the information was received, but this still yields a greater total information intake per unit time than if the messages had occurred sequentially.

Selective Listening to Speech

In 1958, Broadbent summarized much of this early work, including his own experiments and those of a variety of other researchers [Bro58]. It had been experimentally established by that time that the probability of a listener correctly hearing a word varies with the probability of the word occurring in a particular context. For example, after hearing the word "bread," the subsequent occurrence of "butter" or "knife" is more likely than "eraser" or "carburetor." In 1951 it was shown that a word is less likely to be heard correctly if the listener knows it is one of many alternatives rather than one of a small number. The performance of selective listeners thus seems to vary with information as defined by communication theory, rather than with the amount of physical stimulation.

Broadbent concludes from Webster's experiments that messages containing little information can be dealt with simultaneously, while those with high information content cannot. He notes that the statement "one cannot do two tasks at once" depends on what is meant by "task." He points out that spatial separation is helpful in situations where the listener ignores one channel and responds to the other; the spatial effect is less important when the listener is dealing with two channels simultaneously. Note also that the time to shift attention is increased when two messages come from different directions, and that this may cancel out other advantages of spatial separation.

Broadbent summarizes the three main conclusions of the selective listening experiments as:

- Central nervous system factors, rather than sensory factors, are involved in message selection.
- Effects vary with the information content of the messages.
- When information must be discarded, it is not discarded at random.

If some of the information is irrelevant, it is better for it to come from a different place, to be at a different loudness, to have different frequency characteristics, or to be presented to the eye instead of the ear. When no material is to be discarded, there is little advantage in using two or more sensory channels for presenting information.

Binaural Unmasking

Our ability to detect a signal in a background masking signal is greatly improved with two ears. Under ideal conditions, binaural listening can improve the detection threshold over monaural listening by as much as 25 dB [DC78]. Consider, for example, a control condition in which a signal and noise are played to a single ear. If the signal and noise are then played simultaneously to both ears, but the phase of the noise at one ear is shifted by 180° with respect to the other ear, there is a 6 dB improvement in the detectability of the signal. This improvement over the control condition is called the binaural masking level difference (BMLD or MLD). If the noise is played identically to both ears, but the signal at the two ears is 180° out of phase, there is a 15 dB BMLD. The cocktail party effect can thus be partly explained by BMLDs.

When listening binaurally, a desired signal coming from one direction is less effectively masked by noise that originates in a different direction [Bla83]. This technique is often exploited in earphones for fighter pilots to help separate speech signals from the high noise level of the cockpit: the headphones are simply wired so that the signal presented to one ear is antiphasic (180° out of phase) with the signal presented to the other ear.
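The BMLD conditions described above are easy to construct digitally. The following sketch (my own illustration, not from [DC78]) builds the monaural control and the two antiphasic stereo conditions as numpy arrays with one column per ear; the tone frequency, levels, and duration are arbitrary.

```python
import numpy as np

fs = 16000
t = np.arange(fs) / fs                      # one second of samples
signal = 0.1 * np.sin(2 * np.pi * 500 * t)  # the tone to be detected
noise = 0.5 * np.random.randn(len(t))       # broadband masker

# Control (monaural): signal and noise presented to one ear only.
mono = np.stack([signal + noise, np.zeros_like(t)], axis=1)

# Noise antiphasic: same signal in both ears, noise inverted in one
# ear (a 180° phase shift); yields roughly a 6 dB BMLD per [DC78].
noise_antiphasic = np.stack([signal + noise, signal - noise], axis=1)

# Signal antiphasic: same noise in both ears, signal inverted in one
# ear; yields roughly a 15 dB BMLD per [DC78].
signal_antiphasic = np.stack([signal + noise, -signal + noise], axis=1)
```

Played over headphones at the same signal-to-noise ratio, the antiphasic conditions should make the tone noticeably easier to hear, in line with the BMLDs cited above.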
Auditory Scene Analysis

A great variety of research relating to the perceptual grouping of auditory stimuli into streams has been performed, and summarized, by Bregman [Bre90]. In the introduction to his book, Bregman discusses perceptual constancies in audition and how they relate to vision:

"A friend's voice has the same perceived timbre in a quiet room as at a cocktail party. Yet at the party, the set of frequency components arising from that voice is mixed at the listener's ear with frequency components from other sources. The total spectrum of energy that reaches the ear may be significantly different in different environments. To recognize the unique timbre of the voice we have to isolate the frequency components that are responsible for it from others that are present at the same time. A wrong choice of frequency components would change the perceived timbre of the voice. The fact that we can usually recognize the timbre implies that we regularly choose the right components in different contexts. Just as for visual constancies, timbre constancy will have to be explained in terms of a complicated analysis by the brain, and not merely in terms of a simple registration of input by the brain.

There are some practical reasons for trying to understand this constancy. There are engineers that are currently trying to design computers that can understand what a person is saying. However, in a noisy environment, the speaker's voice comes mixed with other sounds. To a naive computer, each different sound that the voice comes mixed with makes it sound as if different words were being spoken, or as if they were spoken by a different person. The machine cannot correct for the particular listening conditions as the human can. If the study of human audition were able to lay bare the principles that govern the human skill, there is some hope that a computer could be designed to mimic it." ([Bre90] page 2)

Scene analysis in audition is concerned with the perceptual questions of deciding how many sound sources there are, what the characteristics of each source are, and where each source is located [Han89]. A baby, for example, imitates its mother's voice, but does not insert the cradle squeaks that occurred simultaneously with the mother's speech. The baby rejects the squeaks as not being part of the perceptual object formed by the mother's voice; the infant has solved the scene analysis problem in audition. Bregman also states the problem a different way: ". . . it would be convenient to be able to hand a spectrogram over to a machine that did the equivalent of taking a set of crayons and coloring in, with the same color, all the regions on the spectrogram that came from the same source. This is what auditory scene analysis is all about."

Sounds or acoustic events are created when physical things happen. The perceptual unit that represents such a single happening is called an auditory stream. A series of footsteps, for example, each represent an individual sound, yet they are usually experienced as a single perceptual event. Streams are a way of putting sensory information together. If the properties "far" and "lion roar" are assigned to one auditory stream, and "near" and "crackling fire" to a different stream, we will probably behave differently than if the distance percepts were reversed [Bre90, Han89].
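Bregman's crayon metaphor can be made concrete with a toy experiment. The sketch below (my own, not from [Bre90]) "colors" each time-frequency cell of a spectrogram by whichever of two known sources dominates it; this is essentially the ideal binary mask used in computational auditory scene analysis, and it presupposes access to the isolated sources, which a real listener never has.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000
t = np.arange(2 * fs) / fs
# Stand-ins for the mother's voice and the cradle squeaks:
voice_like = np.sin(2 * np.pi * 220 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))
squeak_like = np.sin(2 * np.pi * 2500 * t) * (np.random.rand(len(t)) > 0.99)

# Spectrograms of each source computed separately (the "answer key").
f, times, S1 = spectrogram(voice_like, fs=fs, nperseg=512)
_, _, S2 = spectrogram(squeak_like, fs=fs, nperseg=512)

# The "crayon" labels: 1 where source 1 dominates a cell, 2 where source 2 does.
labels = np.where(S1 >= S2, 1, 2)
```

A human listener, of course, must infer such a labeling from the mixture alone, which is exactly the problem scene analysis addresses.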
Many of the ideas of auditory scene analysis can be traced back to the visual work done by the Gestaltists of the early 1900s [Han89]. Visual and auditory events are combined to make the most coherent perceptual objects. Elements belonging to one stream are maximally similar and predictable, while elements belonging to different streams are maximally dissimilar. The Gestalt psychologists' organizational principles of the visual field include:

- Similarity: elements that are similar in physical attributes tend to be grouped.
- Proximity: elements that are close together in space or time tend to be grouped.
- Continuity: elements that appear to follow in the same direction tend to be grouped.
- Common Fate: elements that appear to move together tend to be grouped.
- Symmetry & Closure: elements that form symmetrical and enclosed objects tend to be grouped.

From this perspective, we expect acoustic events that are grouped into one perceptual stream to be similar (e.g., in frequency, timbre, intensity), to be in spatial or temporal proximity, and to follow the same temporal trajectory in terms of frequency, intensity, position, rhythm, etc.

Primitive Segregation

The focus of Bregman's work is on primitive, or unlearned, stream segregation. The following sections qualitatively summarize many of Bregman's findings that are relevant to the cocktail party effect. These ideas begin with general attributes of auditory scene analysis, and move toward, and emphasize, the perception of speech streams.

Grouping Processes. There are two broad classes of grouping processes: simultaneous integration and sequential integration (these can also be called spectral grouping and temporal grouping). Bregman illustrates these groupings with figures in which circles represent tones at particular frequencies. In those figures (after [Bre90]), segregation is stronger in figure 1b than in figure 1a, because the frequency separation between the high and low tones is greater. Similarly, segregation is still greater in figure 1c, where the speed is increased; the tones are more tightly packed in both the visual representation and the auditory stimuli. A stimulus of this kind can be synthesized directly, as sketched below.
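The classic demonstration of sequential integration is a repeating sequence of alternating high (A) and low (B) tones: with a small frequency separation and a slow tempo it is heard as a single "galloping" rhythm, while a larger separation or faster tempo splits it into two streams. Here is a minimal synthesis sketch (my own, not from [Bre90]); the frequencies, tone duration, and repetition count are arbitrary.

```python
import numpy as np

def aba_sequence(freq_a, freq_b, tone_ms, fs=16000, repeats=10):
    """Synthesize the repeating A-B-A-(rest) pattern used in streaming
    experiments. A larger |freq_a - freq_b| or a shorter tone_ms favors
    hearing two segregated streams instead of one galloping rhythm."""
    n = int(fs * tone_ms / 1000)
    t = np.arange(n) / fs
    tone_a = np.sin(2 * np.pi * freq_a * t)
    tone_b = np.sin(2 * np.pi * freq_b * t)
    rest = np.zeros(n)
    cycle = np.concatenate([tone_a, tone_b, tone_a, rest])
    return np.tile(cycle, repeats)

coherent = aba_sequence(500, 550, tone_ms=120)    # tends toward one stream
segregated = aba_sequence(500, 1200, tone_ms=60)  # tends toward two streams
```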
Spatial Location. Primitive scene analysis groups sounds coming from the same location and segregates sounds that originate in different locations. As Cherry and others showed, a person can do a good job of segregating sounds even from monaural recordings. Spatial cues are strongest when they are combined with other auditory cues; spatial evidence is just one cue in a complex scene analysis system. Note also that reflections (e.g., from the room or the body) can significantly alter the received acoustic signals.

"Engineers working in the automatic segregation of concurrent sounds have used spatial separation as a uniquely powerful way of determining whether the sounds have come from the same physical event (usually a talker). Humans use spatial origin too, but do not assign such an overwhelming role to it. They can do quite well at segregating more than one stream of sound coming from a single point in space, for example, from a single loudspeaker." ([Bre90] page 644)
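Spatial cues of the kind just described can be imposed on a signal digitally. A minimal headphone lateralization sketch (my own, not from the sources above) gives one ear a sub-millisecond lead (an interaural time difference) and a small level advantage (an interaural level difference); the particular ITD and ILD values are arbitrary but within natural ranges.

```python
import numpy as np

def lateralize(x, fs, itd_us=400.0, ild_db=3.0):
    """Impose an interaural time difference (ITD) and interaural level
    difference (ILD) on a mono signal. Over headphones, the source is
    heard toward the leading, louder ear."""
    delay = int(round(fs * itd_us / 1e6))   # ITD in whole samples
    gain = 10 ** (-ild_db / 20)             # attenuation for the far ear
    near = np.concatenate([x, np.zeros(delay)])
    far = gain * np.concatenate([np.zeros(delay), x])
    return np.stack([near, far], axis=1)    # columns: left, right

fs = 16000
noise_burst = np.random.randn(fs // 2)
stereo = lateralize(noise_burst, fs)        # heard toward the left
```

Giving two competing signals opposite ITD/ILD signs is one simple way to reproduce the spatial-separation advantage reported in the studies above.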
Spatial Continuity. Sound sources (talkers) and listeners don't move too far or too fast. Experiments have shown that spatial discontinuities break down streams, so spatial continuity must be important in holding streams together.

Loudness Differences. Differences in loudness may not, in themselves, cause segregation, but as with spatial location, such cues may strengthen other stream segregation evidence.

Continuity. Continuous sounds hold together in a single stream better than discontinuous sounds. This continuity can be in fundamental frequency, temporal proximity, shape of spectra, intensity, or spatial origin. It is unlikely that one sound will begin at the same instant that another sound ends; when the spectrum of the incoming sensory data changes suddenly, we conclude that a single sound has started or stopped. A complicated spectrum, for example, may have embedded in it a simpler spectrum that was heard earlier, adjacent to it with no discontinuity. It is therefore reasonable to treat the part of the spectrum that matches the earlier one as a continuation of it, and to treat the remainder as a new sound added to the mixture.

Visual Channel Effects. We tend to perceive sounds as coming from the locations of visual events. Think of the illusion, when watching television or a movie, in which an actor's voice appears to emanate from his mouth regardless of where the loudspeaker is located.

"An example of the interrelationship is that the grouping of sounds can influence the grouping of visual events with which they are synchronized and vice versa . . . the tendency to experience a sound as coming from a location at which visual events are occurring with the same temporal pattern (the so-called ventriloquism effect) can be interpreted as a way in which visual evidence about the location of an event can supplement unclear auditory evidence. The direction of influence is not just from vision to audition, but in the reverse direction as well." ([Bre90] page 653)

"Thus our interpretation of auditory spatial cues is strongly influenced by our perceived visual orientation. Or, more correctly, the highest level of spatial representation involves an integration of information from the different senses." ([Moo89] page 224)

History. Stream analysis processes use history to adjust momentary spatial estimates. We exploit the fact that sounds and objects tend to move slowly in space and time and hence produce coherent structure.

Segregation Time Constant. It takes at least four seconds to build up and segregate a stream, and four seconds for the segregation to decay after the sequence stops. This long time constant probably prevents the auditory system from oscillating under ambiguous conditions. However, a sudden change in the properties of a signal can reset the streaming mechanism more quickly than silence can.

Harmonics and Frequency Modulation. The perceived pitch of a complex tone depends on an estimate of the fundamental frequency of the set of harmonics that make up the tone (even if the fundamental is missing). The scene analysis mechanisms favor grouping harmonics of the same fundamental; thus, if several fundamentals are needed to account for all of the harmonics present, we conclude that there are several sound sources.

"When the pitch rises, not only does the fundamental frequency go up but all the harmonics go up by the same proportion too. It is plausible to believe that this correlated change, if it could be detected auditorily, could tell us that the changing partials all came from the same voice. The auditory system could group all such correlated changes and hear only one changing sound. There is evidence to suggest that two types of frequency change (or modulation) are used for this purpose. One is micromodulation, the tiny fluctuations of the pitch of human voices that occur even when the speakers think they are holding a steady pitch . . . The other type of frequency modulation is the slow kind that occurs when we voluntarily vary the pitch of our voice in a smooth way as we do, for example, when we raise our pitch at the end of a question . . . The synchronization of the micromodulation or of slow modulation in different parts of the spectrum seems to cause those parts to be treated as parts of a single sound." ([Bre90] page 657)

Weighting of Evidence. There is collaboration, as well as competition, among the features used in a stream segregation decision. If the number of factors that favor a particular grouping of sounds is large, the grouping will be strong, and all the sounds will be heard as part of the same stream.
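The common-fate cue described under Harmonics and Frequency Modulation above can be demonstrated by synthesis. In the sketch below (my own illustration, not from [Bre90]), micromodulation is modeled, as one simple assumption, by a slow random pitch fluctuation applied in common to every harmonic of a source; the modulation depth and the chosen fundamentals are arbitrary.

```python
import numpy as np

def complex_tone(f0, fs=16000, dur=1.0, n_harmonics=8, mod_depth=0.01):
    """Sum the harmonics of f0, all sharing one slow random pitch
    fluctuation (micromodulation). Because every partial moves in
    proportion to the fundamental, the partials share a common fate
    and tend to fuse into a single voice-like sound."""
    n = int(fs * dur)
    wobble = 1.0 + mod_depth * np.cumsum(np.random.randn(n)) / np.sqrt(n)
    phase = 2 * np.pi * np.cumsum(f0 * wobble) / fs  # integrate frequency
    return sum(np.sin(k * phase) / k for k in range(1, n_harmonics + 1))

# Two sources with independent micromodulation: their partials interleave
# in frequency, but each set moves coherently with its own fundamental.
mixture = complex_tone(200) + complex_tone(310)
```

In informal listening, such a mixture tends to be heard as two pitched sources rather than one inharmonic cluster, since each fundamental accounts for one coherently modulated set of partials.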