This handout presents some basic information about human hearing, focusing on what seems most relevant for language. At the end, there is a discussion of how speech is recognized and language understood. For a more systematic survey of hearing science on the internet, see Chris Darwin's online tutorial at the University of Sussex on major topics in auditory psychophysics.
1. Anatomy and physiology of the ear.
The BASILAR MEMBRANE (BM) is the spiral-shaped organ within the cochlea that mechanically separates the frequency components of sound. Different frequencies cause local maxima in the amplitude of the travelling wave that moves down the basilar membrane at different locations. High frequencies produce their maximum response at the low (early, or basal) end, closest to the oval window; low frequencies travel farther down the BM before reaching a maximum. The lowest frequency with a unique location (at the very top end, or apex, of the cochlea) is about 200 Hz.
Organ of Corti is a spiral structure lying on the basilar membrane that converts membrane motion into neural pulses. It is designed so that 4 rows of hair cells (one row of inner hair cells and 3 rows of outer hair cells) are mechanically disturbed by the passage of the sound wave along the BM. Each hair cell connects to a neuron that takes the signal to the central auditory system.
The Auditory Pathway is the sequence of nuclei in the brain stem leading from the cochlea up to the auditory cortex in the temporal lobe of the cerebrum. The primary connections from the two cochlear nuclei are to the opposite hemisphere. Each nucleus (the cochlear nucleus, superior olive, lateral lemniscus, inferior colliculus and thalamus) has a tonotopic map (that is, a frequency-based map). Some amount of lateral inhibition appears to occur at each level. The target in cortex on each side is the superior and interior surface of the temporal lobe.
LATERAL INHIBITION: At several locations in the auditory system there are `frequency maps', that is, places where each cell in a sheet of cells responds maximally to input at a particular frequency, with its neighbors in one direction responding to higher frequencies and those in the other direction to lower frequencies. The first of these maps is on the basilar membrane itself. Since almost any input wave physically disturbs the hair cells along much of the membrane's length, there is a problem of cleaning up this disturbance so that only the appropriate hair cells (for the input frequency) transmit their signal. On such a map, if each nerve cell inhibits its neighbors in proportion to its own excitation, then a blurry pattern (such as the grey envelope of mechanical disturbance in the figure) can be sharpened (into the narrow peak below it). If cell A inhibits its neighbor, cell B, with value 4, while B inhibits A with value 6, then very shortly A will slow its firing but B will continue. Thus, if there is a whole field of such cells, each inhibiting its neighbors on both sides for a little distance, then only the most excited cell in each neighborhood will win. This is how the gross wave motion down the BM can be sharpened so that the peak of the envelope of membrane motion is identified.
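This winner-take-all sharpening can be sketched in a few lines of code. The toy model below is my illustration, not a physiological model: the inhibition strength, neighborhood radius, number of iterations, and the made-up `envelope' of mechanical disturbance are all arbitrary illustrative choices.

```python
def lateral_inhibition(activity, strength=0.4, radius=2, steps=20):
    """Iteratively sharpen a 1-D activity pattern: each cell is inhibited
    in proportion to the excitation of its neighbors (toy parameters)."""
    a = list(activity)
    for _ in range(steps):
        new = []
        for i in range(len(a)):
            neighbors = sum(a[j] for j in range(max(0, i - radius),
                                                min(len(a), i + radius + 1))
                            if j != i)
            # firing rate cannot go below zero
            new.append(max(0.0, a[i] - strength * neighbors / (2 * radius)))
        a = new
    return a

# A blurry "envelope" of mechanical disturbance with its peak at index 4:
envelope = [0.1, 0.3, 0.6, 0.8, 1.0, 0.8, 0.6, 0.3, 0.1]
sharpened = lateral_inhibition(envelope)
# After a few iterations only the cell at the envelope peak is still firing.
```

With these settings the neighboring cells are driven to zero within about ten iterations while the peak cell keeps a positive firing rate, so only the location of the envelope maximum survives.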
Of course, we need to hear more than one frequency at a time, so these neighborhoods of inhibition are fairly small. If you hear a vowel with several independent formant frequencies, then there will be multiple local peaks of activity. Thus something fairly similar to a spectrum slice is produced along the basilar membrane at each moment in time, for energy above 200 Hz.
PLACE CODING vs TIME CODING: As noted, only frequencies above 200 Hz have unique maxima on the BM. This means that particular fibers in the auditory nerve are maximally active when their frequency is present in the sound. But the mechanical action of the cochlea is such that frequencies below about 4000 Hz are also transmitted directly as waveforms into the auditory nerve. That is, for low frequencies, the BM acts rather like a microphone, translating mechanical pressure waves into waves of neural activation in time. This means that, between about 200 Hz and 4 kHz, there is both a place code (where activity in specific fibers indicates spectral components in the sound) and a time code (where the temporal pattern of sound waves appears as a temporal pattern of fiber activation). Below 200 Hz, there is only a temporal representation.
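The two codes can be illustrated with a crude computational analogy (an illustration of the distinction only, not a model of the cochlea; the sampling rate, tone frequency, and detector frequencies are all made up): the same 300 Hz tone can be identified either by which member of a bank of frequency-tuned detectors responds most strongly (place) or by timing the intervals between upward zero crossings of the waveform itself (time).

```python
import math

RATE = 16000
TONE = [math.sin(2 * math.pi * 300 * n / RATE) for n in range(RATE // 10)]  # 100 ms of a 300 Hz tone

def place_estimate(signal, candidates):
    """'Place' analogy: which frequency-tuned detector responds most strongly?"""
    def energy(f):
        c = sum(x * math.cos(2 * math.pi * f * n / RATE) for n, x in enumerate(signal))
        s = sum(x * math.sin(2 * math.pi * f * n / RATE) for n, x in enumerate(signal))
        return c * c + s * s
    return max(candidates, key=energy)

def time_estimate(signal):
    """'Time' analogy: frequency from intervals between upward zero crossings."""
    ups = [n for n in range(1, len(signal)) if signal[n - 1] < 0 <= signal[n]]
    mean_period = (ups[-1] - ups[0]) / (len(ups) - 1)
    return RATE / mean_period

place = place_estimate(TONE, [100, 200, 300, 400, 600])
timed = time_estimate(TONE)
print(place, round(timed))  # 300 300
```

Both routes recover 300 Hz; in the real auditory nerve both are available between roughly 200 Hz and 4 kHz, as described above.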
The image below, from Shihab Shamma (Trends in Cognitive Sciences, 2001), shows the basilar membrane (unrolled) on the left in (a), with computationally simulated auditory nerve representations of 3 patterns in (b). The top panel, marked (i) inside the image, is a low-intensity pattern of a 300 Hz plus a 600 Hz tone. Notice that the 2 sine waves each produce a horizontal bar (if you smooth in time) at 300 Hz and 600 Hz on the vertical scale of center frequencies (CFs) in Hz; but also, along the time scale, you can count each cycle of the 2 waves -- the 600 Hz tone has twice as many cycles as the 300 Hz tone. Next down, (ii) shows the same tones at much higher intensity. The third panel is the phrase ``Right away'' (probably by a female talker with F0 around 200 Hz). Column (c) suggests the lateral inhibitory network (in some lower brainstem nuclei). Column (d) shows the simulated spectrogram available to the higher-level auditory system for analysis. Notice that (b i) and (b ii) look about the same in the (d) column despite large differences in the auditory nerve excitation patterns of column (b).
For the speech example in (iii), notice that the auditory nerve excitation pattern (in panel b iii) is very difficult to interpret (even though several harmonics are pointed out by arrows along the right edge), but in (d iii), after cleanup by lateral inhibition, you can see about 5 distinct harmonics in the low-frequency region, as well as F2, F3 and F4 (where the harmonics are merged into formants). Notice the rising F3 sweep for the [r] of right (near the upper left corner of the figure) and the F2 sweep for the [wej] of away (in the center and right of the figure). Where is F1? It must still be extracted from the first 4-5 harmonics. This display shows approximately the information available to the central auditory system, that is, to cortical area A1 and other auditory areas. Clearly it resembles a traditional sound spectrogram, except that the lower-frequency region looks like a `narrowband' spectrogram (separating the harmonics of F0) while the higher regions look like a `wideband' spectrogram (merging the harmonics into formants). Keep in mind, of course, that the time scale here is a technological trick: the ear just produces a vertical slice of this image at each moment in time, then moves on to the next vertical slice. And actual pattern recognition for identifying words is, of course, still required.
There is one additional kind of shaping of the acoustic signal that Shamma left out here, given the particular example utterance he chose. If there are abrupt acoustic onsets in the signal (such as at the 3 vowel onsets of `soda straw'), then there is a nonlinear exaggeration of the energy onsets (Delgutte, 1997). This makes vowel onsets far more salient than the onsets of high-frequency consonant energy (as in the syllables so- and straw). This can be seen clearly in the figure below, showing the response of 4 fibers in the auditory nerve (of a cat) across the F1 frequency range. These fibers have center frequencies (CFs) at about 400 Hz, 560 Hz and 780 Hz. The arrows below point to the locations of abrupt audio onsets. Note the beginning of the [a] in `father'. Notice that each vowel onset (marked with black arrows below) causes a very sharp peak in auditory nerve response when the vowel begins. These events of F1 onset are what speakers line up temporally with a `beat', e.g., a pencil tap -- not the onset of high-frequency energy (e.g., the burst of a [t] or the onset of an [s]). The reason the high-frequency onsets are mostly ignored is that the low frequencies contain the majority of the physical energy in speech signals.
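The onset exaggeration can be sketched as a simple nonlinearity (a crude stand-in for the auditory-nerve adaptation Delgutte describes; the gain and the envelope values are invented for illustration): add to the energy envelope a term proportional to its positive rate of change, so abrupt rises produce a response far above the steady-state level.

```python
def onset_enhanced(envelope, gain=4.0):
    """Exaggerate abrupt onsets: add a term proportional to each
    positive step in the energy envelope (toy adaptation model)."""
    out = [envelope[0]]
    for prev, cur in zip(envelope, envelope[1:]):
        out.append(cur + gain * max(0.0, cur - prev))
    return out

# A weak gradual rise followed by an abrupt vowel-like onset:
env = [0.0, 0.125, 0.25, 0.25, 0.25, 1.0, 1.0, 1.0]
resp = onset_enhanced(env)
print(max(resp))  # 4.0: the abrupt onset far exceeds the steady-state response of 1.0
```

The abrupt step is marked with a sharp transient peak, while the slow ramp is barely enhanced, which is why vowel onsets stand out as `beats' in the auditory nerve response.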
2. Psychophysics
Psychophysics is the study of the relationship between physical stimulus properties and the sensory responses of the human nervous system, as reported by subjects in experimental settings. Auditory psychophysics is concerned specifically with hearing.
PITCH is a psychological dimension closely related to physical frequency. The relation between perceived pitch and frequency is roughly linear from 20 Hz to about 1000 Hz (thus including the range of F1 but only the lowest part of the range of F2). Thus the perceived amount of pitch change between 200 Hz and 300 Hz is about the same as the change from 800 Hz to 900 Hz. From 1000 Hz to 20 kHz, the relation is logarithmic. Thus the pitch change between 1000 Hz and 1100 Hz (a 10% difference) is about the same as the difference between 5000 Hz and 5500 Hz (also 10%, although 5 times larger in linear Hz), and the same as between 12000 Hz and 13200 Hz (again 10%). An increase of 100 Hz beginning at 15 kHz is only about 1% and is just barely detectable. There are several scales used to represent pitch (such as the mel scale and the bark scale) on which equal differences represent equal amounts of pitch change. The figure below shows mels on the vertical scale plotted against two different frequency scales: the solid line is labelled across the top in linear Hz, and the dotted line is labelled across the bottom in log Hz.
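The quasi-linear-then-logarithmic shape described above is captured by standard engineering formulas for the mel scale. The sketch below uses O'Shaughnessy's common formula, which is an approximate fit to the perceptual data (not the exact curve in the figure): the scale is anchored so that 1000 Hz is about 1000 mels, and well above 1000 Hz equal frequency ratios (rather than equal Hz steps) give pitch steps of similar size.

```python
import math

def hz_to_mel(f):
    """O'Shaughnessy's formula for the mel scale: 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

print(round(hz_to_mel(1000)))            # 1000 -- the scale's anchor point
# In the logarithmic region, two 10% steps give similar pitch changes,
# even though they differ by hundreds of Hz:
step_5k  = hz_to_mel(5500)  - hz_to_mel(5000)    # roughly 94 mels
step_12k = hz_to_mel(13200) - hz_to_mel(12000)   # roughly 102 mels
```

The two 10% steps come out within about ten mels of each other, in line with the text's claim that they sound like roughly equal pitch changes.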
LOUDNESS is a psychological dimension closely related to physical intensity. Below is a graph of equal-loudness contours on the plane of frequency (displayed on a log scale) against intensity (sound pressure level, or dB SPL), a measure of physical energy in sound. The numbers within the graph represent `loudness levels' obtained by playing pure tones at various points in the plane and asking subjects to adjust a 1000 Hz tone to have the same perceived loudness. Notice that for both low frequencies and very high frequencies, greater intensity is required to achieve the same loudness as a 1000 Hz tone. (This is why you need to boost the bass on your stereo system when you turn the volume down low: otherwise the lows become inaudible.)
The significance of the pitch graph above is that only low frequencies are resolved well. At high frequencies, frequencies are scrunched together and only large differences are noticeable. Notice also that musical scales are basically logarithmic: going up one octave doubles the frequency. So the musical scale matches auditory pitch only above about 1000 Hz. Both barks and mels turn out to be interpretable in terms of linear distance along the basilar membrane. That is, listeners' judgments about the amount of pitch difference are largely judgments about distance along the basilar membrane.
CRITICAL BANDS. What happens as frequencies are `scrunched' at higher frequencies? Why are the harmonics in the Shamma figure merged for F2 and F3 but not for F1 in panel (d iii)? The ear, like any other frequency analyzer, employs a bank of filters so that it can measure the energy level over some range of frequencies. These filters in the ear have come to be called `critical bands', and they are wider in Hz at higher frequencies than at lower frequencies. Below 800 Hz or so, the filters have a roughly constant width of about 100 Hz, but they get wider and wider as frequency goes up (thus blurring frequency distinctions). The width of a critical band is, however, roughly constant when measured as distance along the basilar membrane (just like the mel scale!). Thus a scale for pitch can be defined using critical bands as the units. This is what the bark scale does (developed by Zwicker in 1960), running from 1 to 24 barks over the range from 20 Hz to about 15 kHz.
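Traunmüller's formula is a widely used closed-form approximation to Zwicker's bark scale (an engineering fit, not Zwicker's original tabulated values). Given the formula, the critical bandwidth around a frequency can be read off as the number of Hz spanning one bark, and it widens just as described above: roughly 100 Hz near 500 Hz but several hundred Hz by 4 kHz.

```python
def hz_to_bark(f):
    """Traunmueller's (1990) approximation to Zwicker's bark scale."""
    return 26.81 * f / (1960.0 + f) - 0.53

def critical_bandwidth(centre):
    """Hz needed to climb one bark upward from `centre` (crude 1 Hz search)."""
    target = hz_to_bark(centre) + 1.0
    f = float(centre)
    while hz_to_bark(f) < target:
        f += 1.0
    return f - centre

# One-bark widths at a few center frequencies: about 121, 177 and 763 Hz.
widths = {c: critical_bandwidth(c) for c in (500, 1000, 4000)}
```

The widening bandwidths are what merge the closely spaced harmonics of F2 and F3 into formants while leaving the low harmonics around F1 resolved.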
Another issue is the neural representation of intensity differences, that is, of the acoustic dynamic range. As noted, human hearing has a dynamic range of over 100 dB. But an individual neural cell rarely has more than a very small dynamic range in activation level -- between background-level firing (meaning no detected signal) and its maximum rate of firing. The nervous system has a variety of tricks, some quite complex, to achieve the 100 dB dynamic range.
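For concreteness (this is just the standard definition of the decibel, nothing specific to this handout): a level difference in dB corresponds to 20 times the log10 of a sound pressure ratio, so a 100 dB range spans a 100,000-fold range of pressure -- far beyond what one neuron's firing rate can encode directly.

```python
def db_to_pressure_ratio(db):
    """Sound pressure (amplitude) ratio for a level difference in dB SPL."""
    return 10 ** (db / 20.0)

print(db_to_pressure_ratio(100))  # 100000.0 -- five orders of magnitude of pressure
```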
3. Pattern Learning.
So far, we have discussed the way sound is transformed when it is converted to neural patterns and `sent' (as they say) to the brain. Naturally, what counts is pattern recognition, and the patterns that matter for speech are primarily those due to articulatory gestures moving over time. There are neural systems specialized for recognition of time-varying patterns of a very general nature, and gradually neural pattern recognizers develop for the patterns appropriate to the environmental auditory events to which one is repeatedly exposed. This kind of learning seems to occur at the level of auditory cerebral cortex in the temporal lobe. The learning is rote and concrete -- specified in a space of specific frequencies and temporal intervals.
4. Speech Perception
PRESPEECH AUDITION. There is now a great deal of evidence that children during their first year learn to recognize the important ``sounds'' of their ambient language (Werker and Tees, 1984). The allophone-like spectro-temporal trajectories that occur frequently are recognized as fragments of various sizes. The fragments that are particular to speech signals could be called `auditory-phonetic features.' Children begin developing these even before birth, and we continue to learn new ones throughout our lives. Any sound patterns that occur only very infrequently, or have never been heard before, tend to be ignored or grouped with near neighbors that do occur frequently. (These are the phenomena that justify `categorical perception,' the `perceptual magnet effect,' etc.) This clustering, or partial categorization, of speech sounds begins to take place at an age when infants produce no words and recognize only a small number. (And they certainly know nothing about alphabets yet.)
There has been some research on listeners' ability to learn complex nonspeech auditory patterns, though not as much as we need. If a very complex but completely unfamiliar auditory pattern is played to adults, they will be able to recognize very little about it. But if the pattern is repeated many times, subjects eventually construct a very detailed description of it. Charles S. Watson did many experiments using a sequence of 7-10 brief pure tones chosen randomly from a very large set of possible tones. (The example in the figure below has 4 tones.) A complete 10-tone pattern lasted less than a second. Such patterns are completely novel and extremely difficult to recognize. In a typical experiment, Watson would play some random tone pattern 3 times in a single trial (the presentations called A, B and X respectively), changing the frequency of one of the tones between the A and B presentations.
The subject then has to say whether X, the third pattern instance, is the same as the first or the second (known as an ABX task). If the patterns are novel, this is an extremely difficult task. However, if a subject is given the same pattern for many trials, they eventually learn the details of that pattern and can detect small changes (delta f) in one of the tones. That is, listeners learn a detailed auditory representation for any pattern, given enough opportunities to listen to it.
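Watson-style stimuli are easy to sketch. In the toy generator below, the tone count, the frequency pool, and the size of the frequency change are illustrative assumptions, not Watson's actual experimental parameters: a random pattern A, a copy B with one tone's frequency shifted, and an X drawn from one of the two.

```python
import random

def make_pattern(n_tones=10, pool=range(300, 3000, 25)):
    """A random Watson-style tonal pattern: a list of tone frequencies in Hz."""
    return [random.choice(list(pool)) for _ in range(n_tones)]

def perturb(pattern, delta_f=50):
    """Return a copy with one randomly chosen tone shifted by delta_f Hz."""
    changed = list(pattern)
    i = random.randrange(len(changed))
    changed[i] += delta_f
    return changed

random.seed(1)
a = make_pattern()          # the A presentation
b = perturb(a)              # the B presentation: one tone's frequency changed
x = random.choice([a, b])   # the X presentation: does it match A or B?
print(sum(1 for p, q in zip(a, b) if p != q))  # 1 -- exactly one tone differs
```

For a naive listener, deciding whether X matches A or B requires noticing a 50 Hz change in one brief tone out of ten, which is why performance is near chance until the pattern has been heard many times.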
Perception of complex patterns like speech is a very difficult problem. The traditional view about language is that listeners store words in a form resembling their written form, using abstract, letter-like units called phones or phonemes. Each such unit is supposed to have a simple description that applies to all its instances. However, theories of speech perception based on such units have had little success (see Liberman et al., 1967; Port, 2007), and automatic speech recognition systems that work by recovering the phones (or phonemes) from which words are spelled have also had little success (see Huckvale, 1997). There is an alternative view that rejects alphabet-like representations in favor of storing words as detailed, concrete auditory patterns, that is, as trajectories in auditory space (see, e.g., Kluender, Diehl and Lotto). On this view, speech perception is acquired like other complex auditory (or visual) patterns -- by simply storing a great many rather concrete (that is, close-to-sensory) representations of instances or tokens (sometimes called exemplars). This hypothesis places much greater burdens on memory, of course, but humans are capable of remembering lots of detail (Port, 2007).
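A minimal sketch of the exemplar idea follows. This is my illustration, not an implementation from the literature: the word labels, the (F1, F2) frame values, and the use of dynamic time warping as the distance measure are all assumptions made for the example. Store concrete auditory trajectories with their labels, and classify a new token by its nearest stored exemplar, warping in time so that tokens of different durations can be compared.

```python
def dtw(seq_a, seq_b):
    """Dynamic-time-warping distance between two trajectories,
    each a list of feature frames (tuples of numbers)."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sum((p - q) ** 2 for p, q in zip(seq_a[i - 1], seq_b[j - 1])) ** 0.5
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def recognize(token, exemplars):
    """Label a token by its nearest stored exemplar (1-nearest neighbour)."""
    return min(exemplars, key=lambda ex: dtw(token, ex[1]))[0]

# Toy "auditory trajectories": (F1, F2) frames for two hypothetical words.
exemplars = [
    ("bead", [(300, 2300), (310, 2250), (320, 2200)]),
    ("bad",  [(700, 1700), (720, 1650), (700, 1600)]),
]
token = [(305, 2280), (315, 2230), (318, 2210), (320, 2190)]  # a longer "bead"-like token
print(recognize(token, exemplars))  # bead
```

A realistic exemplar memory would hold thousands of multidimensional trajectories per word and use some statistical pattern recognizer rather than bare nearest-neighbour matching, but the principle -- comparison against stored concrete tokens, not against abstract symbol strings -- is the one at issue.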
5. Conclusions
When we become skilled at speaking our first language, we become effective categorizers of the speech sounds, the words of our language, and so on. So abstract categories of speech sound (like letters, phonemes, syllable types, etc.) can eventually be learned. A category is a kind of grouping of specific phenomena into classes based on some set of criteria. (Of course, a category is only remotely similar to the kind of symbol token usually assumed to be the output of the speech perception process. Symbols, like phonemes and phones, are assumed to be stable indefinitely in time and to be identified flawlessly. See Port and Leary, 2005.)
My proposal is that speech perception is based on nonspeech auditory perception. Each child learns a descriptive system of auditory-phonetic features appropriate to the language he is exposed to. These features are both spectral and temporal; that is, they represent patterns of frequencies that extend in time. But since they are developed in the individual child, the features surely differ in detail from speaker to speaker. Still, high-dimensional representations built from these features -- records of specific utterances -- can be stored in large numbers. A body of detailed auditory records in memory provides, on this view, the database on which linguistic categorization decisions are made using, presumably, some form of statistical pattern recognition.
I must confess that this is a minority opinion regarding speech perception, and I grant that it violates our intuitions, which assure us that words are spelled from abstract, invariant letter-like units. My view is that those intuitions have been shaped by our extensive training in using letters to read and write, so our intuitions here seem to obscure the way language perception actually works (see Port, 2006; Port and Leary, 2005). As far as I can tell, the experimental evidence consistently supports a mostly concrete auditory basis for speech perception. Any letter-like description of speech probably plays a very small role in realtime speech production and perception. (Such low-dimensional, phonological descriptions suit the patterns in the speech data of a community -- the database speakers learn from -- but do not represent well the patterns realized within each individual speaker.)
This has been an overview of hearing science as it relates to speech perception. The first focus was on the transformations of the audio signal performed by the peripheral and central auditory system for presentation to the learning system. The ear and lower auditory system process the speech signal, in part, by producing a realtime frequency analysis. Unlike the ``auditory spectrogram,'' however, there is no neural representation of the time axis: time is really just time, not space. Finally, a view of speech perception was presented that relies on concrete auditory representations of speech signals, rather than on the abstract cues that would be required for recognizing abstract symbols.
References
Delgutte, Bertrand (1997) Auditory neural processing of speech. In Hardcastle and Laver (eds.) The Handbook of Phonetic Sciences. Oxford University Press, pp. 507-538.
Diehl, Randy, Andrew Lotto and Lori Holt (2004) Speech perception. Annual Review of Psychology 55, 149-179.
Liberman, Alvin M., Franklin S. Cooper, Donald P. Shankweiler and Michael Studdert-Kennedy (1967) Perception of the speech code. Psychological Review 74, 431-461.
Moore, Brian (1997) Introduction to the Psychology of Hearing, 4th Ed. Academic Press.
Pickles, James (1988) An Introduction to the Physiology of Hearing, 2d Ed. Academic Press.
Port, Robert (2006) The graphical basis of phones and phonemes. In Murray Munro and Ocke-Schwen Bohn (eds.) Second Language Speech Learning: The Role of Language Experience in Speech Perception and Production. Amsterdam: Benjamins, pp. 349-365.
Port, Robert (2007) How are words stored in memory? Beyond phones and phonemes. New Ideas in Psychology (Elsevier)
Shamma, Shihab (2001) On the role of space and time in auditory processing. Trends in Cognitive Sciences 5, 340-348.
Werker, Janet and Richard Tees (2005) Speech perception as a window for understanding plasticity and commitment in language systems of the brain. Developmental Psychobiology, 46(3), 233-251.
Watson, C. S., William Kelly and Henry Wroton (1976) Factors in the discrimination of tonal patterns. II. Selective attention and learning under various levels of stimulus uncertainty. Journal of the Acoustical Society of America 60, 1176-11