SPEECH ACOUSTICS NOTES
R. Port

March 4, 2000

1. Speech Spectra and Waves

Speech sounds can be represented graphically either as a waveform or as a sound spectrogram. Waveforms are the more difficult to interpret because the relevant detail is spread over a very broad range of time scales; a spectral display is easier to read. For details, see Acoustics Studyguide.

2. Acoustic Theory of Speech Production

This theory says that the spectrum of a speech sound is the product of one (or more) sound sources and the transfer function of the cavities in front of the source, which selectively scale (filter) the source spectrum depending on their size and shape. Then, upon transmission out of the vocal tract, differential diffraction (of low frequencies more than high) gives the signal measured in front of the face a bias toward higher frequencies. This theory is illustrated by the figure, where U(s) represents the source spectrum, T(s) the transfer function, R(s) the radiation characteristic and P(s) the output spectrum, so that P(s) = U(s) T(s) R(s).

This illustration shows a periodic glottal source, but high-pressure airflow through a narrow constriction can instead produce a noisy source (lacking the evenly spaced harmonics of U(s) shown above).
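
As a rough numerical illustration of the product of source, transfer function and radiation characteristic described above, the following Python sketch multiplies three idealized spectra at each harmonic. Everything specific in it is an assumption made for illustration and is not part of the theory as stated in these notes: a 100 Hz fundamental, a source falling 12 dB per octave, radiation rising 6 dB per octave, and three resonances at 500, 1500 and 2500 Hz with guessed bandwidths. The function names are hypothetical.

```python
import math

F0 = 100.0  # assumed fundamental frequency in Hz

# Assumed (center frequency, bandwidth) pairs in Hz for three formants.
FORMANTS = ((500.0, 80.0), (1500.0, 90.0), (2500.0, 120.0))


def source(f):
    """U: idealized glottal source, amplitude falling 12 dB/octave (assumed)."""
    return (F0 / f) ** 2


def transfer(f):
    """T: product of simple two-pole resonances, each with gain 1 at 0 Hz."""
    gain = 1.0
    for fc, bw in FORMANTS:
        half_bw = bw / 2.0
        num = fc ** 2 + half_bw ** 2
        den = math.sqrt(((f - fc) ** 2 + half_bw ** 2) *
                        ((f + fc) ** 2 + half_bw ** 2))
        gain *= num / den
    return gain


def radiation(f):
    """R: radiation characteristic, amplitude rising 6 dB/octave (assumed)."""
    return f / F0


def output(f):
    """P = U * T * R at frequency f."""
    return source(f) * transfer(f) * radiation(f)


if __name__ == "__main__":
    for k in range(1, 31):  # harmonics from 100 Hz to 3000 Hz
        f = k * F0
        print(f"{f:6.0f} Hz  {20 * math.log10(output(f)):7.1f} dB")
```

In the printed list the harmonics nearest each assumed formant stand out as local peaks, even though the source itself falls smoothly with frequency. Changing F0 moves the harmonics without moving the peaks in T, and changing FORMANTS does the reverse.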

3. Implications of the Acoustic Theory of Speech Production.

  1. Given the ability to change vocal tract shape, the theory predicts changes in the transfer function due to articulatory gestures (like narrowing or protruding the lips, moving the tongue, etc.).
  2. The theory predicts that temporal overlapping of articulatory gestures will cause overlapping acoustic effects (that is, coarticulation).
  3. Given differences in physical size of individual vocal tracts, it predicts differences in transfer function due to size.
  4. Predicts independence of a sound source (eg, choice of vocal pitch or voicing vs. noise) and filter (eg, vowel choice).
  5. Predicts certain gestures will have identical acoustic effects (since, eg, lip protrusion has the same effect on the transfer function as lip constriction and larynx lowering). The retroflexed vs. bunched [r] are another example (articulatory equivalence).
  6. Predicts that certain changes in shape will have large acoustic consequences while other shape changes will have smaller acoustic consequences (quantal properties) (See K.N. Stevens).
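
Implication 3 can be made concrete with the simplest textbook model of the vocal tract: a uniform tube closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/4L. The Python sketch below is only that idealization. The speed of sound (35,000 cm/s) and the two tract lengths are assumed values chosen for illustration, and `tube_resonances` is a hypothetical helper name.

```python
SPEED_OF_SOUND_CM_S = 35000.0  # assumed speed of sound in the vocal tract


def tube_resonances(length_cm, count=3):
    """Resonances (Hz) of a uniform tube closed at one end, open at the other.

    Quarter-wavelength model: F_n = (2n - 1) * c / (4 * L).
    """
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * length_cm)
            for n in range(1, count + 1)]


longer_tract = tube_resonances(17.5)   # assumed adult-sized tract, in cm
shorter_tract = tube_resonances(14.0)  # assumed smaller tract, in cm

print("17.5 cm:", longer_tract)
print("14.0 cm:", shorter_tract)
```

In this model every resonance scales inversely with length, so the shorter tube has all of its resonances raised by the same proportion.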

4. Features on Sound Spectrograms.

Sound Spectrogram: a graphic display, over an interval of time (up to 2 seconds or so), of the frequency components that add together to make a sound.
Prominent Features on a Spectrogram: See Figure 8.17 (p. 209) in the Ladefoged text. Note in the upper panel (a wide-band spectrogram): formant 1 (F1), formant 2 (F2), formant 3 (F3), stop closure, fricative noise, and vocal fold pulses (or, loosely, `voicing', seen as regularly spaced vertical stripes). In the lower panel (a narrow-band spectrogram), note formants 1-3 (now harder to see), stop closure intervals, fricative noise (as irregular vertical stripes) and harmonics of the fundamental frequency.
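
The contrast between the two panels follows from the length of the analysis window: a short window separates events in time (individual glottal pulses) but smears frequency, while a long window does the reverse. The Python sketch below uses the rule of thumb that analysis bandwidth is roughly the reciprocal of window duration. The exact factor depends on the window shape, and the 3 ms and 25 ms windows and the 120 Hz fundamental are assumed values, not taken from the Ladefoged figure. The function names are hypothetical.

```python
def analysis_bandwidth_hz(window_ms):
    """Rule-of-thumb frequency resolution: about 1 / window duration."""
    return 1000.0 / window_ms


def shows_harmonics(window_ms, f0_hz):
    """Harmonics lie f0 apart, so they separate only if bandwidth < f0."""
    return analysis_bandwidth_hz(window_ms) < f0_hz


def shows_glottal_pulses(window_ms, f0_hz):
    """Pulses lie one period apart, so they separate only if the
    window is shorter than one glottal period."""
    period_ms = 1000.0 / f0_hz
    return window_ms < period_ms


F0_HZ = 120.0  # assumed fundamental frequency

for label, window_ms in (("wide-band", 3.0), ("narrow-band", 25.0)):
    print(label,
          f"bandwidth ~{analysis_bandwidth_hz(window_ms):.0f} Hz,",
          "pulses visible:", shows_glottal_pulses(window_ms, F0_HZ),
          "harmonics visible:", shows_harmonics(window_ms, F0_HZ))
```

Under this rule of thumb the two conditions are mutually exclusive for a given window, which is why vertical voicing stripes and horizontal harmonics do not appear in the same panel.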

5. Vowels. Regions of strong harmonic energy (with visible glottal pulsing). There are two useful graphs you need to know to understand vowel basics. The positions of F1 and F2 are the main variables.

Frequency X Time Graph. If we pronounce the peripheral vowels in order slowly from [i] to [a] to [u], then F1 starts low (for [i]), rises to a maximum for [a] and then falls back to a low value for [u]. Over the same series, F2 begins at its maximum value for [i] and falls monotonically through [a] to its lowest value for [u]. The outline of the graph can be seen in Figure 8-5 (p. 193).
F1 X F2 Graph. If F1 is plotted against F2 on a plane, then [i], [a] and [u] occupy the corners of a triangle. All the other vowels lie within this triangle. If you rotate the axes the right way, this triangle resembles the triangle of the auditory vowel space and the articulatory space. Ladefoged prefers to plot F1 against (F2-F1), that is, the difference between F2 and F1 (which will be fairly similar to the value of F2). See, eg, Figure 8-7.
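
The two graphs can be cross-checked numerically. The Python sketch below uses round formant values of the kind often quoted for adult male speakers. They are illustrative assumptions, not measurements from these notes or from Ladefoged's figures. It computes the (F1, F2-F1) coordinates for the three corner vowels and picks out which vowel sits at each extreme.

```python
# Illustrative (F1, F2) values in Hz; assumed, not taken from the text.
VOWELS = {"i": (270, 2290), "a": (730, 1090), "u": (300, 870)}


def ladefoged_coordinates(f1, f2):
    """Return the point (F1, F2 - F1) used on Ladefoged-style vowel charts."""
    return (f1, f2 - f1)


for vowel, (f1, f2) in VOWELS.items():
    x, y = ladefoged_coordinates(f1, f2)
    print(f"[{vowel}]  F1={f1}  F2={f2}  F2-F1={y}")

# Corners of the triangle: extreme values of F1 and F2.
highest_f1 = max(VOWELS, key=lambda v: VOWELS[v][0])
highest_f2 = max(VOWELS, key=lambda v: VOWELS[v][1])
lowest_f2 = min(VOWELS, key=lambda v: VOWELS[v][1])
print("highest F1:", highest_f1, " highest F2:", highest_f2,
      " lowest F2:", lowest_f2)
```

With these assumed values F1 peaks at [a] and F2 falls from [i] through [a] to [u], matching the Frequency X Time description.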

6. Stops: temporal regions with low energy (since the vocal tract is stopped) -- usually 50-120 ms in duration. First is a phase of moving toward stop closure, then a steady-state closure phase, followed by a distinct acoustic burst of energy, usually involving a short interval of noise.

Voicing Feature. In principle, vowels and consonants have glottal pulsing when voiced and no pulsing when voiceless. But in English the distinction is more complicated, since timing plays a major role: a longer preceding V and shorter C for voiced, a shorter V and longer C for voiceless.
Place of Articulation. Place cues are visually subtle on sound spectrograms.

Labial: F2 and F3 fall going into a labial and rise coming out.

Apical: High-frequency energy in the stop burst (in the [s] region, above 3300 Hz), plus F2 aiming toward or away from an 1800 Hz `locus'.

Velar: Lower-pitched burst between F2 and F3; often F2 and F3 converge entering the stop and diverge on release. Ladefoged describes similar place cues in Table 8.1 (p. 203).

7. Fricatives. When a narrow constriction is made and air is forced through it under high pressure (due to a closed velum and contracting chest cavity), the airflow becomes turbulent in the constriction, producing noise over a wide range of frequencies.

Place of Articulation. The [f] has weak noise energy with a broad spectrum (since there is no resonating tube). The [s] and [z] fricatives have very high-pitched energy (above 3.5 kHz). The other fricatives have a center frequency related to the length of the cavity in front of the constriction: the longer it is, the lower the pitch. For [sh], the energy peak lies between F2 and F3.
Voicing. 1. Fricatives have weaker acoustic energy when voiced than voiceless (due to lower air flow). 2. The cues for place resemble those of the stops.
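
The relation between front-cavity length and fricative pitch noted under Place of Articulation can be approximated by treating the cavity in front of the constriction as a short tube, closed at the constriction and open at the lips, with its lowest resonance at c/4L. This is an idealization. The speed of sound and the cavity lengths in the Python sketch below are assumed values, and the function names are hypothetical.

```python
SPEED_OF_SOUND_CM_S = 35000.0  # assumed speed of sound


def front_cavity_peak_hz(length_cm):
    """Lowest resonance (Hz) of a front cavity modeled as a quarter-wave tube."""
    return SPEED_OF_SOUND_CM_S / (4.0 * length_cm)


def front_cavity_length_cm(peak_hz):
    """Inverse relation: cavity length (cm) whose lowest resonance is peak_hz."""
    return SPEED_OF_SOUND_CM_S / (4.0 * peak_hz)


# Assumed cavity lengths in cm, from short to long.
for length_cm in (1.5, 2.5, 3.5):
    print(f"{length_cm:.1f} cm front cavity -> "
          f"peak near {front_cavity_peak_hz(length_cm):.0f} Hz")

print("length for a 3500 Hz peak:", front_cavity_length_cm(3500.0), "cm")
```

Under this model, noise energy above 3.5 kHz corresponds to a front cavity shorter than about 2.5 cm, and lengthening the cavity lowers the peak.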

8. Nasals. Nasal stops usually look weaker in energy than vowels and are usually steady-state. Nasalization on vowels causes some `blurring' of the formant energy but is generally difficult to see.

9. Glides. The glides and resonants show strong formants sweeping up or down. English [r] has F3 dipping quite low (usually below 2 kHz). The [l] usually has F2 low but F3 high, while [w] has both F2 and F3 lowered.