Phonetics and Motor Activity[1]

 

Robert F. Port

Department of Linguistics,

Indiana University

Bloomington, Indiana, 47405

September 14, 2000

 

 Introduction

The task of phonology and phonetics might be said to be to clarify the form of language as it interfaces with the physical world. This question is important for the theory of linguistics, but the traditional view in linguistics is that certain things are known in advance about this interface, things that follow from the nature of language itself. On this view, language is a cognitive or mental system that interacts with the physical world of speech articulation, sound, ears and auditory perception through a set of a priori mental symbols. It is assumed that phonetics can, and eventually will, provide a description of this interface in a form resembling beads on a string, that is, as discrete symbolic objects. These objects are conceived as an inventory of nameable tokens, like the IPA phonetic alphabet or the set of phonetic features in Chapter 10 of The Sound Pattern of English (Chomsky and Halle, 1968). So linguistic theory places very specific demands on what the interface must be like. In particular, phonetics should provide an alphabet, that is, a list of phonetic tokens with which all languages can be satisfactorily described. It should discover the universal inventory of phonetic elements: static, serially ordered, segmental phonetic features that can be employed in simultaneous or temporally overlapping data structures.

These characteristics are not merely convenient working assumptions that can be verified or revised as the data come in. They are required by the premise that language is a formal system. Language, like any other formal system, must be supplied with a set of formal atoms – discrete, static, and a priori (that is to say, universal) tokens – from which the rest of the phonology of a language (and the speaker's cognitive dictionary) can be constructed.

What about time? What determines how long various events last? The formalist response is that the answer is not known, but that it is difficult enough to handle just the problem of serial order. Accounting for the action of these formal systems in continuous time (e.g., issues like rates of change between cognitive states, rates of temporal decay of representations, etc.) is too difficult a problem to be tackled all at once. And surely, they reason, the serially ordered description provided by a formal model will provide important and useful constraints on how continuous-time accounts must work. Since segmental descriptions are the simplest, they should be sought first. It will be argued in this manuscript that Chomsky's formalist strategy for language research may have been a reasonable bet when generative grammar began in the 1960s, but that it turns out to have been a serious mistake, one that restricts our ability to see many kinds of linguistic phenomena.

Since modern linguists nearly always conduct their research in phonology and historical linguistics on these assumptions, most linguists rightly consider phonetics to be an important research area. But a summary observation from over 50 years of phonetics research is that, despite a few obvious similarities between the sounds of different languages, the motor patterns for speech gestures in different languages are not in general commensurable. One might hope, for example, that the phenomena of foreign accent (a problem that arises due to phonetic and phonological differences between languages) could be accounted for entirely by using the universal phonetic alphabet to describe all the differences. But, given what has been observed about speech details, it is extremely unlikely that any such alphabet exists in human nervous systems – as I argued in my exchange with Prof. Manaster Ramer in the Journal of Phonetics in 1996 (Manaster Ramer, 1996a; Port, 1996; Manaster Ramer, 1996b).

One of the reasons that I hold such a pessimistic view of universal phonetic atoms is that work on temporal aspects of speech keeps turning up phenomena that do not fit with such a view. As will be shown below, rather than phonetics research ``homing in'' over the years on a short list of universal phonetic symbols, it keeps finding, at the very least, an extremely large list of properties of speech that appear to be ``under grammatical control''. Not only would the number of phonetic atoms be unreasonably large, but these phonological properties frequently do not resemble symbols – that is, static configurations of ordered or hierarchical tokens – like beads. Instead, some of these behaviors look like simple, routine activities of dynamical systems that are familiar to physicists and mathematicians: mass-spring systems, pendula and mathematical creations like van der Pol's equations (see, e.g., Abraham and Shaw, 1982; Norton, 1995; Kelso, Ding and Schöner, 1993). Furthermore, the patterns that are found seem not to be intrinsically cognitive at all, but much more fundamental. Some are more like universals of motor behavior and are common not just to various languages, but also to nonspeech human behavior, and even to animal motor behavior. One implication of these broad generalizations is that it may be a mistake to assume that individual human languages are exclusively cognitive or mental. Instead, languages are situated in human brains (see, e.g., Clark, 1997) and thus should be expected to exhibit some properties of general animal behavior and sometimes even of inanimate physical systems.

Many of these generalizations emerged from study of the `speech cycling' task in my lab, where speakers are asked to repeat a short fragment of text over and over. Simple repetition of a short piece of text seems inevitably to lead to rhythmic production; however, it cleans up our data to stabilize the rate of repetition by providing a metronome at a comfortable repetition rate. The specific topic we will focus on here is the notion of entrainment. This term describes cases where one dynamical system (most often an oscillator) influences the behavior of another such that they become time-locked: they go through the various phases of their cycles in a fixed relationship.

A familiar large-scale example is a parent pushing a child on a park swing. Given the child's mass and the length of the chain, the child-swing combination will have some particular rate of oscillation that is its resonant frequency. If you just give the child a single shove, it will oscillate at that particular frequency with a period of, let's say, 4 seconds (2 s forward and 2 s back; the period remains nearly invariant across a wide range of swing amplitudes). If the parent wants to give the child a series of pushes in a comfortable and convenient way, the natural behavior is for the parent to entrain their own body to the child-swing pendulum system. The parent then, just by trying both to keep the swing going and to minimize their own effort, flexes and extends their whole body – legs, trunk, arms, hands – at the natural oscillatory period of the swing and child: 2 s forward and 2 s back. The two oscillators have come to behave as a single complex system – swing seat, child, chains and daddy – all of them oscillating in complex synchrony. The event is constrained partly by physics, partly by motor control and partly by cognition. We would say, then, that the parent has entrained themselves to the child-swing system. Of course, if the parent gets tired, there are other frequencies of entrainment that would also be satisfactory. The parent could push the child on every other cycle, oscillating their body not with a 4 s period but with an 8 s period. Other periods, like 6 s, will not work so well, since the parent would find the child-swing in position to receive a push only on every other attempt (that is, every 12 s). On the intervening pushes, daddy risks falling down.

 

Self-entrainment

It’s a very important characteristic of the behavior of animal bodies, one often overlooked, that animals frequently exhibit self-entrainment in their physical activity.  By self-entrainment I mean that a gesture by one part of the body tends to entrain gestures by other parts of the body.

This property is well known in the literature of motor control. Researchers on motor control have commented many times on its general importance (e.g., Bernstein, 1967; Kelso, 1995; Turvey, 1990; Thelen and Smith, 1994). However, cognitive science and linguistics, due to a hasty and ill-examined commitment to the symbolic, formal approach, tend not to appreciate the problem of time (see Port and van Gelder, 1995; van Gelder and Port, 1995). If one is interested in time, then any such simple temporal constraint should immediately strike one as potentially of great help in understanding the production and perception of patterns like words and sentences. In this paper we will survey some everyday illustrations of periodic, oscillatory behavior, review a couple of cases from the experimental literature and then report results on speech suggesting that even ordinary human speech exhibits self-entrainment more or less as soon as it is given the opportunity to do so.

 

Self-entrainment in Everyday Activity

Research on motor behavior in humans, and also in other animals like fish, shows that when one gesture is performed simultaneously with a second gesture – even one by a distant part of the body – the two gestures have a strong tendency to entrain themselves to each other. That is, cyclic gestures that could, in principle, be completely independent tend not to be: for example, walking and waving the hand, or repeated reaching with the two hands, or wagging the index fingers of each hand. Different gestures tend to cycle in the ratio 1:1 or 1:n (e.g., 1:2, 1:3, etc.) or m:n (e.g., 2:3 or 3:4), where m and n are very small integers. For example, most joggers have noticed that during steady-state jogging, their breathing tends to lock into a fixed relationship with the step cycle – with, say, 2 or 3 steps to each inhalation cycle, or perhaps 3 steps to 2 inhalations. To the runner it feels easier and less effortful – and it probably is (cf. Bramble and Carrier, 1983).

Similar phenomena have been observed in the laboratory in various forms for over a hundred years (see references in Collier and Wright, 1995 and Treffner and Turvey, 1993). For example, in one recent study (Treffner and Turvey, 1993, Experiment 3), subjects were asked to sit in a chair with their arms resting on the chair arms and to swing a pendulum along each side of the chair. The pendula were lengths of broomstick with varying weights attached at one end, giving different natural frequencies, that is, different frequencies at which they would swing if suspended from a fulcrum and bumped. This frequency is quite close to the rate at which subjects swing them with least effort from the wrist. The authors studied how each pendulum was swung just by itself and also when the other arm had a different pendulum to swing, for various pendulum combinations. They found that when swinging two pendula, subjects had a strong tendency to entrain them to each other in simple ratios like 1:1, 2:1, 3:1 or 2:3 – depending on how the natural frequencies of the two pendula differed. Other, more complex ratios would occur occasionally (like 4 cycles on the left to 5 cycles on the right), but these were unstable and tended to slip quickly over to simpler ratios, like 2:3 or 1:1.

This kind of self-entrainment, where one arm entrains the other, is not restricted to cyclic activity like swinging a pendulum or waving a finger. Even a single, one-time gesture, if performed by two hands, tends to exhibit entrainment between the hands. Kelso, Southard and Goodman (1979) asked subjects to perform easy and hard reaching gestures with one or two arms, and to do so as quickly as they could without making many errors. Subjects sat at a table and put, for example, their right index finger on a spot with a touch sensor. Then, on a signal, they reached to their right along the table to touch either a nearby target area with a large touch sensor or else a small, more distant target area. Of course, subjects could perform the short, easy reach quite a bit faster than the harder, more distant reach with either hand. But when subjects were asked to perform these two different reaches with both hands simultaneously (from the same start signal), the easy reach was strongly slowed down by the harder one. Both gestures started and ended together and took the same amount of time. It seemed that each hand entrained the other in the ratio 1:1 for the duration of the reach. Kelso et al. interpret this as evidence about the attractor structure of the dynamical system that provides the cognitive mechanism coordinating the two gestures. Coordinating the two reaches so that they have the same duration is apparently easier and more reliable than allowing (or forcing) each gesture to be independent of the other.

It is not claimed, of course, that this kind of primitive small-integer coupling is the only way the two arms can be moved. With practice and appropriate motivation, subjects could presumably learn many possible relations between the arms. But 1:1 mutual entrainment of the arms seems to be one of the simplest ways for humans to perform novel coordinated acts. Other examples of the strong tendency toward self-entrainment may be observed when learning to play musical instruments that require using both hands – that is, instruments like guitar, violin, saxophone and drums (as opposed to a trumpet or trombone). On these instruments, the novice must learn to control the phase relations of the two hands appropriately. To produce an extended trill on a bongo drum, for example, the two hands must be kept at opposite phase even at very high rates. For novice drummers, however, the hands tend to slip over to zero phase lag (where they hit the drum head simultaneously). In the case of string instruments like the guitar or mandolin, the fingers of the left hand must clamp onto the fingerboard just before the plectrum strokes the string with the right hand. For a novice player, there is a tendency for the left-hand finger to slap down on the string simultaneously with the plectrum's snap across the string. Of course, you don't get a good sound in this case, because the plectrum excites the string while it is still partially damped by the soft flesh of the fingertip. If they keep practicing, learners eventually get the phase offset correct and can then produce fast runs and arpeggios. To play the piano in various styles, there may be rather different phase relationships that are typical of each genre. Compare playing a military march on the piano, where a feature of the style is that the left and right hands frequently strike the keys simultaneously, with playing boogie-woogie, where the left hand beats out a steady pattern on the main beat while the right hand operates quite independently, with comparatively few of its strokes in phase with the left hand. Getting the knack of such a style of performance requires decoupling the two arms from each other. It seems that musical performance skills frequently involve careful control over self-entrainment by parts of the body.

Another characteristic of the self-entrainment of separate gestures is that any similarity between the gestures themselves tends to be automatically increased or exaggerated. For example, the old task of rubbing the tummy while patting one's head is notoriously tricky without practice. One reason it is hard is that, although each gesture by itself – the rotary gesture and the pat – is an easy and familiar skill, one discovers when performing them together that they have very similar natural frequencies (at least if you are moving the whole arm at the elbow in both cases). But the similarity of frequency leads to mutual entrainment, which causes the different hand gestures to interfere with each other. In contrast, if the frequencies are very different – for example, if you rest your arm on a table and tap one finger rapidly while rubbing the belly slowly – there does not appear to be much interference (although if you keep it up, the two will still tend to fall into some regular 1:n harmonic ratio with each other). The novice guitar player has a similar difficulty; strokes with the left and right hand get merged or confused with each other. These familiar examples demonstrate further ways in which one body part entrains another.

Theoretical models of these phenomena invariably employ dynamical models of coupled oscillators (Yamanishi, Kawato and Suzuki, 1979; Kelso, 1995). Each gesture is described by an equation specifying a vector field within which the system state moves, but the equation for each limb must include a term reflecting the current state of the other limb. It is these `coupling terms' in the oscillatory equations that account for how each oscillator affects the other. For example, the `sine circle map' (Glass and Mackey, 1988) shows why simple rational relations (e.g., 1:1, 1:2, 2:3, etc.) should be more stable than other relationships. The state space of two oscillators is a torus, with one circle for each oscillator; the attractor trajectories of coupled oscillators are closed paths on the torus surface (Abraham and Shaw, 1982).
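The mode locking captured by the sine circle map is easy to demonstrate numerically. The following minimal sketch (in Python; the parameter values are my own, chosen only for illustration) iterates the map theta[n+1] = theta[n] + omega - (k/2*pi)*sin(2*pi*theta[n]) and estimates the winding number, the mean rotation per iteration. With the coupling k turned on, the winding number sticks at the simple rational value 1:2 even as the uncoupled frequency ratio omega is varied around it – precisely the preference for small-integer ratios described above.

    import numpy as np

    def winding_number(omega, k, n_iter=2000, n_skip=500):
        # Iterate the sine circle map
        #   theta[n+1] = theta[n] + omega - (k / (2*pi)) * sin(2*pi*theta[n])
        # and estimate the mean rotation per step (the winding number).
        theta = 0.0
        total = 0.0
        for i in range(n_iter):
            step = omega - (k / (2 * np.pi)) * np.sin(2 * np.pi * theta)
            theta += step
            if i >= n_skip:          # discard the initial transient
                total += step
        return total / (n_iter - n_skip)

    # Sweep the uncoupled frequency ratio near 1:2.  With k = 0 the
    # winding number simply equals omega; with k = 0.9 all of these
    # values lock onto the same plateau at 0.5 -- mode locking.
    for omega in [0.48, 0.49, 0.50, 0.51, 0.52]:
        print(round(omega, 2), '->', round(winding_number(omega, k=0.9), 3))

Sweeping a much wider range of omega values produces the well-known `devil's staircase' of plateaus at other small-integer ratios.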

 

 Perceptual Self-entrainment

For the cases mentioned above, one might suppose that the coupling is due largely to the physical link between these physical oscillators: the legs and trunk in the jogging case, or the two arms in the pendulum data. The observed coupling effect might be due merely to physical forces – and thus arguably have little relevance to theories of cognition. However, nearly identical effects have also been observed where one of the oscillations involved has vanishingly small stimulus energy, so that the coupling is strictly `informational' – that is, made available by the auditory or visual system – and is not due to actual forces. In one experiment (Schmidt, Carello and Turvey, 1990), subjects swung one leg from a seated position on the edge of a table. They were asked to watch another subject sitting next to them on the table and to swing their leg either in phase or out of phase with the other person at various frequencies. At slow rates, the subjects were able to keep their phase close to the assigned values of 0 or 0.5 with respect to each other. However, they showed a strong tendency to fall from the out-of-phase pattern (one leg forward, the other back) into the in-phase pattern when the rate was increased. So, although nothing but visual information links the two systems, the behavior is exactly the same as when the novice tries to produce a trill on a bongo drum. It is also identical to what is observed in the well-studied laboratory task where subjects wag their two index fingers in phase and out of phase at various rates (Haken, Kelso and Bunz, 1985; Kelso, 1995).
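The standard model of this phase transition is the Haken-Kelso-Bunz equation, in which the relative phase phi between the two limbs obeys d(phi)/dt = -a*sin(phi) - 2*b*sin(2*phi), and the ratio b/a shrinks as movement rate increases. A minimal sketch (in Python; the coefficient values are mine, for illustration only) shows the collapse from anti-phase to in-phase:

    import numpy as np

    def relax_phase(phi0, a, b, dt=0.01, steps=5000):
        # Integrate the HKB relative-phase equation
        #   d(phi)/dt = -a*sin(phi) - 2*b*sin(2*phi)
        # (Haken, Kelso and Bunz, 1985) by forward Euler and return the
        # relative phase the system settles into.
        phi = phi0
        for _ in range(steps):
            phi += dt * (-a * np.sin(phi) - 2 * b * np.sin(2 * phi))
        return phi % (2 * np.pi)

    # Start slightly off anti-phase (phi = pi).  At a slow rate (large
    # b/a) anti-phase is stable; past the critical point (b/a < 1/4) it
    # loses stability and the system slips to in-phase (phi = 0).
    print(round(relax_phase(np.pi - 0.1, a=1.0, b=1.0), 2))   # ~3.14: anti-phase holds
    print(round(relax_phase(np.pi - 0.1, a=1.0, b=0.1), 2))   # ~0.0: slips to in-phase

The same one-dimensional equation describes the finger-wagging, leg-swinging and bongo-trill cases alike, which is exactly the point: the coupling need not be mechanical.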

Apparently it makes little difference whether the two limbs are in the same body or in different bodies (where only stimulus information could account for the coupling). The entrainment phenomena include cases of entrainment between the visual system and the motor control of limbs. From this perspective, the common tendency to tap our feet or nod our heads to music is a further example of self-entrainment, here between the auditory system and the motor system.

  The conclusion we draw from both the experimental and the anecdotal observations is that interaction based on physical forces between independent oscillators is not the primary mechanism to account for the widespread occurrence of mutual entrainment in human and animal behavior. The coupling of different kinds of oscillation in the nervous system is an intrinsic and ubiquitous property of cognitive systems.  Its fundamental role is presumably to organize and coordinate in time the activities of these disparate systems.  Given this, we might expect to see broad exploitation of this property in many other aspects of human cognition -- especially, in such complex temporal activities as speech communication.

 

Meter as Self-entrainment

The term meter, as used in music and in this essay, refers to temporal structures containing one or more levels of periodicity. The simplest case has a single level of beats with no other hierarchical levels: tap, tap, tap, tap, … In typical cases in sophisticated musical systems throughout the world, the meter is nested or hierarchical, with, e.g., 3 beats per measure and 8 measures per line of a song (which then has at least 3 levels: beats, measures and lines). Meter is a fundamental basis for most musical systems, but may also be found outside music per se. Notice that this use of the term contrasts with its use in modern phonology, where meter is a structure defined in terms of serially ordered symbols (e.g., Hayes, 1995); meter here requires continuous time.

One simple mechanism that can produce or model metrical structures is a set of coupled oscillators (Large and Jones, 1999; Gasser, Eck and Port, 1999; Todd, 1994; Port and Leary, in press). The coupling must be strong enough that the oscillators stay in proper phase with respect to each other; the beats, for example, need to have the same duration in every measure. Musical meters, then, whether 2:1 or 3:1 or (2:1 + 4:1), implicate two (or more) mutually entrained oscillators that are implemented somehow in the nervous system. Nested or hierarchical metrical structures imply a particular kind of self-entrainment – one part of the system oscillates and couples with another part of the system. If we find that metrical behavior is somehow natural or intrinsic to human speech, we will have a way to begin to understand the temporal properties of speech that is more comprehensive and global than merely adding up the durations of a string of symbol-like units.
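To make the idea concrete, here is a minimal sketch (in Python; the frequencies and coupling strength are illustrative choices of mine, not parameters from any of the cited models) of a slow `measure' oscillator and a faster `beat' oscillator coupled in a 3:1 relation. Even when the beat oscillator's natural frequency is slightly more than three times the measure frequency, the coupling pulls the pair into an exact 3:1 lock – a bare-bones model of a nested, two-level meter:

    import numpy as np

    def cycle_ratio(f_slow=1.0, f_fast=3.05, k=0.5, dt=0.001, t_end=20.0):
        # Two phase oscillators: a slow 'measure' cycle and a faster
        # 'beat' cycle, coupled through a 3:1 phase-difference term.
        th1, th2 = 0.0, 0.0
        for _ in range(int(t_end / dt)):
            dth1 = 2 * np.pi * f_slow + k * np.sin(th2 - 3 * th1)
            dth2 = 2 * np.pi * f_fast + k * np.sin(3 * th1 - th2)
            th1 += dt * dth1
            th2 += dt * dth2
        return th2 / th1        # beats completed per measure completed

    print(round(cycle_ratio(), 3))       # ~3.0: locked at 3 beats per measure
    print(round(cycle_ratio(k=0.0), 3))  # ~3.05: uncoupled, the ratio drifts

The phase zeros of the two locked oscillators then recur in a fixed relationship, which is what makes the notion of `the downbeat of the measure' well defined.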

 

 Self-entrainment in Speech Timing

The orientation we propose to account for the global temporal structure of speech is to look for various kinds of entrainment between auditory information and motor control, as well as within the auditory and motor systems. This is compatible with a goal for phonetics and phonology of specifying the information and skills speakers and listeners employ in producing and understanding speech in various languages. It is not compatible, however, with such additional assumptions of modern phonology as that this information will have an essentially static form consisting of segments and features organized into static hierarchical trees. Although linguists have long sought support for static units of speech (Jakobson, Fant and Halle, 1952; Chomsky and Halle, 1968; Ladefoged, 1972), phoneticians have noted many difficulties with such attempts (e.g., Lisker and Abramson, 1971; Port and Dalby, 1982; Port, Cummins and McAuley, 1995). It seems likely that by addressing continuous time directly, using dynamical models and notions like entrainment, we will in time find a clearer understanding of phonetics and phonology than by trying to avoid time by looking only for static segments and structures. Sound segments may not be the simplest units of speech to understand after all.

What kind of temporal events might be simple? What kind would be easy to build a recognizer for? The traditional hunch (Chomsky and Halle's hunch, of course) was that spectrum slices (rather like the distinctive features in Preliminaries to Speech Analysis, Jakobson, Fant and Halle, 1952; Blumstein and Stevens, 1979) would be simple to build a recognizer for. Events that last longer than this, they guessed, would be harder to recognize and, in any case, would require specification in terms of these atoms in order to be defined. In contrast, the new suggestion is that the simplest temporal event to recognize is a basic repetitive event, an `auditory same' that recurs on a regular schedule. The most primitive `same' is likely to be an energy onset. Such a recurring identity, if supported by an oscillating predictor (Large and Jones, 1999; McAuley, 1998), might serve the function of dividing time into primitive time intervals.
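What might such an oscillating predictor look like? The sketch below (in Python) is my own toy version in the spirit of the adaptive-oscillator models of Large and Jones (1999) and McAuley (1998), not a reimplementation of either: an internal cycle predicts the time of the next onset, and its period is nudged by a fraction of each prediction error.

    def oscillating_predictor(onsets, period0, eta=0.3):
        # A toy entraining oscillator: it predicts when the next onset
        # should occur and adapts its period by a fraction (eta) of each
        # prediction error.  Phase is simply re-anchored at every onset,
        # a strong simplification of the published models.
        period = period0
        expected = onsets[0]
        predictions = []
        for t in onsets:
            predictions.append(expected)
            error = t - expected       # positive if the onset came late
            period += eta * error      # adapt the internal cycle length
            expected = t + period      # predict the next beat
        return predictions, period

    # A slightly irregular pulse train with a ~0.6 s underlying period:
    onsets = [0.00, 0.61, 1.19, 1.80, 2.41, 3.00]
    preds, period = oscillating_predictor(onsets, period0=0.50)
    print(round(period, 3))   # converging toward ~0.6 s

Once entrained, such a predictor divides time into beat-sized intervals even across onsets that are missing or displaced, which is what makes it a candidate primitive for temporal perception.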

A number of kinds of time-like units have been proposed for speech in various languages: inter-stress intervals in English, the mora in Japanese, the bisyllabic foot in Finnish and Estonian. The notorious difficulty with these units is that they are not as regular as one would hope; certainly they could not be called `isochronous'. This has discouraged many phoneticians from looking for temporal regularities (van Santen, 1996). But, of course, what is easy or hard to find will depend on what method is employed to do the measurements. One thing that is clear by now is that a clock measuring in milliseconds is probably not a very appropriate or useful mechanism (see Port, Cummins and McAuley, 1995).

One kind of evidence that would support the notion of meter in speech would be the discovery that certain syllable onsets (e.g., moras or stressed syllable onsets) are evenly spaced in ordinary spoken prose. Although there is evidence in support of regular spacing of moras in Japanese (Port, Dalby and O'Dell, 1987; Han, 1994), attempts to find isochronous interstress intervals in English (as proposed, e.g., by Pike, 1945 and Abercrombie, 1967) have not been very successful (Bolinger, 1965; Lehiste, 1977; Dauer, 1983). Finding a simple, single-level meter implies self-entrainment only if we note that the speech gestures themselves (e.g., by the tongue, lips, jaw and velum) are entrained to the oscillator that generates (or perceptually models) the abstract periodicity.

A second place to look for meter, especially the nested or hierarchical kind, is in songs or poetry (see, e.g., Boomsliter and Creel, 1977). Poetry is typically nested, using beats, hemistichs, lines and stanzas. Notice that if we are looking for temporal effects, only actual performances of poems will provide relevant data, not the kind of symbolic descriptions of performance (such as phonetic or orthographic transcriptions) customarily used by linguists. Similarly, to look seriously at temporal events requires differentiating many styles of speech, since any piece of text, whether poetry or prose, can be pronounced with a wide range of possible rhythmic and intonational styles. The `linguistic competence' for speech rhythm cannot be investigated with symbolic descriptions; it intrinsically requires audio recordings or some other continuous-time format for analysis.

  Aside from music and poetry recordings, a simple speaking task that is easily amenable to laboratory work is to create an artificial speech style involving repetition of a short piece of text.

It turns out that such repetition, especially if supplemented with an external periodic stimulus, strongly encourages entrainment between two hierarchical levels of speech timing. To propose an extreme case, we might ask subjects to pound a large hammer on a block of wood while repeating some phrase: ``He POUNDed the HAMmer, he POUNDed the HAMmer...''. Or speakers might be marched along a treadmill while repeating a phrase: ``One, two, three, four; one, two, three, four, …''. If we found speech timing to be easily entrained to such gross nonspeech actions, no one would be surprised, although its relevance for natural language might be questioned. But what if we take a gentle approach and ask subjects to sit still and simply repeat a phrase over and over? Subjects might hear a metronome that periodically signals them to produce some phrase or sentence. Will we then still find that prominent events in the phrase – such as the stressed syllables that initiate metrical feet – tend to be located at harmonic fractions (that is, 1/2, 1/3, 2/3, 1/4, 3/4, etc.) of the repetition cycle? The metronome signal provides a periodic event that might tend to entrain any harmonically related periodicities within the produced speech. If this kind of self-entrainment should turn out to occur easily (or even unavoidably), then we might (a) conclude that the simple repetition task has induced metrical performance of speech, and (b) interpret this effect as evidence that normal speech production typically employs timing mechanisms that fundamentally resemble oscillatory or quasi-oscillatory systems. We reason that only if such dynamical, oscillator-like behavior were implicit in the temporal control system of normal conversational speech could a simple periodic task lead immediately to entrainment at harmonic frequencies.
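In analysis terms, the question reduces to where prominent onsets fall as a fraction of the repetition cycle. A small sketch of that computation (in Python; the function names and the measurement values are hypothetical, mine for illustration):

    def cycle_phase(onset, cycle_start, period):
        # Phase of a syllable onset within the repetition cycle,
        # expressed as a fraction in [0, 1).
        return ((onset - cycle_start) / period) % 1.0

    def nearest_harmonic(phase, fractions=(1/4, 1/3, 1/2, 2/3, 3/4)):
        # The simple harmonic fraction closest to an observed phase --
        # the quantity histogrammed in speech cycling analyses.
        return min(fractions, key=lambda f: abs(phase - f))

    # Hypothetical trial: the cycle starts at 0.00 s with period 1.40 s,
    # and the onset of a stressed syllable is measured at 0.71 s.
    phi = cycle_phase(0.71, 0.00, 1.40)
    print(round(phi, 3), '-> nearest harmonic fraction:', nearest_harmonic(phi))

If production were unconstrained, these phase values should be spread evenly; clustering near the harmonic fractions is the signature of entrainment.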

 

Rhythmic Patterns in Speech

`Harmonic Timing Effect.’

Speech cycling refers to a broad class of tasks in which subjects repeat pieces of text over and over. The perceptual center (P-center) experiments explored in the 1980s are one example, where the focus was on repetition of brief monosyllabic words (Morton, Marcus and Frankish, 1976; Fowler, 1983; Scott, 1993). The speech cycling technique was further developed by Port, Tajima and Cummins (e.g., Port, Cummins and Gasser, 1995; Cummins and Port, 1998; Cummins, 1997; Tajima, 1998; Tajima and Port, in press) as a tool for studying global speech timing. The most fundamental result of our experiments has been the repeated demonstration that human speakers have a very strong tendency, when repeating a piece of text, to choose (or discover) a rhythmical performance style that places prominent syllables at harmonic fractions of the repetition cycle. Thus, for a short text fragment like `Take a pack of cards', containing just three prominent syllables that can receive a pitch accent, English speakers will tend to repeat it in one of only three patterns. In the most common pattern, take and cards receive pitch accents and (calling the interval from one take to the next the repetition cycle) the onset of the word cards occurs halfway through the repetition cycle: the cycle has been divided into halves. To perform this pattern, readers are encouraged to tap their left hand on the knee for each take and the right hand on the knee for each cards, simply alternating the hand taps evenly while saying the phrase. In the second repetition pattern, take, pack and cards all receive accents and the repetition cycle is divided into thirds, so that the onsets of take, pack and cards are equally spaced. Readers should be able to find this mode by repeating ``1, 2, 3; 1, 2, 3; Take a pack of cards; Take a pack of cards; …''. Readers, no matter what their native language, should find it easy to reproduce these two patterns. The third pattern is also a three-beat pattern, but with the beats at a much slower tempo. Here take is again on beat 1 and cards occurs on beat 2, while beat 3 is silent, leaving a fairly long gap between repetitions.

The powerful preference for just these three timing patterns was verified (Cummins and Port, 1998) in experiments where subjects were asked to repeat phrases similar in structure to the one above while placing the final stressed syllable onset at randomly selected temporal locations within the cycle, from 20% to 70% of the way through it. The target phase angle was indicated by a computer-generated tone at the randomly chosen location. The cycle period was adjusted so as not to strain speaking rate. Subjects were asked to put the first word on the higher-pitched metronome pulse and the final word (always monosyllabic) on the lower-pitched metronome pulse (which was randomly located from trial to trial). On each trial they cycled the same pattern 8 times in a row on a single exhalation (since we found that pauses for breathing interfere with the timing).
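The metronome schedule implied by this design is easy to reconstruct. A sketch (in Python) following the parameters given in the text and in the caption to Figure 1 below; the function and variable names are my own:

    import random

    def make_trial(target_phase, hl_ms=700, n_cycles=8):
        # One speech-cycling trial: an H tone marks each cycle onset and
        # an L tone marks the target location for the final word.  The
        # H-to-L interval is fixed at 700 ms, so hitting a target phase
        # of, say, 0.35 requires a cycle period of 700 / 0.35 = 2000 ms.
        period = hl_ms / target_phase
        h_times = [i * period for i in range(n_cycles)]
        l_times = [h + hl_ms for h in h_times]
        return h_times, l_times

    target = random.uniform(0.2, 0.7)   # target phase, randomized per trial
    h, l = make_trial(target)
    print(f"target phase {target:.2f} -> cycle period {h[1] - h[0]:.0f} ms")

Holding the H-to-L interval constant in milliseconds is what lets the target phase vary from trial to trial without requiring any change in the speaking rate of the test phrase.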

 

 

Figure 1. Speech cycling results from Cummins and Port (1998). A corpus of 30 phrases with identical prosodic structure was used. Each phrase had the form ``X for a Y'' with stop-initial monosyllabic words, like beg for a dime. Subjects repeated the phrases 8 times per trial to a metronome with alternating high and low (H and L) tones. The H marked the beginning of the cycle and of the phrase (e.g., beg) and the L marked the target location for the final word (e.g., dime). There were 14 trials for each subject. The interval from H to L was fixed at 700 ms and the interval from L to H was varied from trial to trial, such that the H-L interval varied randomly over the range 0.2 to 0.7 of the H-H interval without requiring any change in speaking rate for the test phrase. Subjects were instructed to line up the first word with the high tones and the last word with the low tones. If there were no rhythmic constraints on speech production, subjects should have produced the onset of the final stress near the target phase angle (except for the effects of noise), and the distribution should appear just as flat as the distribution of target times. Instead, the histograms are strongly trimodal, with modes near 0.33, 0.5 and 0.66, except that subject KA showed only two modes, at 0.3 and 0.5. (Used with permission.)

 

This is the basic ``harmonic timing'' result. Subjects have an extremely strong preference for locating prominent syllable onsets near harmonic fractions of the repetition cycle. The three modes here reflect the three patterns of the phrase demonstrated above. The effect can be found not only in experimental speech tasks, but also when crowds chant (e.g., ``We want Willie; We want Willie, …'', or the American children's singsong chant ``nyaa nyaa nya-nyaa nyaa'') or when a piece of familiar text is repeated by a group (e.g., the Pledge of Allegiance, or `responsive reading' and group prayer in a religious service), and it is found occasionally in spontaneous speech, especially by professional speakers like preachers and news broadcasters (notably in paragraph punch lines or in sign-offs).

For our purposes here, the thing to notice is that harmonic timing is probably a particular instance of self-entrainment, where one cognitive oscillator cycles at the phrase repetition rate and a second oscillator, coupled tightly with the first, cycles at either 2 or 3 times the phrase repetition rate. These two oscillators must be coupled, since they go through phase 0 at the same time on every cycle of the slow one and every second (or third) cycle of the faster one. Presumably, phase 0 of both oscillators attracts syllable onsets, especially stressed syllable onsets (at least in English). Indeed, if a syllable is not stressed but occurs at a harmonic fraction, it tends to acquire a stress. For example, compare 2-beat and 3-beat repetitions of `Take those cards'. In the 3-beat reading, the word those tends to be much more strongly accented than in the 2-beat reading. Even the intonation of the phrase is influenced by the meter selected.

Is this behavior a universal of human speech? I strongly suspect that it is. We have found evidence of a preference for these patterns in Japanese (Tajima, 1998; Tajima and Port, in press), Arabic (Tajima, Zawaydeh and Kitahara, 1999) and Finnish (unpublished data). This does not mean, however, that speakers of all languages behave exactly the same. Our informal and anecdotal results suggest, for example, that speakers of Japanese and Finnish find it quite natural to repeat an English 5-syllable phrase like Toss those books right back with a 5-beat pattern (with the 5 syllable onsets equally spaced), whereas English speakers find that pattern awkward and difficult. English speakers will try to repeat it with either 3 beats (on Toss, books and back), 2 beats (on Toss and back) or 6 beats (with all words getting equal stress but with a delay for the silent 6th beat between back and Toss).

Rhythmic production of speech occurs in song, poetry and chant in seemingly every human culture. And even without trying to speak rhythmically, speakers often do so anyway – usually without noticing what they are doing[2].

 

Cross-language Experiments

Recently, in order to explore the ways in which speakers of different languages exhibit the harmonic timing effect, Keiichi Tajima and I attempted to develop an analogous set of tasks for speakers of both Japanese and English (Tajima and Port, in press).  We hoped to see both similarities and differences in the speakers’ behavior in the two languages.

In comparing English and Japanese, it was expected that English speakers would show a greater tendency than Japanese speakers to place prominent, stressed syllables close to harmonic fractions of the repetition cycle during speech cycling. This prediction follows from the traditional description of English as a `stress-timed' language and Japanese as `mora-timed'. To test this hypothesis, phrases were constructed containing segments that should tend to push the target-syllable onset forward or backward in time, by using vowel sequences that were long-short and short-long. These syllable pairs then occurred both at the beginning of a foot (with stress on the first vowel) and in the middle of a foot (with stress on the second vowel). It was expected (a) that the second vowel of each pair would begin earlier when it was the longer one and later when it was the shorter, and (b) that the effect of reversing the vowel lengths would be reduced when the second vowel began on a beat of the rhythmic pattern of the phrase. The upper panel of Figure 2 shows one of four word sets that English speakers were asked to read with a waltz-like 3-beat pattern. Since /aI/ is normally much longer in duration than /I/, it was predicted that the /aI/ vs. /I/ reversal would have less effect in C vs. D than in A vs. B. The reason is that the release of the /g/ in C and D is a stressed, syllable-initial event, while in A and B it begins just another foot-internal syllable.

The Japanese materials were trickier to design, since Japanese does not have stress. So we created especially prominent syllables by using two-syllable, two-mora words that had a pitch accent on the first syllable – as shown in the lower panel of Figure 2. Several researchers (Poser, 1984; Tajima, 1998) have found evidence that Japanese has a preference for a bimoraic unit. We reinforced this natural unit with word boundaries and with the word-initial pitch accent, to create a situation resembling English stress at least to the extent that such a syllable might be more strongly attracted to a perceptual beat than the others. Again, the items in A and C differ from those in B and D by reversing the order of /i/ and /a/ (and it is well known that /i/ has somewhat shorter duration than /a/), while A and B differ from C and D by having the reversed order within the same bisyllabic unit rather than across boundaries between bisyllabic units. Thus, in Japanese as well as in English, we predict that A and B will be more affected by the vowel duration asymmetry than C and D. Further, although comparing two languages is always problematic, since true minimal pairs cannot be examined, we might expect English to show more effect of the vowel reversal in C and D, since the durational difference between English /aI/ and /I/ is much larger than that between Japanese /i/ and /a/. On the other hand, since English is the archetypal `stress-timed' language (Abercrombie, 1967), we might expect it to be less influenced by the vowel reversals in C and D than Japanese, the archetypal mora-timed language (Port, Dalby and O'Dell, 1987).

In the experiment, native speakers of American English (n = 14) and Japanese (n = 13) were instructed to repeat these phrases in a waltz-like 3-beat pattern, in time with a metronome signal that had a period of 1.2 seconds for the English speakers and 1.1 or 1.0 s for the Japanese speakers (since piloting showed the task was easier at these rates). They cycled each phrase in trials of 8 repetitions on a single breath. With 3 word sets in each condition and 6 tokens measured from each trial, this yields roughly 300 measurements per condition.

Figure 2. Examples of the English phrases used in the speech cycling task, with dotted lines marking the 3 beats. A and B differ in the order of shorter and longer vowels, just as C and D do. But in A and B the prominent (or stressed) syllable is the first member of the reversed pair, so the stress precedes the vowel reversal; in C and D, the vowel reversal spans the stress location. It was predicted that the vowel switch would have more effect in A vs. B than in C vs. D. The Japanese materials have analogous properties except that, Japanese lacking stress, word-initial and bimora-initial syllables with initial pitch accents simulate the effect of stress in English.

 

We used a semi-automated procedure to locate the perceptual `beats', that is, approximately the vowel onsets of the target syllables (approximating the P-center measure of Scott, 1993; Cummins, 1997; Cummins and Port, 1998). Displacements were measured in milliseconds and are presented both in ms and as fractions of the repetition cycle. We found that, as predicted, both languages resisted allowing the vowel reversal to move the onset of the second vowel when that onset was at the beginning of a stress foot in English, or of a bimoraic unit and word in Japanese. As Table 1 shows, the onset of the target syllable vowel (after the /g/) was on average 66 ms later in A than in B, but only 35 ms later in C than in D. In Japanese, the second vowel (after the second /k/) began 38 ms later in baki than in bika when the first syllable was on the beat, but when the beat was on the second syllable, the second vowel began only 17 ms later. In both languages, a prominent syllable onset occurring on one of the beats of the 3-beat pattern could not be moved by more than about 3% of the repetition cycle, while the boundaries that did not lie on a simple harmonic fraction moved over 4% in Japanese and 8% in English. These displacement measures show that a prominent syllable is attracted toward harmonic fractions in both languages.

Table 1

                                            Target within        Target between
                                            a foot (A-B)         feet (C-D)

    English     milliseconds                    66 ms                35 ms
                % metronome period              (8.2)                (3.2)

    Japanese    milliseconds                    38 ms                17 ms
                approx. % metronome period      (4.3)                (1.6)

Table 1. Mean displacement of target syllable onset due to order reversal of the short and long syllables, expressed in ms and in approximate percent of the metronome period, across all speakers.
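The displacement measure itself is simple: the difference between the mean target-vowel onsets in the two vowel orders, expressed in ms and as a percentage of the metronome period. A sketch (in Python) with hypothetical onset values, not the actual data from the experiment:

    def displacement(onsets_a, onsets_b, period_ms):
        # Mean shift of the target-vowel onset between the two vowel
        # orders (e.g., condition A vs. B), in ms and as a percentage
        # of the metronome period -- the measure reported in Table 1.
        mean_a = sum(onsets_a) / len(onsets_a)
        mean_b = sum(onsets_b) / len(onsets_b)
        d = mean_a - mean_b
        return d, 100.0 * d / period_ms

    # Hypothetical onset measurements (ms into the cycle) for one phrase
    # pair, with the 1200-ms metronome period used for English speakers:
    a = [512, 498, 507, 520]    # tokens of the long-short order
    b = [446, 439, 452, 444]    # tokens of the short-long order
    d_ms, d_pct = displacement(a, b, period_ms=1200)
    print(f"displacement: {d_ms:.0f} ms = {d_pct:.1f}% of the metronome period")

A small percentage means the onset stayed pinned near its harmonic fraction despite the vowel-duration manipulation; a large one means the onset was free to move.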

 

The effects of vowel order were statistically significant in both English and Japanese, confirming our prediction that the intrinsic duration of vowels would affect syllable timing. We also found that prominent syllable onsets were more strongly attracted to the harmonic fractions of the phrase repetition cycle than less prominent syllables were. The attractiveness of the temporal location one third of the way through the cycle is compatible with the hypothesis that a pair of coupled oscillators at frequencies 1 and 3 influences the timing of speech production in both languages. The results suggest that speakers of both languages entrain their speech to these abstract oscillators by allowing prominent syllables to be attracted to the phase zeros of the oscillators. On the other hand, the fact that English speakers appear to be more strongly attracted to this time location than Japanese speakers might reflect a difference in timing constraints between the two languages. Of course, it might instead reflect the fact that English /aI/ and /I/ differ more in duration than Japanese /i/ and /a/. However, a second experiment reported in Tajima and Port (in press) makes the second interpretation much less likely than the first.

 

Concluding Discussion 

It appears that speech timing at the phrase and foot level has many similarities to other kinds of motor behavior. One of the most striking of these similarities is self-entrainment. It seems that self-entrainment, the tendency of one cyclic pattern to draw other patterns into tight temporal relationships such as harmonically related temporal ratios, may be a natural stylistic feature of human (and occasionally animal) activity – occurring in both motor control and perception, and in both speech and nonspeech activity. This behavioral property may be a natural and nearly unavoidable temporal constraint on systems of oscillators that can influence each other.

Thus it is likely that oscillator-like systems – or systems that can naturally exhibit oscillatory behavior – play as central a role in the timing control of speech as they do in the limbs, and probably in perception as well as in motor control. This may be true not only of repetitive, cyclical patterns, but of many apparently noncyclic ones as well. If so, then choosing to focus first on the cyclic performance of speech should give us insight into the noncyclic performance of speech as well. Our expectation is that related effects will be found in the speech of most (and perhaps all) natural languages and will turn up in some form elsewhere than in prosody – possibly even in syntax and other cognitive skills under certain conditions.

Returning finally to the issue of the goals of phonetic and phonological theory, there appear to be some kinds of universals and some kinds of linguistic knowledge that are not plausibly described by inventories of symbols that can be manipulated in serial-order time – like beads on a string. There appears to be much else going on in the grammars of languages and in the universals of linguistic behavior that is overlooked and rendered invisible by the insistence that language be treated only as a formal system.

 

    References

Abercrombie, David (1967) Elements of General Phonetics. Aldine Publishing Co., Chicago.

Abraham, Ralph and Christopher Shaw (1982) Dynamics – A Visual Introduction. Vols. 1-4. Ariel Press, Santa Cruz, CA.

Bernstein, N. (1967) The Coordination and Regulation of Movement. Pergamon, London.  

Blumstein, Sheila and Kenneth N. Stevens (1979) Acoustic invariance in speech production: Evidence from measurements of the spectral characteristics of stop consonants. J. Acous. Soc. Amer. 66, 1001-1017.

Bolinger, Dwight (1965) Pitch accent and sentence rhythm. In Abe, Isamu and Tetsuya Kanekiyo (eds.) Forms of English: Accent, Morpheme, Order. Harvard University Press, Cambridge, pp. 139-180.

Boomsliter, P. C. and W. Creel (1977) The secret springs: Housman's outline on metrical rhythm and language. Language and Style 10, 296-323.

Bramble, D. M. and D. R. Carrier (1983)  Running and breathing in mammals. Science 219, 251-256.   

Chomsky, N. and M. Halle (1968) The Sound Pattern of English. Harper and Row, New York.

Clark, Andy (1997) Being There: Putting Brain, Body, and World Together Again. MIT Press, Cambridge, MA.

Collier, Geoffry L. and Charles E. Wright (1995) Temporal rescaling of simple and complex ratios in rhythmic tapping. Journal of Experimental Psychology: Human Perception and Performance, 602-627.

Cummins, Fred. (1997). Rhythmic coordination in English speech: An experimental study. Doctoral dissertation, Indiana University.  

Cummins, Fred & Robert F. Port. (1998). Rhythmic constraints on stress timing in English. Journal of Phonetics, 26, 145-171.

Dauer, Rebecca (1983)  Stress-timing and syllable-timing reanalyzed. J. Phonetics 11, 51-62.   

Fowler, Carol (1983) Converging sources of evidence on spoken and perceived rhythms of speech: Cyclic production of vowels in monosyllabic stress feet. Journal of Experimental Psychology: General 102, 386-412.

Gasser, M., D.Eck and R. Port (1999) Meter as mechanism: A neural network that learns metrical patterns. Connection Science 11, 187-216.

Glass, Leon and Michael Mackey (1988)  From Clocks to Chaos.  Princeton Univ. Press, Princeton, NJ

Haken, H., Kelso, J. A. S., & Bunz, H. (1985). A theoretical model of phase transitions in human hand movements. Biological Cybernetics, 51, 347-356.

Han, Mieko S. (1994). Acoustic manifestations of mora timing in Japanese. Journal of the Acoustical  Society of America, 96, 73-82.

Hayes, Bruce. (1995). Metrical stress theory: Principles and case studies. Chicago: University of Chicago Press.

Jakobson, R., G. Fant and M. Halle (1952) Preliminaries to Speech Analysis. MIT Press, Cambridge, MA, 1952/1963.

Kelso, J. A. Scott (1995) Dynamic Patterns: The Self-Organization of Brain and Behavior. Bradford Books/MIT Press, Cambridge, MA.

Kelso, J. A. Scott, Mingzhou Ding and Gregor Schöner (1993) Dynamic pattern formation: A primer. In L. B. Smith and E. Thelen (eds) A Dynamic Systems Approach to Development: Applications. (Bradford/MITP; Cambridge, MA), pp. 13-50.

Kelso, J. A. S., D. Southard and D. Goodman (1979) On the nature of human interlimb coordination.  Science 203, 1029-1031.  

Ladefoged, Peter (1972) A Course in Phonetics. Harcourt Brace Jovanovich.

Large, Edward and Mari Jones (1999) The dynamics of attending: How we track time varying events. Psychological Review 106, 119-159.

Lehiste, Ilse (1977) Isochrony reconsidered. J. Phonetics 5,253-263.  

Lisker, Leigh and Arthur Abramson (1971) Distinctive features and laryngeal control.  Language 47, 767-785.  

Manaster Ramer, Alexis (1996a) A letter from an incompletely neutral phonologist. J. Phonetics 24, 477-489.

Manaster Ramer, Alexis (1996b) Report on Alexis' dreams – bad as well as good. J. Phonetics 24, 513-519.

McAuley, Devin (1998) Effect of deviations from temporal expectations on tempo discrimination of isochronous tone sequences. Journal of Experimental Psychology: Human Perception and Performance 24, 1786-1800.

Morton,  J., S. M. Marcus and C. Frankish (1976) Perceptual centers (P-centers).  Psychological Review 83, 405-408.

Norton, Alec (1995) Dynamics: An introduction.  In R. Port and T. van Gelder (eds) Mind as Motion: Explorations in the Dynamics of Cognition (Bradford/MITP; Cambridge, MA), pp 45-68.

Pike, Kenneth (1945) The Intonation of American English. Univ. of Michigan Press.

Port, Robert (1996) The discreteness of phonetic elements and formal linguistics: Response to A. Manaster Ramer. J. Phonetics 24, 491-511.

Port, Robert and Jonathan Dalby (1982) C/V ratio as a cue for voicing in English.  Perception and Psychophysics 2, 141-152.   

Port, Robert, Fred Cummins and Michael Gasser (1995) A dynamic approach to rhythm in language: Toward a temporal phonology.  In B. Luka and B. Need (eds)  Proceedings of the Chicago Linguistics Society, 1996 (Department of Linguistics, University of Chicago), pp. 375-397.   

Port, Robert, Jonathan Dalby and Michael O'Dell (1987) Evidence for mora timing in Japanese. J. Acous. Soc. Amer. 81, 1574-1585.   

Port, Robert and Adam Leary (2000, in press) Speech timing and linguistic theory. In Carolyn Drake (ed.) Time in Audition. Université Descartes, Paris.

Port, Robert F., Fred Cummins and J. Devin McAuley (1995) Naive time, temporal patterns and human audition. In Robert F. Port and Timothy van Gelder (eds.) Mind as Motion: Explorations in the Dynamics of Cognition. MIT Press, Cambridge, MA, pp. 339-372.

Port, Robert and Timothy van Gelder (1995) Mind as Motion: Explorations in the Dynamics of Cognition.  (Bradford Books/MIT Press, Cambridge, MA).    

Poser, William (1984) The Phonetics and Phonology of Tone and Intonation in Japanese. Unpublished Doctoral Dissertation, MIT.

Schmidt, R. C., C. Carello and M. T. Turvey (1990) Phase transition and critical fluctuations in the visual coordination of rhythmic movements between people. Journal of Experimental Psychology: Human Perception and Performance 16, 227-247.

Scott, S. K. (1993) P-centers in Speech: An acoustic analysis. Unpublished doctoral dissertation. University College, London.

Tajima, Keiichi (1998) Speech Rhythm in English and Japanese: Experiments in Speech Cycling. Doctoral dissertation, Indiana University.

Tajima, Keiichi and Robert Port (2000) Speech rhythm in English and Japanese. In John Local, editor, Papers in Laboratory Phonology VI. Cambridge University Press, Cambridge.

Tajima, Keiichi, Bushra A. Zawaydeh and Mafuyu Kitahara (1999) A comparative study of speech rhythm in Arabic, English, and Japanese. Proceedings of the XIVth International Congress of Phonetic Sciences, San Francisco, CA, pp. 285-288.

Thelen, Esther and Linda Smith (1994) A Dynamic Systems Approach to the Development of Cognition and Action. Bradford Books/MIT Press, Cambridge, MA.

Todd, Neil P. McAngus (1994) The auditory `primal sketch’: A multiscale model of rhythmic grouping. J. of New Music Research 23, 25-70.

Treffner, Paul and M. T. Turvey (1993) Resonance constraints on rhythmic movement. Journal of Experimental Psychology: Human Perception and Performance 19, 1221-1237.

Turvey, Michael T. (1990) Coordination. American Psychologist 45, 938-953.

van Gelder, Tim and Robert Port (1995) It’s about time. In R. Port and T. van Gelder (eds)   Mind as Motion: Explorations in the Dynamics of Cognition. (Bradford Books/MITP) pp. 1-44.

van Santen, J.P.H. (1996). Segmental duration and speech timing. In Yoshinori Sagisaka, Nick Campbell, & Norio Higuchi (Editors), Computing prosody: Computational models for Processing Spontaneous Speech. Springer Verlag, New York.

Yamanishi, Junichi, Mitsuo Kawato and Ryoji Suzuki (1979) Two coupled oscillators as a model for the coordinated finger tapping by both hands. Biological Cybernetics 37, 219-225.



[1] To appear in Fabrice Cavoto (ed.) The Complete Linguist: A Collection of Papers in Honor of Alexis Manaster-Ramer. (LINCOM Europa: Munich, 2000).

[2] Readers are invited to explore the author's webpage with audio examples of rhythmically spoken speech: http://www.cs.indiana.edu/rhythmsp/