Speech Timing and Linguistic Theory

 

by Robert F. Port and Adam P. Leary

Department of Linguistics, Indiana University

Bloomington, Indiana, 47405

port@indiana.edu, adamlear@indiana.edu

July 05, 2000

 

It is argued that the traditional computational or symbolic view of cognition has little choice but to assume an a priori phonetic space of discrete, static and serially ordered atomic symbols, as assumed explicitly by Chomsky and Halle (1968) and others.  It is then shown that there are several well-supported cases where these assumptions fail. First, there is a durational pattern observed in the English voicing contrast in syllable-coda position (e.g., lab/lap) where evidence shows that the relative duration of two intervals (the vowel duration and the following stop or fricative duration) is a fundamental cue for the value of the voicing feature. Since this durational ratio cannot plausibly be assigned to a universal property of phonetic implementation, it must be described as a property of English phonetics or phonology, in violation of the Chomsky-Halle assumption of universal, static phonetic features.  Second, the incomplete neutralization of the voicing contrast in syllable-final position in German (where Bunde-bunte become Bund-bunt with near-but-not-perfect neutralization to the voiceless category) shows that the discreteness property may also fail to hold.  These results imply that language processing is achieved by a system that may prefer discrete, symbol-like units, but does not require them.  Recent models of speech perception and production based on dynamical systems, such as those of Grossberg and colleagues, exhibit the appropriate characteristics to serve as a psychological foundation upon which a sound linguistic theory can be constructed without assuming that human language is an instance of a mathematical or computational system.

 

Section I: The problem of time in language.

 

No one would deny that speech is produced in time, that is, that the sentences, words, consonants and vowels of human language are always extended in time when they are uttered.  Still, many - and probably most - linguists would argue that temporal extension is not an intrinsic property of natural language and that the temporal patterns of language (other than those representable in terms of serial order) should not be expected to be relevant or revealing about language itself.  This might surprise many nonlinguists, but it reflects a fundamental and far-reaching assumption about language and even about general human cognition.  Linguists tend to assume that the temporal layout of speech is a property imposed on language from the outside, at the point where the logically static and serially ordered structures of the language itself are performed by the human body. That is, linguists typically assume that linguistic competence contains only static structures.  The tree-shaped patterns graphed on the pages of linguistics journals and the serial lines of printed text like the ones you are reading on this page are taken to be, in essential respects, good models of our actual cognitive representations. The cognitive form of language is said to have serially ordered discrete words composed from a small inventory of meaningless sound-related segments, just like a printed page.  And a tree diagram represents the same logic as the phrase structure of any sentence.

 

These cognitive symbol strings may be `implemented' in time by the linguistic `performance' system if and when linguistic structures happen to be spoken. From the standard linguistic perspective, it is the serial and hierarchical structure of language that reveals its true form. One might say that speech is language plus the filtering effects of the performance system that maps language into speech.  From the traditional point of view, speech performance is thus derivative. It is merely one possible `output mode' -- just one of several ways (along with writing) to get language from the mind out into the body and the world.  Speech, on this view, just happens to impose time on the fundamentally nontemporal structure of the language itself.

 

This point of view is basic to all 20th century structuralist views of language (de Saussure, 1916; Bloomfield, 1926; Hockett, 1954), including the Chomskyan paradigm (Chomsky, 1965; Chomsky & Halle, 1968). It assumes a fundamental distinction between the Formal and the Physical. On one hand, there is a Formal (or mental) World - the Competence World, where the serial order of hierarchies of timeless symbols provides the data structures of natural language.  Formal operations apply to these data structures sequentially, just as they apply in a derivation in formal logic or mathematics.  Given the formal nature of the structures involved, any time interval that might happen to be required for the operations to take place is merely epiphenomenal. It is not directly relevant to the formal operations themselves. On the other hand, there is a Physical World of brains and bodies belonging to speakers and listeners.  From the traditional perspective, the time-free structures of language are actually ``implemented'' and processed in time (see Scheutz, 1999 for further discussion of implementation). Such implementation processes may hold some interest, but they are in no way the natural home of human language and, in any case, are not directly relevant to it.

 

We believe this point of view is deeply mistaken. Why? Although there are many reasons (see Port & van Gelder, 1995), we will discuss just two.  The first is that the dichotomy of competence and performance, of mind and body, of formal and physical, creates a gulf that, once postulated, turns out to be impossible to span using the methods of empirical science.  This is surely one reason why linguists are not generally eager to study the methods or results of experimental psychology, speech science, neuroscience or even physics: these time-dependent fields are very difficult to reconcile with language as a pure symbol system, so they seem irrelevant. Correspondingly, this is why scientists from other disciplines frequently have difficulty understanding what linguists are doing.  Disciplines like neuroscience and much of cognitive psychology (though not all) lie across the formalist gulf from linguistics.  Thus far, no satisfactory way to bridge this conceptual gap has been found.  If one assumes that real cognitive events do not take place in space and time while real physical events do, there is no obvious way to get them together.

 

To appreciate this problem, the first step is to review some familiar properties of a symbolic system. Although linguists assume the uniformly symbolic nature of language – at all levels from phonetic segments to phonological units, morphemes, words, phrases and sentences – less attention has been paid to what properties a symbol token must exhibit.  In Western science, symbols are employed in at least three distinct domains: in mathematical reasoning, in computer hardware, and in theories of cognition. Thus, for various types of mathematical reasoning, logic uses tokens like p and q, and mathematics might use x and the integers.  In formal reasoning (doing logical proofs or long division, writing a computer program, etc.), operations are performed on symbolic structures by trained human thinkers.  When the steps are difficult, and throughout the training of practitioners, steps in a formal reasoning process are typically supported by body-external props. That is, extended formal reasoning depends on external `scaffolding' (see Clark, 1997), such as writing physical symbol tokens on paper (and, more recently, relying on the support of programs running on digital or analog computers).  In computer hardware, formal methods are automated by coding symbol tokens into bits in a digital computer. The third domain for symbolic theory lies in a particular view of the cognitive operations involved in human language and human reasoning (Chomsky, 1965; Fodor, 1975; Newell & Simon, 1972). The symbol tokens in language (and probably in general cognition) are the words and sound components of a particular language.

 

As clarified by Haugeland (1985), in order to function as advertised, the symbol tokens in any symbolic system must be, first of all, digital, that is, discretely distinct from each other and reliably recognizable by the available computational or cognitive equipment. This is essential in order for the computational mechanisms to manipulate the symbols infallibly during processing.  Second, the symbols used must be either a priori or composed from a priori components.  Some set of a priori units must be available at the time of origin of a symbolic system from which all further symbol structures are constructed. In the case of logic or mathematics, an initial set of specific units is simply postulated (e.g., the integers, the proposition p or the initial symbol S). In computing, operations on bits and bytes are executed in discrete time, but the units and primitive operations were engineered into the hardware itself, and are thus obviously a priori.  Human language and everyday reasoning are more problematic, since we don't know what their primitives are in advance. Indeed, the discovery of the list of innate primitive units is one of the primary targets of research in modern linguistics (e.g., Chomsky, 1965). (The discrete-time assumption is almost never discussed.) Most often linguists assume that the initial list of primitives includes at least units like [Vowel], [Consonant], [Noun], [Past Tense], [Sentence] and so forth.

 

The third property of symbols, although one that Haugeland did not say much about, is that they must be static.  Since symbolic or computational models always function in discrete time, it must be the case that at each relevant discrete time point (that is, at each tick of the discrete-time clock), all relevant symbolic information is available. For example, if a rule is to apply that converts apical stops into a flap, then there must be some time point at which the features that figure in the rule ([+stop], [+voice], [+apical], etc.) are all fully represented and either hold steady or are somehow constrained to synchronize with each other while the rule applies in a single step of discrete time. Thus, properties in a symbolic system cannot unfold gradually in continuous time but must, at the relevant clock tick, have some discrete symbolic value.  (Of course, there is nothing to prevent simulation of continuous time with discrete sampling, but this is not what the symbolic, or computational, hypothesis about language claims.)

 

Now, how can symbolic units with these properties be naturalized and sought in a human brain? Formal symbols assume some very nonbiological properties. It is one thing for humans to manipulate such units in a deliberative way, leaning on the support of paper and pencil so each step can be written down and checked for accuracy, and for computers to employ specialized discrete-time hardware to process symbolic structures.  But it is another matter altogether to casually assume that real formal symbol structures are actually processed in a discretized version of real time by human brains. The problem is that if we study language as a facet of actual physical human beings (rather than as a particular instance of a discrete-time `Platonic heaven') then its processes and its products must have some location and extent in real time and real space.  Truly natural language must be visible and accessible to scientific research methods that investigate events in space and time – real events in real time.  If there is temporally discrete behavior of the human brain (which is certainly quite possible), then we must surely study this empirical phenomenon directly, by gathering our data in continuous time, in order to discover just where temporal discreteness can be observed and how the discrete-time performance is achieved.  Simply assuming a sharp a priori divide between language as a uniformly serial-time structure and speech as a real-time event (as claimed in Chomsky's competence/performance distinction) is very risky.  And, in our view, there is a fair amount of evidence that it is simply false.

 

The second reason for rejecting the view that language is essentially formal is that it seems clear that, from a biological viewpoint, language is fundamentally and essentially a spoken medium, not a written one. Contrary to the practice of linguistic theory, it is the written and orthographic versions of language that are derivative from speech.  It is especially in written language that the symbol-like characteristics – near-discreteness, timelessness and closed inventories of symbol tokens – are most pronounced. Yet written (and edited) language is based upon historically recent, culture-dependent methods (dating back only a few thousand years), using cognitive processes that are partly dependent on literacy, logic and mathematical generalization. Even today fewer than half the human population is literate, and only a minute fraction could appreciate the meaning of a diagram of the structure of a sentence or a syllable, even if one spent some time explaining these images to them.

 

These seem to us to be real problems for the traditional view – ones that cannot simply be brushed off with assertions like ``We don’t really know much about how the brain works anyway.’’  For example, if every human utterance exists only by being built from clearly discrete building blocks, that is, if linguistic expressions are always discrete structures assembled from linguistic atoms (rather like the words and letters on this page are composed from inventories of very limited size), then why isn't it always similarly transparent to speakers (as well as to linguists) what the data structures actually are for any utterance in any language?  

 

Of course, we agree that in many specific cases it is intuitively very clear to speakers what the components are. For an English sentence like `Don kicked that', the morphemes are transparently just the set Don, kick, -ed and that.  And each of the phonemes that spell the cognitive representation of these morphemes seems to be a simple, atom-like sound unit [dankikt...]. Furthermore, at the syntactic level as well, there is a fairly clear discrete phrase structure that represents [Don] as a unit in parallel with [kicked that] (serving as subject and predicate respectively), and we intuitively appreciate that the bond between [kick] and [ed] is much tighter than the bond between [kicked] and [that].

 

But every linguist also knows that very often it is not obvious how many parts there are in a stretch of speech nor exactly what the parts are. Thus, it's not clear how many morphemes there are in such a simple word as `strawberry', or how many phonological parts there are in `chive'.  The piece `-berry' looks like an obvious morpheme, but what about `straw-'? It doesn't seem to be related semantically to `wheat straw' or `soda straw', nor to contribute any specific meaning except as a kind of diacritic on `-berry' (to keep strawberry distinct from raspberry and elderberry).  But if straw is only a diacritic, then why chop off berry in the first place?  Perhaps strawberry should be monomorphemic (and listed independently in the cognitive lexicon) despite the fact that berry also occurs both as a single word and as part of other words with related meanings.  But this solution ignores the generalization that the berry in strawberry looks like, and means roughly the same as, the isolated word berry. So should this redundancy be avoided? Well, it depends on the analyst's theoretical assumptions about grammars, of course.

 

And is `chive' made of 3 cognitive sound units, or 4 or 5 (since the initial consonant and the medial vowel each have either two distinct parts or smooth gliding motions, both acoustically and articulatorily)? Such problem cases occur frequently in every language.  It is often very difficult to parse phrases and to isolate the segmental phonemes, morphemes and other units of spoken or written language.  In these challenging but very typical cases the morphological analysis is not obvious at all. Instead, linguists must bring some theoretical assumptions to bear in order to justify one analysis over others. Two of the very first assumptions seem to be the Symbolic Language hypothesis (SL) and the Atomic Inventory hypothesis (AI).

1.  The Symbolic Language Hypothesis (SL): Every utterance in every language is constructed entirely of discrete, timeless symbols. Complex units are composed of combinations of simpler units in a series of levels.

 

2.  The Atomic Inventory Hypothesis (AI): All languages select their phonological (and other primitive) symbols from a universal discrete set – a set that is not very large.

All significant differences within the sound systems of any language as well as all differences between languages are supposed to be representable in this alphabet of symbols. Languages, it is claimed, can never differ in the continuous-valued implementation of these minimal symbols.  If they could, then languages would not be entirely symbolic.  (Notice that English orthography, like most other alphabetic orthographies, explicitly embodies a related assumption since all the words in correct sentences are constructed from a set of 26 letters and the words are supposed to be selected from a standard dictionary of vocabulary items.  Much more generally, linguists assume these properties are true of spoken language as well.)  For spoken language the discreteness of phonological atoms, that is, of the segmental distinctive feature vectors, guarantees the discreteness of all other linguistic units spelled from them.

 

Of course, it is possible that the SL Hypothesis is correct, but what if it turns out to be partly false? What if words and other apparent linguistic units are sometimes only nondiscretely different from each other (unlike printed letters)? And what if linguistic units like words and phonemes are not always timeless static objects, but turn out to be necessarily, essentially, events in time? If (a) genuine category nondiscreteness exists, or if (b) there exist any linguistic structures that are essentially temporal (as opposed to merely `implementationally temporal'), then the bold symbolic assumption would be seriously compromised.  Certainly the SL hypothesis would not seem to be so obviously correct that one would feel justified in clinging to it as a reliable guide whenever descriptive problems get murky.

 

Instead, we propose that the first steps should be taken toward a new discipline of linguistics -- we might call it `Embodied Linguistics' (even while risking scorn for such a trendy term!). Step one must be to naturalize language and fit it into a human body, that is, first of all, to cast it into the realm of space and time. The first step in doing this is to change our focus of attention from the study of linguistic knowledge (normally conceptualized as static and symbolic) toward the study of linguistic behavior and performance, since behavior and performances exist in time.  We take this step, not because of assumptions about learning (as the behaviorists did; Bloomfield, 1926), but simply because we need temporal information -- timing data -- to discover how the whole system really works.  The cognitive system is assumed to run in time and can only be understood that way. The SL assumption deprives us of temporal data about human language and cognition.  Only through the study of specific language performances under many conditions can accurate understanding of the cognitive form of language arise.  Indeed, we will present evidence in this essay both of the nondiscreteness of particular linguistic patterns and of one language-specific pattern that is essentially temporal.  These results clearly violate the Symbolic Language hypothesis and the Atomic Inventory hypothesis and support a view of language as something that is essentially embodied. Let's turn to the evidence.

 

Speech timing research in general.

 

Research on speech production and perception has shown, from the earliest era in the mid-1950s, that manipulation of aspects of speech timing can influence listeners' perceptual experiences.  These aspects include the slope of energy onset in fricatives, the duration of acoustic intervals corresponding to consonant closures and vowels, the slope of formant frequencies and so on (see reviews in Lehiste, 1970; Klatt, 1976).  For example, chopping off the very front of the [S] in `shore' with a waveform editor, so that the noise onset is very abrupt rather than gradual, creates the perceptual experience of an affricate, as in `chore'. Shortening the duration of a final [s] in `hiss' (keeping the offset of noise intensity gradual by editing from the middle of the [s]) can create the perceptual experience of `his' for English speakers. (Klatt, 1976, reviewed many of these perceptual phenomena.) Of course, for languages with long and short vowels, like Hungarian [tør] tör `break' vs. [tø:r] tőr `dagger', if you lengthen a short one, it sounds like a long vowel, and if you shorten a long one, it sounds like the word with a short vowel (for Hungarian listeners).  Clearly listeners in these cases are using some kind of measurement of onset slope and fricative or vowel duration, extracted on the fly, to make these judgments.  Similarly, speakers are controlling these properties during their production.  Do these phenomena pose a challenge to the traditional theory?  How could the traditionalists maintain that serial order is all that is linguistically relevant about the time axis?  Their answer is that there is both linguistic Competence, the cognitive symbolic system, and Performance, the system of psychological and physical constraints.
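For concreteness, here is a minimal sketch of the kind of waveform manipulation just described (our illustration in Python; the sample rate, sample indices and 60 ms cut are assumptions, not values from any of the cited studies): a fricative is shortened by excising samples from its middle, so that its onset and offset shapes are preserved.

    import numpy as np

    def shorten_from_middle(signal, seg_start, seg_end, remove_ms, sr=16000):
        """Shorten the segment signal[seg_start:seg_end] by cutting
        remove_ms worth of samples out of its middle, leaving the
        onset and offset of the segment intact."""
        cut = int(sr * remove_ms / 1000)
        mid = (seg_start + seg_end) // 2
        return np.concatenate([signal[: mid - cut // 2],
                               signal[mid + (cut - cut // 2):]])

    # Hypothetical usage: cut 60 ms from the middle of a final [s]
    # occupying samples 9000-12000 of a 16 kHz recording of `hiss':
    # shortened = shorten_from_middle(hiss_waveform, 9000, 12000, remove_ms=60)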

 

Traditionalists argue that sensitivity to these temporal patterns - even if they are exploited by listeners in speech perception – is not evidence about the language itself (Jakobson, Fant, & Halle, 1952; Halle & Stevens, 1980; Chomsky & Halle, 1968; and see Lisker & Abramson, 1971).  Thus, in its symbolic form, the Hungarian pair tör vs. tőr, for example, might differ either in the number of segments (that is, one /ø/ vs. two /ø/s) or possibly in whether a single vowel segment has the feature [+long] or not (that is, [ø, -long] vs. [ø, +long]). The actual implementation in time is interpreted to be a consequence of universal processes of interpretation of the symbols, external to the linguistic description per se.  This is a reasonable story.  However, notice that this view still rests on an important theoretical claim about phonetic implementation, one that may prove vulnerable.  The traditional account of speech timing is theoretically tenable only given the assumption that the phonetic implementation processes are universal. Only if phonetic implementation of discrete phonetic symbols works the same way for all languages could it also be true that linguistic utterances are composed entirely of symbols and differ from each other only in symbol-sized steps.  Since it is assumed that languages cannot differ from each other in their phonetic implementation methods, the phonetic implementation must be universal.

 

The phonology is supposed to embody the language-specific properties of speech, while the phonetic inventory and its implementation are universal. They have to be, if the phonetic space is to include all the phonetic capabilities of the human species.  The fact that every language depends on the same phonetic inventory provides a key part of the Chomskyan account of why children can learn language so quickly (since the universal phonetic representation supports their ability to listen to and imitate their mother's speech). This universal space is defined by a specific inventory of symbolic phonetic units that have serial order as their only version of the time scale (Chomsky & Halle, 1968).

 

Now what if it could be shown that what distinguishes a pair of words in some language (or a class of words that share a feature contrast) is intrinsically a temporal property, rather than something intrinsically nontemporal that happens to be implemented in time?  We have seen that the simple fact of differing in duration is not sufficient evidence, since implementation rules might account for the duration effect. But imagine a case where two sound classes differ from each other in some particular durational relationship, such that the duration of some acoustic (or articulatory) segment relative to an adjacent segment must remain, let us suppose, near some particular ratio.   If such a case could be found, and if it could also be shown that the temporal ratio shows no indication of being an implementational universal, then such a case would be strong evidence that temporal constraints can be intrinsic to specific languages. A static, durationless symbol would serve as a very poor model of such a situation. Most importantly, it would call into question the conventional isolation of time from linguistic cognition and would imply that the phonological grammars of particular languages incorporate some temporal properties.

 

Section II: Germanic Voicing (or Tensity) Contrast: Temporal Aspects.

 

One of the best examples of such a language-specific temporal pattern used for phonological purposes is the contrast in English among stops and fricatives between those transcribed with /b, d, g, z, …/ and those transcribed with /p, t, k, s, …/.  That is, we focus on a `feature', or a class of contrasts, in the coda position of a syllable (that is, near the end of a syllable, after the main vowel). In English, pairs of words like `lab-lap', `build-built' and `rabid-rapid' contrast in this feature, as do German Bunde-bunte (club-Plur., colorful-Nom. Sing.), Swedish bred-brett (broad-Basic, broad-Neuter) and Icelandic baka-bakka (to bake, burden-Acc.).  This type of pattern was probably a characteristic of the ancient Germanic proto-language of 2-3 thousand years BP and has been inherited in somewhat different form by most modern Germanic languages.  The contrast is between pairs of stops and fricatives, like /z-s, b-p, d-t/, etc., at the end of a syllable or between syllables.  Because of the importance of several different articulatory and acoustic correlates of this feature, there has been some dispute over the years as to the proper characterization of the difference - whether it is essentially a feature of `Voicing' (that is, glottal closure and pulsing) or some complex of other factors usually summarized in the term `Tensity', where /s, p, t, …/ etc. are [+tense] and /z, b, d, …/ etc. are [-tense] or [lax].

 

It is clear that one characteristic of this contrast is that it depends significantly on timing to maintain the distinction.  In particular, in words with voiceless consonants, like English `lap' and `rapid', the preceding vowel is shorter and the stop closure is longer, relative to the corresponding words with /b/ (e.g., `lab', `rabid') (Peterson & Lehiste, 1960; Lisker, 1985; Port, 1981).  Of course, since speakers typically talk at different speaking rates, the absolute durations of vowels and consonants are highly variable measures in milliseconds. It is not the case that absolute durational values are employed by listeners (since in that case they would both produce and perceive more /p/s and /s/s at slow rates and more /b/s and /z/s at faster rates, but they do not).  Instead, if all other properties are kept constant, the ratio of durations tends to be much less variable than, e.g., the absolute durations in ms, as shown in Figure 1.

 

Perceptual experiments with synthetically constructed speech or experimentally manipulated natural speech confirm that it is the relative durations that determine judgments between minimal pairs like lab-lap and rabid-rapid whenever other cues to the `tensity' feature are ambiguous (e.g., Lisker, 1985; Port & Dalby, 1982).  Port has called this relationship the ``V/C ratio'' - the duration of a vowel relative to the following obstruent constriction duration.  This ratio is smaller for the [-voice] (or [+tense]) consonants than for the corresponding [+voice] (or [-tense]) consonants.  The V/C ratio is relatively (though not completely) invariant across changes in speaking rate, syllable stress and segmental context (Port, 1981; Port & Dalby, 1982).

 

Figure 1. Stimuli and results from the Port & Dalby (1982) study of consonant and vowel timing as cues for voicing in English. The top panel shows sound spectrograms of two of the synthetic stimuli employed. These show the shortest (140 ms) and longest (260 ms) vowel durations for dib. For each vowel duration step, nine different silent medial-stop closure durations were used (= total of 40 stimuli). Subjects were asked to identify them as dibber or dipper. The lower panels show the percent identification as dibber as a function of medial stop closure duration (on the left) and (on the right) as a function of the consonant/vowel ratio (where the stop closure duration of each stimulus is divided by its vowel duration). Looking at the perceptual boundary (50%-50% identification), note the much tighter clustering in the right panel around a C/V ratio of .35.
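To make the role of the ratio concrete, here is a minimal sketch (our illustration in Python, not the authors' analysis code) of a classifier using the C/V boundary of roughly .35 visible in Figure 1; the boundary value and the 140-260 ms vowel range come from the figure, while the example closure duration is an assumption.

    def classify_coda_voicing(vowel_ms, closure_ms, cv_boundary=0.35):
        """Label an English coda stop from the closure/vowel duration
        ratio; ~.35 is the perceptual crossover in Figure 1."""
        return "voiceless" if closure_ms / vowel_ms > cv_boundary else "voiced"

    # The same 70 ms closure flips category as the vowel shortens,
    # which is just the rate-invariance point made in the text:
    print(classify_coda_voicing(vowel_ms=260, closure_ms=70))  # voiced    (C/V = .27)
    print(classify_coda_voicing(vowel_ms=140, closure_ms=70))  # voiceless (C/V = .50)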

 

In several other Germanic languages, similar measurements of speech production timing and perceptual experiments using manipulations of V and C durations have shown that listeners pay attention especially to the relative duration of a vowel and the constriction duration of a following obstruent, that is, stop or fricative (Port & Mitleb, 1983; Pind, 1995).  So apparently this durational pattern is important for the production and perception of these Germanic languages.

 

Could the V/C Durational Ratio be a Temporal Universal?  The first thing to do when a durational invariant is found is to check whether the timing difference reflects a universal and unavoidable articulatory correlate of a contrast in, say, glottal pulsing during a consonant constriction.  If this turned out to be the case, it would support the standard view that the phonology, specified in terms of patterns of universal phonetic features, is the locus of all differences between languages. `The phonetic capabilities of man', said Chomsky and Halle, are an inventory fixed for, at least, historical time. This species-wide alphabet is said to facilitate rapid language acquisition by infants and to assure that speakers of different languages can learn each other's language (given enough time).  So for this temporal pattern to show evidence of being universal, whether in association with some segmental feature or not, would fit the view that it is only in the choice and deployment of static features that languages may differ from one another.  Any temporal differences that might show up could only occur as the result of the speech production or perception apparatus, which is, by hypothesis, universal.

 

Phonetic Implementation.  To deal with these phenomena for speech production within the symbolic view of language, one might postulate an implementation system (outside language) that takes a (universal) alphabet of phonetic symbols as input and translates them into output gestures with particular temporal constraints. Individual features will differ in their effects on the temporal behavior of the output device.  In this case, the implementation rule would have to assure that the ratio of the vowel duration to the following consonant constriction is about 1.5 or greater for the +Voice case (so the vowel is quite a bit longer than the consonant constriction), and closer to 1 for the –Voice case (so the V and C are about the same duration).  Accounts along this line have been proposed for similar phenomena by, for example, Chomsky and Halle (1968, Chapter 10), Halle and Stevens (1980), Klatt (1976), Port (1981) and others. 

 

To see how such implementation systems might work, we will look more closely at Klatt's (1976) model (also Port, 1981).  Klatt observed, first, that although many factors might influence the duration of a vowel, the effects of each factor depend in part on how many others apply at the same time.  For example, comparing the duration of the vowel in rabid vs. rapid (where the voicing feature will make the vowel shorter in rapid) with the similar difference in lab vs. lap, the vowel might be 12% longer in rabid than rapid but 18% longer in lab than lap.  The reason for the difference is that the vowel is longer overall in both lab and lap than in rabid and rapid, due to the presence of an additional syllable in the rabid-rapid pair.  This nonlinearity led Klatt to propose that each vowel might have a minimum duration beyond which it could not be compressed. He then applied constant-ratio (that is, linear) timing rules to the remainder.  Thus each vowel had an `inherent duration' – for [æ], let's suppose, 100 ms, observed from monosyllabic words with a final voiced stop (as in lab or lag) spoken in isolation – and also a minimum duration, estimated as 65% of the inherent duration. Constant-ratio rules would then lengthen or shorten the remaining interval (here 35 ms) between the inherent duration (here 100 ms) and the minimum duration.  Thus, to implement the vowel duration in, say, lacky, we take the 35 ms (inherent minus minimum), multiply by 0.6 (to shorten for the following voiceless stop) and then by 0.7 (to shorten for the second syllable), and add the resulting 14.7 ms (= 35 x 0.6 x 0.7) to the minimum duration. This gives a target duration for [æ] in lacky or backing or rapid of 79.7 ms (± noise). By this method, every segment either has its `inherent duration' or one derived from it by such temporal implementation rules, depending on the context.
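Klatt's scheme is compact enough to state in a few lines of code. The sketch below (our illustration in Python; the 0.6 and 0.7 shortening factors and the 65% minimum are the illustrative numbers from the text above, not Klatt's full rule set) reproduces the arithmetic just given.

    def klatt_duration(inherent_ms, factors, min_fraction=0.65):
        """Klatt-style segment duration: only the portion of the
        inherent duration above a fixed minimum is compressed, one
        constant-ratio factor per applicable context rule."""
        minimum = min_fraction * inherent_ms
        compressible = inherent_ms - minimum
        for f in factors:
            compressible *= f
        return minimum + compressible

    # Vowel [ae], inherent 100 ms, shortened for a following voiceless
    # stop (x 0.6) and for being in a two-syllable word (x 0.7):
    print(klatt_duration(100.0, [0.6, 0.7]))  # 79.7, as in the text

Note how the multiplicative factors can never drive the vowel below its 65 ms floor, which is exactly the nonlinearity Klatt was trying to capture.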

 

Leaving aside the issue of how accurately this way of specifying the rules works (there are, of course, several other proposals for how to write such rules, e.g., van Santen, 1996), there are many reasons why this entire approach is implausible. The first problem is the question of what use these durations in milliseconds might be. Who or what will be able to use these numbers to actually achieve a target duration of 79.7 ms for this vowel?  There is no existing model of motor control that could employ such specifications.  We would need another theory of motor control to make use of these ``specs'' and attempt to generate a vowel of n milliseconds duration (see Fowler, Rubin, Remez, & Turvey, 1981; Port, Cummins, & McAuley, 1995).  Second, durations in milliseconds seem fundamentally misguided since speakers talk at a range of rates. So it seems that it should be relative durations that are employed, not absolute durations (see Port, Cummins, & McAuley, 1995). Third, such a system has no apparent way for global timing patterns (e.g., regular stress timing) to influence timing.  This model computes a duration for only one segment at a time. Longer intervals get their durations just by adding up the individual segments that comprise them.  Finally, fourth, if the durational pattern for the Germanic voicing contrast is really the relative duration of a vowel and the following consonant, this kind of model will have difficulty. The vowel-duration effect of the voicing is computed by a context-based rule like the one above, while the stop closure is just the inherent duration of the following consonant. It is difficult to see how a rule-governed duration and an inherent duration could be coordinated to assure a particular durational ratio, since the ratio itself plays no role in the rules.

 

Despite these implausible features, it is difficult to prove the impossibility of such an account. After all, formal models that can simulate a Turing machine are very likely to be able to deal with such relational temporal phenomena by some brute-force method. But an implementational solution along this line is only interesting if certain very specific constraints are applied to the class of acceptable formal models, as Chomsky has frequently pointed out (1965). And if one can always add additional phonetic symbols to the universal set and apply as many rules as one pleases, then it could be claimed that only the Germanic language group happens to employ a particular feature that is universally implemented in this temporal way (even though no non-Germanic languages have been observed to exhibit it).  But to permit the postulation of new symbols for every new temporal effect is surely not sound scientific method.

 

Yet, short of such a proliferation of unique features, an implementation rule for the Germanic voicing effect cannot be universal.  Most languages in the world (including, for example, French, Spanish, Arabic and Swahili) do not exploit the relative duration of a vowel and the following stop or fricative constriction as a correlate of voicing (Chen, 1970; Port, Al-Ani, & Maeda, 1980).  We know from classroom experience that if you play English stimuli varying in vowel and/or stop closure duration -- stimuli that lead native English speakers along a continuum from rabid to rapid -- those stimuli with varying V/C ratio will tend not to change voicing category at all for French, Spanish or Chinese listeners. Their voicing judgments are almost completely unaffected by V/C ratio. They just pay attention to glottal pulsing during the constriction. Such durational manipulations may affect the naturalness of the stimuli, but do not make them sound more voiced or less voiced.

 

The most plausible conclusion to be drawn from this situation is that the Germanic languages share the property of manipulating the V/C ratio to distinguish classes of words from each other.  English listeners, at least, make a categorical choice between two values of a feature that might be described as `Voicing' (or as `Tensity' or `Fortis/Lenis'). But there is nothing universal about this property. It just happens to be a way that one family of closely related languages controls speech production and speech perception to distinguish vocabulary items.  Thus, we have an undeniable temporal pattern, one that requires some rather specialized machinery to produce or perceive, yet which apparently must be a learned property of the phonological grammar of specific languages, serving as a `feature' for contrasting sets of words. To call this distributed temporal pattern a `symbol', that is, a static token, is to make it impossible to see what it really is: an intrinsically temporal pattern that is part of the specification of words in this group of languages.

 

 

Section III: Prosodic Timing Effects

 

There are other well-known timing effects that are related to the prosodic patterns of languages.  Kenneth Pike first commented in 1945 that the languages of the world seem to fall into two general types as far as their prosodic pattern is concerned: stress-timed languages and syllable-timed languages. In modern linguistics, prosodic timing is most often viewed in terms of this dichotomy. The taxonomy rests on the impression that languages seem to have distinct rhythmic styles. Languages like English and Russian seem to be stress-timed, since stressed syllables have a tendency to be regularly spaced in time, whereas in French and Spanish, it seems that each syllable is trotted out in a regular rhythm. This is based on auditory impressions of regular time intervals, as checked, for example, by tapping a finger on a table (Jones, 1918/1932; Pike, 1945; Abercrombie, 1967).  The taxonomy implies a hypothesis of isochrony, or equal spacing in time. Impressionistically, equal time intervals seem to apply at the level of stressed syllables in English and Russian. Thus, if we say, e.g., `He EATS poTAtoes toDAY', the stressed syllables seem equally spaced. One could tap a finger for each one. But if we say `He's EATen the poTAtoes toDAY', it seems like the timing is almost the same, even though there are now two additional unstressed syllables inserted between EAT- and -TA- (especially if you tap your finger on each stress). On the other hand, in French, if we say `Je ne parle pas francais', it seems like a finger could be tapped for each of the 5 syllables, and that all are about equally spaced.   Pike, seconded by Abercrombie, suggested that all languages might fall into one timing type or the other.

 

When careful instrumental measurements are made, however, it is typically found that neither inter-stress intervals nor inter-syllable intervals are actually produced isochronously (Classé, 1939; Bolinger, 1965; Pointon, 1980; Wenk & Wioland, 1982; Tajima, 1998).  Of course, it should not be a surprise that perfect isochrony is not found; that is a very demanding test.  Naturally, lacking a realistic theory of timing and given such a difficult criterion, experimental studies have repeatedly failed to support hypotheses of isochrony. The question is ``Just how regular does the timing have to be to support such a hypothesis?''  No answer was available.

 

Furthermore, investigation of the mora in Japanese during the 1980s called the original simple taxonomy into question: a third type of timing pattern was revealed. In Japanese, a mora can be a simple CV syllable, like -ta-, but syllables with a long vowel, like -too-, or a final consonant, like -han-, count as two moras.  Thus a word like, say, Honda has 3 moras, just like Fujita, and Tookyoo has 4 moras, like Fujiyama. In the Japanese syllabic writing system (that is, in the hiragana and katakana scripts), each mora is written with a single symbol.  The traditional description of the mora found in Japanese pedagogy holds that it is an isochronous unit, that is, that all moras have equal duration (so that Honda and Fujita should take the same amount of time to pronounce). Despite experimental support for the mora as an isochronous temporal unit (Han, 1962; Port et al., 1980; Homma, 1981), it has remained somewhat controversial among phoneticians (Beckman, 1982).
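The counting rule just described is easy to state explicitly. Here is a minimal sketch (our illustration in Python, under the simplifying assumptions that the word is already divided into romanized syllables and that long vowels are written with doubled letters); real Japanese phonology has further cases this ignores.

    def count_moras(syllables):
        """Count moras in a romanized Japanese word, given its syllables.
        A short CV syllable is one mora; a long vowel or a coda consonant
        (final nasal or first half of a geminate) adds a second mora."""
        long_vowels = ("aa", "ii", "uu", "ee", "oo", "ou")
        total = 0
        for syl in syllables:
            total += 1                                # the syllable itself
            if any(lv in syl for lv in long_vowels):  # long vowel, e.g. -too-
                total += 1
            elif syl[-1] not in "aeiou":              # coda consonant, e.g. -han-
                total += 1
        return total

    # The examples from the text:
    print(count_moras(["hon", "da"]))       # 3, same count as Fujita
    print(count_moras(["too", "kyoo"]))     # 4, same count as Fujiyama
    print(count_moras(["fu", "ji", "ta"]))  # 3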

 

Disagreements about the mora arose primarily over whether each individual mora has the same duration as every other mora -- so that any compensation for inherently longer or shorter consonants and vowels is achieved within the mora (Beckman, 1982) -- or whether compensation may occur across neighboring moras. In the latter case, the regularity of mora timing should be observable only by looking at a window of several moras, e.g., a whole word or phonological phrase (Port et al., 1980; Port et al., 1987; Han, 1994).

 

 

 

Figure 2. Five native Japanese speakers (Port et al., 1987) read words from 6 word groups (illustrated in Table 1). The word durations are pooled across subjects according to the number of moras. Each word group is labeled by its initial mora (so ka is the first mora in all tokens in that group). As the number of moras increases, the word duration increases linearly at both fast and slow tempos. These data show that word duration is almost completely determined by the number of moras, not the number of syllables. Other experiments show that if a mora is inherently short, then segments in the preceding and following syllables stretch to compensate.

 

Number of moras    Word              English gloss

HI group
1                  hi                sun
2                  hika’             subcutaneous, hypodermal
3                  hi’kka            faux pas, misstatement
4                  hikka’ku          to scratch, claw
5                  hikkake’ru        to hang, hook
6                  hikkakena’i       not to hang
7                  hikkakerare’ru    get hung

Table 1. An example of the number of moras for one group of words used in Port et al. (1987). The other groups followed a similar pattern of stacking up moras. The apostrophe indicates a pitch accent in the utterance.

 

In the Port et al. (1987) study it was found that whole words with the same number of moras had nearly the same duration, no matter what their segmental content or number of syllables (Figure 2 and Table 1).  Changes in speaking rate simply shortened or lengthened the average mora duration. Other experiments show that, within a word, neighboring segments within the mora as well as adjacent moras expand and compress in compensation for their neighbors. Still other experiments have shown that perceptual segmentation of speech based on mora units is also found in prelexical processing in Japanese (Otake, Hatano, Cutler & Mehler, 1993; Cutler & Otake, 1994), suggesting that the mora is highly salient as a unit in the production and perception of Japanese. So, if measurements are done just the right way, Japanese does show a type of isochrony, even if it is neither syllable timing nor stress timing. More rigorous methods are surely needed for studying English and other languages.  No one should have expected absolutely equal durations for any units in speech. Furthermore, English speakers can sometimes exhibit apparent syllable timing, for example in certain styles of babytalk. So a more flexible experimental and theoretical framework was needed to explore these issues without assuming that only one timing principle can apply to each language.
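The analysis behind Figure 2 is essentially a linear regression of word duration on mora count. The sketch below (our illustration in Python; the duration numbers are invented for the example, since the real pooled measurements are in Port et al., 1987) shows the form of the computation.

    import numpy as np

    # Hypothetical word durations (ms) for words of 1-7 moras, standing in
    # for the pooled measurements of Port et al. (1987):
    moras = np.array([1, 2, 3, 4, 5, 6, 7])
    durations = np.array([190, 340, 500, 645, 810, 955, 1120])

    # Least-squares line: duration ~ intercept + slope * n_moras
    slope, intercept = np.polyfit(moras, durations, 1)
    predicted = intercept + slope * moras
    ss_res = np.sum((durations - predicted) ** 2)
    ss_tot = np.sum((durations - durations.mean()) ** 2)
    print(f"{slope:.0f} ms per mora, R^2 = {1 - ss_res / ss_tot:.3f}")

A near-perfect linear fit with mora count as the only predictor is what `word duration is almost completely determined by the number of moras' means in practice.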

 

Speech Cycling Experiments. 

 

The idea of speech cycling evolved from research methodologies employed in studies of limb motor control, where the behavior of the limbs is studied while two periodic patterns interfere or interact with each other (e.g., Haken, Kelso, & Bunz, 1985; Kelso, 1995; Treffner & Turvey, 1993). The idea of speech cycling is to study the timing of repetitive speech when it interacts with, or is coupled with, an external auditory metronome (Cummins & Port, 1998; Cummins, 1997; Tajima, 1998; Port, Tajima, & Cummins, 1999; Tajima & Port, in press).  Subjects repeat a short piece of text in time with a computer-controlled metronome-like pattern. It turns out that there are strong constraints on the performance of these tasks, some of which are apparently universal, but others of which differ characteristically from language to language.  If it is true that speakers of different languages perform such tasks in ways that differ between languages, this is further evidence that the actual `grammars' employed by native speakers include temporal characteristics.

 

So, for example, notice that if English speakers repeat a phrase like `Take a pack of cards, Take a pack of cards, Take a pack of cards, …' at a rate of about 1 repetition per second, they are likely to pronounce it putting a pitch accent on both take and cards.  The pattern of timing of the beats for these two pitch accents is very likely to be one or the other of two rhythms -- either a slow two-beat pattern (as in ``1, 2; 1, 2; TAKE a pack of CARDS; TAKE a pack of CARDS'') or else a fast three-beat pattern (``1, 2, 3; 1, 2, 3; TAKE a PACK of CARDS; TAKE a PACK of CARDS'').  The 2- and 3-beat harmonic timing patterns are probably universal (since these meters show up in most musical traditions), but locating the onsets of pitch-accented syllables at the beats may be an idiosyncratic property of English.

Figure 3. This figure demonstrates the use of speech cycling in Cummins & Port (1998). In this study, a corpus of 30 short phrases with identical prosodic structure was used. Each phrase was of the form ``X for a Y'', like beg for a dime. Subjects repeated these phrases to a metronome pattern, and the beat, or vowel onset location, was found for each target syllable. A succession of 14 pairs of alternating high (H) and low (L) tones was used in each trial. The interval from the high to the low tone was fixed at 700 ms (to keep speaking rate constant). The relative time of the low tone to the next high tone varied randomly over the range .2-.7 (observed phase = the duration in milliseconds between the H and L tones divided by the duration from one H tone to the following one). The subjects were asked to speak the phrases such that the first word of each phrase (e.g., beg) lined up with the high tone and the last word (e.g., dime) lined up with the low tone. The simplest hypothesis is that there are no rhythmic constraints on speech production. If this were true, subjects should have produced the onset of the final stress (dime) wherever the L tone occurred in the phrase cycle (from .2-.7). Instead, the histograms are strongly multimodal, with three clear modes near .33, .5, and .66 (subject KA showed only two modes, at .3 and .5). Thus the subjects could not accurately perform the task but were strongly biased to locate the final syllable onset at harmonic fractions of the complete H-H cycle (that is, at 1/2, 1/3 or 2/3).
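The phase measure defined in the caption, and the attraction to harmonic fractions, can be stated compactly. This sketch (our illustration in Python; the beat times are invented) computes the observed phase of a produced syllable onset and reports which harmonic attractor it falls nearest.

    def observed_phase(t_high, t_target, t_next_high):
        """Phase of a target-syllable onset within one repetition cycle:
        time since the cycle-initial H beat divided by the H-to-H period."""
        return (t_target - t_high) / (t_next_high - t_high)

    def nearest_attractor(phase, attractors=(1/3, 1/2, 2/3)):
        """The harmonic fraction of the cycle nearest to the produced phase."""
        return min(attractors, key=lambda a: abs(phase - a))

    # Hypothetical trial: H beats 1200 ms apart, final stressed syllable
    # produced 610 ms into the cycle:
    phi = observed_phase(0.0, 610.0, 1200.0)
    print(round(phi, 3), nearest_attractor(phi))  # 0.508 0.5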

 

By picking analogous text material in two languages and urging subjects to adopt specific rhythmic patterns, it is possible to compare the ease and difficulty of various temporal patterns in the two languages. For example, Tajima & Port (in press) employed a three-beat (waltz-like) speech cycling pattern to reveal distinct temporal characteristics for English and Japanese. It was found that English speakers show more of a `stress-timing effect' than do Japanese speakers in similar situations.  Thus the characterization of at least English and Japanese as stress-timed and mora-timed can be justified experimentally. Furthermore, they found evidence that individual English speakers can slip from timing that favors equal timing for individual syllables to timing that is more strongly concerned with the spacing of just the stressed syllables.

 

These data strongly suggest that prosodic features of timing should be included as part of a linguistic grammar – if by a grammar we mean the idiosyncratic characteristics of each language that differentiate it from others. One way in which languages differ, as shown by speech cycling experiments, is in how syllables are identified as `prominent' (cf. Tajima, Zawaydeh & Kitahara, 1998; Cutler et al., 1986). Since there are temporal aspects that differ from language to language, the notion of a grammar must be expanded to include such properties. The alternative, to leave timing out of our understanding of language, is to underestimate what is involved in speaking a language competently.

 

Entrainment to Periodic Patterns.

Even leaving aside the issue of language differences in timing, the very naturalness of speech cycling tasks – that is, the tendency, even without a metronome, to adopt a rhythmical pattern in our speech, especially when the speech is repetitive – reveals something fundamental about human language.  Like so many other human behaviors, speech is very naturally and unavoidably entrained to periodic patterns. There are countless examples of periodically structured speech: songs, auctioneering, preaching, sports announcing, canonical babbling in infants, marching chants, work songs, etc. (Port et al., 1999; Ejiri, 1998). Even everyday speech tasks like reading a list of words aloud are enough to strongly encourage regular speech periodicity (Leary & Muench, 1998). Like the speech cycling technique, these examples of everyday speech remind us of the ubiquity of cyclic behavior: from breathing and running to pacemaker cells in the heart. Is it surprising, then, that speech exhibits timing similarities to other biological and neurological functions?  We believe it is a good bet that many cognitive functions, both linguistic and non-linguistic, will eventually be shown to be entrainable to rhythmic patterns at some appropriate rate.

 

To acknowledge similarities in the timing of speech to timing found in other biological systems (animal and human) implies a description of speech that is dynamical and changes in time[1]. Formal logic and symbolic phonological accounts of prosody (Chomsky & Halle, 1968; Liberman & Prince, 1977; Selkirk, 1984; Hayes, 1995) provide us with no basis for interpreting the temporal characteristics of stressed and unstressed syllables (van Gelder & Port, 1995; Port et al., 1999). Yet, as we have seen, speech prosody does affect the timing of language in many important ways. So once again, the strong tendency of speech to entrain to periodicity like other motor actions is evidence supporting the view that the linguistic structure is intrinsically dynamic, not symbolic.

 

Section IV: Nondiscreteness in Language

 

We have pointed out that certain phenomena are counterevidence to the Symbolic Language hypothesis. One kind of counterevidence is the demonstration that temporal phenomena are sometimes intrinsic to language, as in the previous sections. A second kind would be a convincing demonstration of patterns that are linguistically distinct and yet not discretely different – not different enough to be reliably differentiated, yet not the same either. This is a difficult set of criteria to fulfill, but in fact such a situation has been demonstrated in a number of experiments on several languages.

 

Position of stop    German word                        English gloss

Initial             der Back                           mess table
                    der Pack                           pack, bundle

Medial & final      Plural            Singular
                    Alben [alben]     Alb [alp]        elf
                    Alpen [alpen]     Alp [alp]        mountain pasture

Table 2. Some examples illustrating the traditional data regarding “word-final devoicing” in German. Using an atomic inventory of static features, the following phonological rule is said to apply: [-sonorant] → [-voice] / ___ $, where $ is a syllable boundary.

 

The best studied case is the near-neutralization of voicing in syllable-final position in Standard German. Here, final voiced stops and fricatives, as in Bund vs. bunt (`club', `colorful'), are said to neutralize to the voiceless case, as shown in Table 2. That is, although Bunde and bunte show that the words contrast in the voicing of the apical stop, the pronunciations of Bund and bunt seem to be the same. Both sound like [bUnt].  But the difficulty is that they are not pronounced exactly the same (Port & O'Dell, 1985; Port & Crawford, 1989).  These pairs of words, with final stops and fricatives, actually are slightly different, as shown in Figure 4.  If the words were the same, then in a listening task you would expect 50% correct identification (pure guessing, like English too and two); if different, you would expect 99% or better correct identification under good listening conditions (like German Bunde and bunte).  Instead, they are different enough that listeners can guess correctly which word was spoken with only about 60-70% accuracy (Port & Crawford, 1989)!  Such performance shows that the word pairs are neither clearly the same nor clearly different. The voicing contrast is almost neutralized in this context, but not quite. The differences can be measured on sound spectrograms, but for any measurement one chooses (vowel duration, stop closure duration, burst intensity, amount of glottal pulsing during the closure, etc.), the two distributions overlap considerably.

Figure 4. Schematic waveforms of several sample German word pairs from one of the speakers in Port & O'Dell (1985). The onset of the first vowel begins at 0 ms, extending for the length of the white rectangle; the smaller gray rectangle is the period of voicing visible during stop closure; the straight line is the voiceless portion of the stop closure; and the triangle represents the stop burst duration (the release of the stop). These results do not support the notion of a static, binary voicing feature ([±voice]). While the timing for the voiced and voiceless word pairs is similar, there is a tendency for the vowel before the ``underlying'' voiceless obstruent (e.g., the vowel in Alp) to be slightly shorter than the vowel before the voiced one. There is also more voicing into the stop closure for the voiced stops, and a longer stop burst for the `underlying' voiceless stops than for the corresponding voiced ones.

 

 

One might hope that this could be blamed on some listeners reliably differentiating the words and others just guessing.  But most listeners do better than chance.  So these minimal pairs lack an essential property of any symbol token (Manaster-Ramer, 1996; Port, 1996): they are neither reliably identical nor discretely distinct. Yet this situation could not be a mere performance effect either, since all speakers of the language (or at least most) exhibit this rule-governed pattern, in many dialects including Standard German.  Nor is this situation unique. A similar phenomenon is found in most varieties of American English, where the /t/ and /d/ in butting and budding are said to be `neutralized' to an apical flap. But again, statistically distinct distributions can be found, and listeners can typically guess which word was spoken with better than chance accuracy (Fox & Terbeek, 1976; Port, personal communication).
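The arithmetic of `slightly different but heavily overlapped' is easy to illustrate with a small simulation (entirely invented parameter values, not the German measurements): give the two word classes vowel-duration distributions whose means differ by a fraction of their spread, and score a listener who always uses the optimal criterion.

    import random

    random.seed(1)

    # Illustrative distributions of a single cue, vowel duration (ms),
    # before underlying /d/ vs /t/: means 10 ms apart, sd 15 ms.
    def one_trial():
        voiced = random.random() < 0.5
        duration = random.gauss(110 if voiced else 100, 15)
        guess_voiced = duration > 105   # criterion midway between the means
        return guess_voiced == voiced

    accuracy = sum(one_trial() for _ in range(100000)) / 100000
    print(f"{accuracy:.1%}")  # about 63%: above chance, far below ceiling

Even an ideal observer of such overlapped distributions lands in the 60-70% range reported by Port & Crawford (1989), which is exactly the region between `same' and `discretely different'.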

 

Altogether then, there is considerable evidence that several essential and unavoidable predictions of the SL hypothesis have clear exceptions. In our view, the SL hypothesis, along with the AP hypothesis, must be abandoned. Surely some properties of language are nearly symbol-like, but apparently that is not all there is to human languages. This means that linguistics cannot make the convenient assumptions of timelessness and digitality.  We believe this implies that linguistics must assume a rather different kind of processing system underlying the grammars of skilled speakers of a language. These production and perception mechanisms will:

 

1. process stimulus phenomena and interact with the world in real time,

2. exhibit a tendency to behave periodically and to entrain to periodic patterns at the time scale of the mechanisms of body movement (say, with periods longer than 20 ms); that is, the physical systems of our bodies, and the neural system coupled to them in real time, cannot help being dynamical and behaving like oscillators under certain conditions (see the sketch following this list), and

3. exhibit a strong tendency toward discrete categoricity in the production and perception of speech sounds, without being required to achieve it successfully all the time.
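
As a minimal illustration of the entrainment mentioned in point 2 (a standard textbook toy, not a model of any particular speech mechanism), consider the sine circle map, in which an oscillator driven by a periodic input settles into a fixed phase relation whenever the coupling is strong enough:

    import math

    # Sine circle map: one driving cycle advances the oscillator's phase
    # by omega, while the coupling k pulls it toward the driver. The
    # parameter values are illustrative.
    def circle_map(theta, omega=1.02, k=0.8):
        return theta + omega - (k / (2 * math.pi)) * math.sin(2 * math.pi * theta)

    theta = 0.3
    for _ in range(200):            # let transients die away
        theta = circle_map(theta)
    advance = circle_map(theta) - theta
    print(f"phase advance per driving cycle: {advance:.3f}")  # 1.000 = locked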

 

Of course, a satisfactory psychological theory for linguistics to build on does not yet exist. But there are some attempts at theories of speech perception and production that appear promising to us. They will be introduced in the next two sections.

 

 

Section V: Dynamical Models of Speech Perception

 

Naturally, there have been a few who have insisted on biologizing, or embodying, linguistic behavior all along. There is work on the peripheral aspects of language in this mode – both models of speech perception and models of speech production. Some of the models of speech perception are now fairly sophisticated and plausible. Grossberg has been developing his ART model, Adaptive Resonance Theory, for many years. This system models pattern recognition based on partial matches and learns to recognize new patterns given repeated exposure to novel inputs. The model is able to continue learning at all times and simultaneously to acquire new categories when needed (Grossberg, 1980, 1986, 1995).  In recent years Grossberg and his group have elaborated the model to directly address issues in the speech perception literature that have sat on the table for a decade with few attempts at direct modeling.

 

To describe the ART model, we can look first at how it categorizes sensory inputs. An archetypal case might be recognizing a letter or other visual object. An input light pattern excites a particular set of feature detectors in a field called F1. (The features that appear in F1 were extracted earlier, from repeated stimulation, by an unsupervised learning process.)  A neural network with learned weights connects this field of features to an F2 field. F2 has mutually competitive units specialized for each basic category, which have also been learned by the system from exposure to the environment.  Weights on the F1-F2 connections assure that one unit in F2 (where a unit can be either a single node or a group of nodes) will be somewhat more excited than any of the others. As soon as this excitation exceeds a threshold, this unit begins to inhibit its competitors and sends a feedback signal back to the units of F1, exciting only the F1 units that developed a learned statistical relationship with this F2 unit during previous exposures. This feedback signal represents an attempt to match expectations to the actual input pattern in F1.  If enough F1 units receive input from both the sensory system and top-down feedback, then a `resonance loop' is established: F1 excites F2, which excites F1, and so on – the system has reached a stable attractor.  (If the match between expectations and sensory inputs is not good enough, then the first-guess F2 unit is inhibited and another cycle begins using an alternative F2 hypothesis.) The perceptual experience of `seeing' an object or `hearing' a word spoken, according to Grossberg, can only occur when a resonance loop has been achieved.
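
A minimal sketch of this search-and-match cycle, assuming binary feature vectors and an already learned set of F2 prototypes, might look like the following. The function name, the scoring rule, and the vigilance parameter are our illustration of the general scheme, not Grossberg's implementation, and learning is omitted:

    import numpy as np

    def art_categorize(input_pattern, prototypes, vigilance=0.6):
        """Return the index of the resonating F2 category, or None."""
        x = np.asarray(input_pattern, dtype=float)
        candidates = list(range(len(prototypes)))
        while candidates:
            # Bottom-up (F1 -> F2): the best-matching remaining unit wins.
            scores = [np.minimum(prototypes[j], x).sum() /
                      (0.5 + prototypes[j].sum()) for j in candidates]
            j = candidates[int(np.argmax(scores))]
            # Top-down (F2 -> F1): compare the expectation with the input.
            match = np.minimum(prototypes[j], x).sum() / max(x.sum(), 1.0)
            if match >= vigilance:
                return j              # resonance: the loop stabilizes
            candidates.remove(j)      # reset: inhibit this unit, try another
        return None                   # no learned category fits the input

    protos = [np.array([1, 1, 0, 0]), np.array([0, 0, 1, 1])]
    print(art_categorize([1, 1, 0, 1], protos))  # 0: partial match resonates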

 

Grossberg's group has recently shown that with very little modification, this model can also recognize word-like patterns at F2 with auditory stimuli arriving over time, and can even employ temporal contrasts to distinguish words despite variation in rate (Boardman et al., 1999; Grossberg & Myers, in press).  Since all the linguistic units (phonemes, words, etc.) are discovered by the system on the basis of statistical analysis of inputs, discreteness and context independence are not required properties. Fuzzy categories, partially discrete categories and context-dependent variants would appear to be quite natural results of the acquisition processes employed by this kind of system.  The most economical representation will not necessarily be the shortest or the one requiring the fewest stored objects.

 

Masking Fields.   To apply ART to spoken language in general requires some additional theoretical development, since auditory inputs that arrive over time are also hierarchically structured as language-specific sound segments, morphemes, words, phrases, etc. The notion of a masking field is an interesting idea – even though it has not yet been fully developed (Cohen & Grossberg, 1987; Grossberg, 1986). A masking field is a considerable expansion of the F2 level in which there are units corresponding to perceptual objects on several spatial or temporal scales. Any word or phrase has other, shorter alternative parsings. Thus, if we pronounce the word catalogue, we have also pronounced cat, cattle, a log, etc., as well as just the phonetic segments /k, Q, R, I, l, A, g…/.  Why do we hear only the single word, rather than the various other possible components that are also somehow there?  Why do we `see' (that is, have an awareness of seeing) a letter E, when we might also, or instead, have `seen' one vertical bar and three horizontal bars?  The masking field allows all the alternative coherent parsings to be activated, but there is a hardware bias in favor of larger or longer identification units, so that they usually win out over shorter ones and reach the state of resonance. This bias for the longest possible unit compatible with all the evidence accounts for why we are aware only of catalogue but not cattle, log, etc., as the toy parser below illustrates.
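
The flavor of this bias can be conveyed with a toy parser, entirely our own construction, which uses orthography (and the American spelling catalog) in place of phonetic segments: every parse that covers the input with lexical units is activated, and a scoring rule favoring longer units lets the whole word mask the pieces:

    # Toy masking-field bias: enumerate all lexical parses of the input,
    # then prefer the parse built from the longest units. The lexicon
    # and the scoring rule are invented for the illustration.
    LEXICON = {"catalog", "cat", "a", "log"}

    def parses(s, lexicon=LEXICON):
        """Yield every way of covering s with lexicon entries."""
        if not s:
            yield []
            return
        for end in range(1, len(s) + 1):
            if s[:end] in lexicon:
                for rest in parses(s[end:], lexicon):
                    yield [s[:end]] + rest

    def winner(s):
        # Squaring unit length biases the competition toward long units.
        return max(parses(s), key=lambda p: sum(len(w) ** 2 for w in p))

    print(winner("catalog"))  # ['catalog'] masks ['cat', 'a', 'log']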

 

For language perception, a masking field of some kind might be imagined that employs a series of levels of learned, category-like units – one level for each linguistically relevant layer: phonemes, morphemes, words, common phrases, etc. The specific units on each level are attractors that are (a) acquired through a learning process, (b) susceptible to change over time, (c) not always discretely different from one another, and (d) capable of manifesting themselves as temporal patterns (including, especially, periodic ones).  Only time will tell whether dynamical models of cognitive behavior in this style will prove capable of underpinning actual linguistic analysis of phonologies and syntax.

 

Section VI: Dynamical Theories of Motor Control

The issue of motor control for speech has been strongly influenced by work on general motor control – especially of the limbs. Since Bernstein (1967), there has been a strong tradition of dynamical models of coordinated movement, including speech. There isn't space here to go into this issue very far, but recent work in the dynamical tradition (Browman & Goldstein, 1995; Saltzman & Munhall, 1989; Guenther, 1995) has shown how many aspects of speech production are naturally interpretable in terms of the dynamical behavior of the neurophysiological system that is the speech apparatus.
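
A classic example from this general motor-control tradition (for limb movement rather than speech, and offered here only as an illustration of the dynamical style of modeling) is the potential function of Haken, Kelso & Bunz (1985) for the relative phase between two rhythmically moving limbs:

    import numpy as np

    # Haken-Kelso-Bunz potential V(phi) = -a*cos(phi) - b*cos(2*phi) for
    # relative phase phi. Anti-phase movement (phi = pi) is a local
    # minimum only while 4b > a; as movement speeds up, b/a falls and
    # only in-phase coordination (phi = 0) remains stable.
    def hkb(phi, a=1.0, b=1.0):
        return -a * np.cos(phi) - b * np.cos(2 * phi)

    phi = np.array([0.0, np.pi])                # in-phase vs. anti-phase
    print("slow (b=1.0):", np.round(hkb(phi, b=1.0), 2))  # both stable
    print("fast (b=0.2):", np.round(hkb(phi, b=0.2), 2))  # anti-phase lost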

 

These dynamical models of motor control and of the perception of compositional patterns are still in their infancy as explicit systems that account for speech and language. They scarcely begin to address most issues of interest to linguists. Nevertheless, they appear able to plausibly account for such properties as:

 

(a) parsing at multiple scales simultaneously, yet delaying perceptual decisions until the necessary information is available, thereby implementing hierarchical structure presented in time,

(b) discreteness of categorical perception in many cases (Liberman et al., 1967; Kuhl & Iverson, 1995),

(c) priming of words even when they are not consciously perceived,

(d) attraction of prominent syllable onsets to periodic patterns and to nested hierarchical structures like feet, phrases, etc. (Cummins & Port, 1998; Tajima & Port, in press), and

(e) the incommensurability of speech sounds from language to language (Logan, Lively & Pisoni, 1991; Port, Al-Ani & Maeda, 1980).

 

These provide a good start toward mechanisms capable, in principle, of dealing with the nearly discrete (and somewhat symbol-like) structures of language, while remaining fundamentally continuous and dynamical and operating in real time.

 

 

Section VII: Conclusions

 

We began this review of the problem of timing and temporal patterns in human speech by first exploring the theoretical constraints regarding the issue of timing that stem from the ubiquitous assumption within linguistics that language is, in fact, a formal symbolic system (and not something that merely approximates a formal symbolic system in many circumstances). This assumption, which seems so obvious to linguists as to scarcely require any justification at all, turns out to have devastating consequences for understanding how timing could play any role in language.

 

The Symbolic Language assumption (a) prevents timing from being visible as a property of human languages (thereby short-circuiting research on the many temporal aspects of phonetics and phonology), (b) forces the highly implausible assumption that phonetics is based on an apriori universal segmental inventory, and (c) prevents exploitation of data on temporal phenomena (such as processing time, reaction time, response latency, etc.), thereby undermining research in psycholinguistics.  Further, (d) it forces the postulation of a sharp boundary between the formal, symbolic, discrete-time domain of language and human cognition (‘competence’) and the continuous, fuzzy, real-time domain of human physiology (‘performance’). This gap has thus far proven unbridgeable and will probably remain so as long as the assumption that language is nothing but a formal symbolic system holds sway within linguistics and biases research in the disciplines that find themselves dependent on aspects of linguistic theory, like phonetics, phonology, psycholinguistics, neurolinguistics, etc.

 

Of course, despite such severe theoretical misdirection, phoneticians and speech scientists have gone ahead and studied temporal phenomena in languages anyway.  They have done so for many reasons, most often not to discover properties of individual languages but in order to develop natural sounding synthetic speech (Klatt, 1976; van Santen, 1996), or practical speech recognition, or to understand and treat disorders of speech production or perception. 

 

In doing research on speech timing, they have found that many acoustic features are manipulated by specific languages in the specification of words, including the risetime of energy onset and the durations of acoustic segments (vowel duration, stop closure duration, fricative or nasal consonant duration, and so on). A variety of specific rules for the temporal implementation of phonological features and segments have been developed to `account for' these patterns, even though rules that specify durations in absolute terms, like milliseconds, are implausible for many reasons.  Still, specifically durational cues for particular phonological features (like postvocalic stop and fricative voicing in many Germanic languages) are now well documented.
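
As one concrete example, the consonant/vowel duration ratio studied by Port and Dalby (1982) can be read as a simple relational decision rule. The sketch below is ours, and the 0.6 threshold is an invented stand-in rather than an empirical estimate:

    # A toy decision rule for postvocalic voicing based on the C/V
    # duration ratio (cf. Port & Dalby, 1982). The threshold is invented.
    def voicing_from_durations(vowel_ms, closure_ms, threshold=0.6):
        return "voiceless" if closure_ms / vowel_ms > threshold else "voiced"

    print(voicing_from_durations(vowel_ms=140, closure_ms=60))  # voiced
    print(voicing_from_durations(vowel_ms=100, closure_ms=90))  # voiceless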

 

In addition, the most convincing cases of nondiscrete sound contrasts, or semicontrasts, find much of their evidence in the time domain. The incomplete neutralization due to the word-final devoicing rule in German, like the incompletely neutralizing flapping rule for American English /t/ and /d/, has consequences that are most clearly observed in the time domain.

 

Furthermore, this research has turned up evidence of some larger scale temporal patterns in human language like the tendency toward regularity of speech timing of stressed syllables in some languages (like English), regularity of the timing of moras (in Japanese), and the possible regularity of intersyllabic intervals (in Chinese or French, although we have not seen good experimental evidence to back up the intuitive feel of such a timing strategy).

 

So, all of these phenomena make the traditional linguistic view that speech sounds are nothing but serially ordered static segments (where the segments are vectors of static distinctive features) highly unlikely.  Timing is apparently intrinsic to human language, not imposed only during output processes.  This should be reassuring in many ways, because it supports a view of language as real psychological behavior, as a cognitive skill, embodied in real human brains, rather than as a static inventory of structures and formal rules in some Platonic idealized space.

 

Of course, the consequence of such a view is that the traditional discrete-time symbol-manipulation processes for symbol strings and trees, so familiar in linguistic thinking for the past 40 years, simply cannot provide the mechanisms that will be required to understand the production and perception of words, phrases and sentences.  An entirely new psychological theory will have to underlie all linguistic thinking. What kind of mechanisms could these be?

 

In fact, theoretical approaches to psychological mechanisms that are compatible with a view of language that takes time and timing seriously have been under development in various laboratories in recent years. These theories are unfamiliar, not only to most linguists, but even to many psychologists.  The most sophisticated systems for language perception and production seem to us to be those developed from the ART model by Steven Grossberg and his colleagues.  These models assume that perceptual information arrives in time (rather than being statically presented all at once); they can delay perceptual commitment when necessary until essential information arrives; they employ temporal information of various kinds and learn intrinsically temporal patterns; and they can learn discrete categories from experience while apparently remaining capable of acquiring nondiscrete units as well. These models are complex and demand mathematical and programming sophistication that is rare within linguistics and even experimental psychology.  Certainly, they still require much further development. But they do suggest the feasibility of continuous-time models capable of dealing with such complex temporal structures as human language.

 

We hope that these phenomena and theoretical considerations will lead to development of a theory of language that is genuinely capable of embodiment in the human body functioning in a physical world.

 

Bibliography

 

Abercrombie, David. (1967). Elements of general phonetics. Chicago: Aldine Pub. Co.

 

Beckman, Mary. (1982). Segment duration and the `mora' in Japanese. Phonetica, 39, 113-135.

 

Bernstein, N. (1967). Coordination and Regulation of Movements. London: Pergamon Press.

Bloomfield, Leonard. (1926). A set of postulates for the science of language. Language, 2, 153-164.

 

Boardman, I., Grossberg, S., Myers, C., and Cohen, M. (1999). Neural dynamics of perceptual order and context effects for variable-rate speech syllables. Perception & Psychophysics, in press.

 

Bolinger, Dwight. (1965). Pitch accent and sentence rhythm. In Isamu Abe and Tetsuya Kanekiyo (Eds.), Forms of English: Accent, Morpheme, Order (pp. 139-180). Cambridge, MA: Harvard University Press.

 

Browman, Catherine and Louis Goldstein. (1995). Dynamics and articulatory phonology. In R. Port and T. van Gelder (Eds.), Mind as Motion: Explorations in the Dynamics of Cognition (pp. 175-194). Cambridge, MA: MIT Press.

 

Chen, Matthew. (1970). Vowel length variation as a function of the voicing of the consonant environment. Phonetica, 22, 129-159.

 

Chomsky, Noam, & Halle, Morris. (1968). The Sound Pattern of English. New York: Harper & Row.

 

Chomsky, Noam. (1965). Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.

 

Clark, Andy. (1997). Being There: Putting Brain, Body, and World Together Again. Cambridge, MA: MIT Press.

 

Classé, Andre. (1939). The Rhythm of English Prose. Oxford: Basil Blackwell.

 

Cohen, Michael A. and Steven Grossberg. (1987). Masking fields: A massively parallel architecture for learning, recognizing and predicting multiple groupings of patterned data. Applied Optics, 26, 1866-1891.

 

Couper-Kuhlen, Elizabeth. (1993). English Speech Rhythm. Pragmatics and Beyond. Amsterdam: John Benjamins.

 

Cummins, Fred. (1997). Rhythmic coordination in English speech: An experimental study. Doctoral dissertation, Indiana University.

 

Cummins, Fred. (1999). Synergetic organization in speech rhythm. In W. Tschacher & J.-P. Dauwalder (Eds.), Dynamics, Synergetics, Autonomous Agents: Nonlinear Systems Approaches to Cognitive Psychology and Cognitive Science. Studies of Nonlinear Phenomena in Life Science, Volume 8 (pp. 256-266). New Jersey: World Scientific.

 

Cummins, Fred & Robert F. Port. (1998). Rhythmic constraints on stress timing in English. Journal of Phonetics, 26, 145-171.

 

Cutler, A., Mehler, J., Norris, D. G., & Segui, J. (1986). The syllable's differing role in the segmentation of French and English. Journal of Memory and Language, 25, 385-400.

 

Cutler, A., & Otake, T. (1994). Mora or phoneme? Further evidence for language-specific listening. Journal of Memory and Language, 33, 824-844.

 

Ejiri, Keiko. (1998). Relationship between rhythmic behavior and canonical babbling in infant vocal development. Phonetica, 55, 226-237.

 

Fodor, J. A. (1975). The Language of Thought. New York : T. Y. Crowell.

 

Fowler, Carol A., Rubin, P., Remez, R., & Turvey, M. (1981). Implications for speech production of a general theory of action. In B. Butterworth (Ed.), Language Production (pp. 373-420). New York: Academic Press.

 

Fox, R. and D. Terbeek. (1977). Dental flaps, vowel duration and rule ordering in American English. Journal of Phonetics, 5, 27-34.

 

Grossberg, Steven. (1980). How does the brain build a cognitive code? Psychological Review, 87, 1-51.

 

Grossberg, Steven. (1986). The adaptive self-organization of serial order in behavior: Speech, language, and motor control. In E. C. Schwab and H. C. Nusbaum (Eds.), Pattern Recognition in Humans and Machines. Vol. 1: Speech Recognition (pp. 187-294). Orlando: Academic Press.

 

Grossberg, Steven. (1995). Neural dynamics of motion perception, recognition learning and spatial attention. In R. Port and T. van Gelder (Eds.), Mind as Motion: Explorations in the Dynamics of Cognition (pp. 449-490). Cambridge, MA: MIT Press.

 

Grossberg, S. and Myers, C. W. (2000). The resonant dynamics of speech perception: Interword integration and duration-dependent backward effects. Psychological Review, in press.

 

Guenther, F. H. (1995). Speech sound acquisition, coarticulation, and rate effects in a neural network model of speech production. Psychological Review, 102(3), 594-621.

 

Haken, H., Kelso, J. A. S., & Bunz, H. (1985). A theoretical model of phase transitions in human hand movements. Biological Cybernetics, 51, 347-356.

 

Halle, Morris, & Stevens, Kenneth N. (1980). A note on laryngeal features. Quarterly Progress Report, Research Lab of Electronics, MIT, 101, 198-213.

 

Han, Mieko S. (1962). The feature of duration in Japanese. Study of Sounds, 10, 65-75.

 

Haugeland, John. (1985). Artificial Intelligence: The Very Idea. Cambridge, MA: Bradford Books, MIT Press.

 

Hayes, Bruce. (1995). Metrical stress theory: Principles and case studies. Chicago: University of Chicago Press.

 

Hockett, C. (1954). Two models of grammatical description. Word, 10, 210-231.

 

Homma, Y. (1981). Durational relationship between Japanese stops and vowels. Journal of Phonetics, 9, 273-281.

Jakobson, R., Fant, G., & Halle, M. (1952). Preliminaries to Speech Analysis: The Distinctive Features and Their Correlates. Cambridge, MA: MIT Press.

 

Jones, D. (1932). An Outline of English Phonetics (3rd edition; 1st edition published 1918). Cambridge: Cambridge University Press.

 

Kelso, J. A. S. (1995). Dynamic Patterns: The Self-organization of Brain and Behavior. Cambridge, MA: MIT Press.

 

Klatt, D. (1976). Linguistic uses of segmental duration in English: Acoustic and perceptual evidence. Journal of the Acoustical Society of America, 59, 1208-1221.

 

Kuhl, Patricia & Paul Iverson. (1995). Linguistic experience and the `perceptual magnet effect'. In W. Strange (Ed.), Speech Perception and Linguistic Experience: Issues in Cross-Language Research (pp. 121-154). Baltimore: York Press.

 

Leary, A. and C. Muench. (1998). Attractor dynamics in spoken English: Evidence from list reading. Chicago Linguistics Society 35: The Panels. ChiPhon '99 New Syntheses: Multi-Disciplinary Approaches to Basic Units of Speech. Chicago, IL: CLS.

 

Lehiste, Ilse. (1970). Suprasegmentals. Cambridge, MA: MIT Press.

 

Liberman, A. M., F. Cooper, D. Shankweiler and M. Studdert-Kennedy. (1967). Perception of the speech code. Psychological Review, 74, 431-461.

 

Liberman, M., & Prince, A. (1977). On stress and linguistic rhythm. Linguistic Inquiry, 8, 249-336.

 

Lisker, Leigh, and Arthur Abramson. (1971). Distinctive features and laryngeal control. Language, 47, 767-785.

 

Lisker, Leigh. (1985). Rabid vs. rapid: A catalogue of cues. Haskins Laboratories Status Report on Speech Research.

 

Logan, John, Scott Lively, and David Pisoni. (1991). Training the perception of /r/ and /l/: First report. Journal of the Acoustical Society of America, 89, 874-886.

 

Manaster Ramer, Alexis. (1996). A letter from an incompletely neutral phonologist. Journal of Phonetics, 24, 477-489.

 

Newell, Allen, & Herbert Simon. (1976). Computer science as empirical inquiry: Symbols and search. Communications of the ACM, 19, 113-126.

 

Otake, T., G. Hatano, A. Cutler & J. Mehler. (1993). Mora or syllable? Speech segmentation in Japanese. Journal of Memory and Language, 32, 358-378.

 

Peterson, Gordon E., and Ilse Lehiste. (1960). Duration of syllable nuclei in English. Journal of the Acoustical Society of America, 32, 693-703.

 

Pind, J. (1995). Speaking rate, VOT and quantity: The search for higher-order invariants for two Icelandic speech cues. Perception & Psychophysics, 57, 291-304.

 

Pike, Kenneth Lee. (1945). The Intonation of American English. Ann Arbor: University of Michigan Press.

 

Pointon, G. (1980). Is Spanish really syllable-timed? Journal of Phonetics, 8, 293-305.

 

Port, Robert. (1981). Linguistic timing factors in combination. Journal of the Acoustical Society of America, 69, 262-274.

 

Port, Robert. (1996). The discreteness of phonetic elements and formal linguistics: Response to A. Manaster Ramer. Journal of Phonetics, 24, 491-511.

 

Port, Robert F., Salman Al-Ani, and Shosaku Maeda. (1980). Temporal compensation and universal phonetics. Phonetica, 37, 235-252.

 

Port, Robert F., Fred Cummins, and J. Devin McAuley. (1995). Naive time, temporal patterns and human audition. In Robert F. Port and Timothy van Gelder (Eds.), Mind as Motion: Explorations in the Dynamics of Cognition. Cambridge, MA: MIT Press.

 

Port, Robert and Penny Crawford. (1989). Pragmatic effects on neutralization rules. Journal of Phonetics, 16, 257-282.

 

Port, Robert, and Jonathan Dalby. (1982). C/V ratio as a cue for voicing in English. Perception & Psychophysics, 32, 141-152.

 

Port, Robert F., Jonathan Dalby, and Michael O'Dell. (1987). Evidence for mora timing in Japanese. Journal of the Acoustical Society of America, 81(5), 1574-1585.

 

Port, Robert F., and Fares Mousa Mitleb. (1983). Segmental features and implementation of English by Arabic speakers. Journal of Phonetics, 11, 219-229.

 

Port, Robert, and Michael O'Dell. (1985). Neutralization of syllable-final voicing in German. Journal of Phonetics, 13, 455-471.

 

Port, Robert, Keiichi Tajima, & Fred Cummins. (1999). Speech and rhythmic behavior. In Geert J. P. Savelsburgh, Han van der Maas, and Paul C. L. van Geert (Eds.), Non-linear Developmental Processes (pp. 53-78). Amsterdam: Elsevier.

 

Port, Robert & Timothy van Gelder (Eds.). (1995). Mind as Motion: Explorations in the Dynamics of Cognition. Cambridge, MA: Bradford Books/MIT Press.

 

Saltzman, Elliot and Kevin Munhall. (1989). A dynamical approach to gestural patterning in speech production. Ecological Psychology, 1, 333-382.

 

de Saussure, Ferdinand. (1916). Cours de linguistique générale. C. Bally & A. Sechehaye (Eds.). Paris.

 

Scheutz, Matthias (1999) The Missing Link : Implementation and Realization of Computations in Computer and Cognitive Science.  Unpublished dissertation, Cognitive Science and Computer Science, Indiana University.

 

Selkirk, Elisabeth. (1982). The syllable. In Harry van der Hulst and Norval Smith (Eds.), The Structure of Phonological Representations, Part 2 (pp. 337-383). Dordrecht: Foris Publications.

 

Tajima, Keiichi and Robert Port. (1999). Speech rhythm in English and Japanese. In John Local (Ed.), Papers in Laboratory Phonology VI. Cambridge: Cambridge University Press.

 

Tajima, Keiichi, Bushra A. Zawaydeh, and Mafuyu Kitahara. (1999). A comparative study of speech rhythm in Arabic, English, and Japanese. In Proceedings of the XIVth International Congress of Phonetic Sciences (pp. 285-288). San Francisco, CA.

 

Tajima, Keiichi. (1998). Speech Rhythm in English and Japanese: Experiments in Speech Cycling. Doctoral dissertation, Indiana University.

 

Treffner, Paul and Michael T. Turvey. (1993). Resonance constraints on rhythmic movement. Journal of Experimental Psychology: Human Perception and Performance, 19, 1221-1237.

 

van Gelder, Timothy, & Robert Port. (1995). It's about time: Overview of the dynamical approach to cognition. In Robert Port and Timothy van Gelder (Eds.), Mind as Motion: Explorations in the Dynamics of Cognition (pp. 1-43). Cambridge, MA: Bradford Books/MIT Press.

 

van Santen, J. P. H. (1996). Segmental duration and speech timing. In Yoshinori Sagisaka, Nick Campbell, & Norio Higuchi (Eds.), Computing Prosody: Computational Models for Processing Spontaneous Speech. New York: Springer Verlag.

 

Wenk, B. J. & Wioland, F. (1982). Is French really syllable-timed? Journal of Phonetics, 10, 193-216.

 

 

 



[1] The term dynamic can be used in many ways. Here, we mean a system whose quantitative state variables change over continuous time according to some rule or law.