Speech Timing and Linguistic Theory
by
Robert F. Port and Adam P. Leary
Department of Linguistics, Indiana University
Bloomington, Indiana, 47405
port@indiana.edu, adamlear@indiana.edu
July 05, 2000
It
is argued that the traditional computational or symbolic view of
cognition has little choice but to assume an a priori phonetic space of
discrete, static and serially ordered atomic symbols, as assumed
explicitly by Chomsky and Halle (1968) and others.
It is then shown that in several well-supported cases
these assumptions are false. First, there is a
durational pattern observed in the English voicing contrast in
syllable-coda position (e.g., lab/lap) where evidence shows that
the relative duration of two intervals (the vowel duration and
following stop or fricative duration) is a fundamental cue for the
value of the voicing feature. Since this durational ratio cannot
plausibly be assigned to a universal property of phonetic
implementation, the durational ratio must be described as a property of
English phonetics or phonology in violation of the Chomsky-Halle
assumption of universal, static phonetic features.
Second, the incomplete neutralization of the voicing contrast
in syllable-final position in German (where Bunde-bunte become Bund-bunt
with near-but-not-perfect neutralization to the voiceless category)
shows that the discreteness property may also fail to hold.
These results imply that language processing is achieved by a
system that may prefer discrete, symbol-like units, but does not
require them. Recent
developments in models of speech perception and production processes
dependent on dynamical systems, such as those of Grossberg and
colleagues, exhibit the appropriate characteristics to serve as a
psychology upon which a psychologically sound linguistic theory can be
constructed that does not require assuming that human language is an
instance of a mathematical or computational system.
Section 1: The problem of time in language.
No
one would deny that speech is produced in time, that is, that the
sentences, words, consonants and vowels of human language are always
extended in time when they are uttered.
Still, many - and probably most - linguists would argue that
temporal extension is not an intrinsic property of natural language and
that the temporal patterns of language (other than those representable
in terms of serial order) should not be expected to be relevant or
revealing about language itself. This
might surprise many nonlinguists, but
it
reflects a fundamental and far-reaching assumption about language and
even about general human cognition.
Linguists tend to assume that the temporal layout of speech is
a property that is imposed on language from the outside at the point
where the logically static and serially ordered structures of the
language itself are performed by the human body. That is, linguists
typically assume that linguistic competence contains only static
structures. On this view, the
tree-shaped patterns graphed on the pages of linguistics journals and
serial lines of printed text like the ones you are reading on this page
are, in essential respects, good models of our actual cognitive
representations. The cognitive form of language is said to have
serially ordered discrete words composed from a small inventory of
meaningless sound-related segments, just like a printed page.
And a tree diagram represents the same logic as the phrase
structures of any sentence.
These
cognitive symbol strings may be `implemented' in time by the linguistic
`performance' system if and when linguistic structures happen to be
spoken. From the standard linguistic perspective it is the serial and
hierarchical structure of language that reveals its true form. One
might say that speech
is language plus the filtering effects of the performance system
that maps language into speech. From
the traditional point of view, speech performance is thus derivative.
It is merely one possible `output mode': it is just one of several
ways (along with writing) to get language from the mind out into the
body and the world. Speech,
on this view, just happens to impose time on the fundamentally
nontemporal structure of the language itself.
This
point of view is basic to all 20th century structuralist
views of language (de Saussure, 1916; Bloomfield, 1926; Hockett, 1954)
including the Chomskyan paradigm (Chomsky, 1965; Chomsky & Halle, 1968).
It assumes a fundamental distinction between the Formal and the
Physical. On one hand, there is a Formal (or mental) World - the
Competence World, where the serial order of hierarchies of timeless
symbols provides the data structures of natural language.
Formal operations apply to these data structures sequentially
just as they apply in a derivation in formal logic or mathematics.
Given the formal nature of the structures involved, any time
interval that might happen to be required for the operations to take
place is merely epiphenomenal. It is not directly relevant to the
formal operations themselves. And on the other hand, there is a
Physical World of brains and bodies belonging to speakers and listeners.
From the traditional perspective, the time-free structures of
language are actually `implemented' and processed in time (see
Scheutz, 1999 for further discussion of implementation). From the
perspective of traditional linguistics, such implementation processes
may hold some interest, but they are in no way the natural home of
human language and, in any case, are not directly relevant to it.
We
believe this point of view is deeply mistaken. Why? Although there are
many reasons (see Port & van Gelder, 1995), we will discuss just
two. The first is that the
dichotomy of competence and performance, of mind and body, of formal
and physical, creates a gulf that, once postulated, turns out to be
impossible to span using the methods of empirical science.
This is surely one reason why linguists are not generally eager
to study the methods or results of experimental psychology, speech
science, neuroscience or even physics (since these time-dependent
fields are so difficult to reconcile with language as a pure symbol
system). They seem irrelevant. And correspondingly, this is why
scientists from other disciplines frequently have difficulty
understanding what linguists are doing.
Disciplines like neuroscience and much of cognitive psychology
(though not all) lie across the formalist gulf from linguistics.
Thus far, no satisfactory way to bridge this conceptual gap has
been found. If one assumes
that real cognitive events do not take place in space and time
and that real physical events do, there is no obvious way to get
them together.
To
appreciate this problem, the first step is to review some familiar
properties of a symbolic
system.
Although linguists, for example, assume the uniformly symbolic nature
of language at all levels, from phonetic segments to
phonological units, morphemes, words, phrases and sentences, less
attention has been paid to what properties a symbol token must exhibit.
In western science, symbols are employed in at least
three distinct domains: in mathematical reasoning, in the
operation of computer hardware, and in a theory of cognition.
Thus, for various types of mathematical reasoning, logic uses tokens
like p and q, and mathematics
might use x and integers.
In formal reasoning (like doing logical proofs or long
division, or writing a computer program, etc.), operations are performed
on symbolic structures as executed by trained human thinkers.
When the steps are difficult, and throughout the training of
practitioners, steps in a formal reasoning
process are typically supported by body-external props. That is,
extended formal reasoning depends on external `scaffolding' (see
Clark, 1997), such as writing physical symbol tokens on paper (and,
very recently, using the support of programs running on digital or
analog computers). In
computer hardware, formal methods are automated in general terms by the
use of symbol tokens coded into bits in a digital computer. The third
domain for the employment of symbolic theory lies in a particular view
of various cognitive operations involved in human language and human
reasoning (Chomsky, 1965; Fodor, 1975; Newell & Simon, 1972). The
symbol tokens in language (and probably in general cognition) are the
words and sound components of a particular language.
As clarified by Haugeland (1985), in order to function as advertised, the symbol
tokens in any symbolic system must be, first of all, digital,
that is, discretely distinct from each other and reliably recognizable
by the available computational or cognitive equipment. This is
essential in order for the computational mechanisms to manipulate the
symbols infallibly during processing.
Second, the symbols used must be either a priori
or composed from a priori components.
Some set of a priori units must be available at the time
of origin of a symbolic system from which all further symbol structures
are constructed. In the case of logic or mathematics, an initial set of
specific units is simply postulated (e.g., the integers, proposition p or
the initial symbol S). In computing, the bytes execute
operations in discrete time, but the units and primitive operations
were engineered into the hardware itself, and are thus obviously a priori.
Human language and everyday reasoning are more problematic
since we don't know in advance what the primitive units are. Indeed, the
discovery of the list of innate primitive units is one of the
primary targets of research in modern linguistics (e.g., Chomsky,
1965). (The discrete time assumption is almost never discussed.) Most
often linguists assume that the initial list of primitives includes at
least units like [Vowel], [Consonant], [Noun], [Past Tense], [Sentence]
and so forth.
The
third property of symbols, although one that Haugeland did not say much
about, is that they must be static.
Since symbolic or computational models always function in
discrete time, it must be the case that at each relevant discrete time
point (that is, at each tick of the discrete-time clock), all relevant
symbolic information is available. For example, if a rule is to apply
that converts apical stops into a flap, then there must be some time
point at which the features that figure in the rule, [+stop], [+voice],
[+apical] etc. are all fully represented and either are holding steady
or somehow are constrained to synchronize with each other while the
rule applies in a single step of discrete time. Thus, properties in a
symbolic system cannot unfold gradually in continuous time but must, at
the relevant clock tick, have some discrete symbolic value.
(Of course, there is nothing to prevent simulation of
continuous time with discrete sampling, but this is not what the
symbolic, or computational, hypothesis about language claims.)
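The static requirement can be made concrete with a minimal sketch. The representation below (segments as dictionaries of feature values, with None standing for a feature that has not yet reached a definite value) and the function names are our own illustrative assumptions, not part of any published formalism, and the conditioning environment of the flapping rule is deliberately omitted.

```python
# Hypothetical feature names; None marks a value that is "still unfolding".
RULE_FEATURES = ("stop", "voice", "apical")

def apply_flapping(segment):
    """Rewrite an apical stop as a flap within a single discrete step."""
    # Every feature the rule mentions must hold a definite value at this tick.
    if any(segment.get(name) is None for name in RULE_FEATURES):
        raise ValueError("feature not yet specified at this clock tick")
    # [voice] must be specified but is not otherwise consulted in this sketch.
    if segment["stop"] and segment["apical"]:
        return {**segment, "stop": False, "flap": True}
    return dict(segment)

def step(string):
    """One tick of the discrete clock: the rule sees fully valued segments only."""
    return [apply_flapping(seg) for seg in string]

t = {"stop": True, "voice": False, "apical": True}
p = {"stop": True, "voice": False, "apical": False}
print(step([t, p]))
```

In this sketch a segment whose features are only partly specified cannot be processed at all, which is the sense in which a symbolic rule presupposes that everything relevant is settled at the clock tick.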
Now,
how can symbolic units with these properties be naturalized and sought
in a human brain? Formal symbols assume some very nonbiological
properties. It is one thing for humans to manipulate such units in a
deliberative way leaning on the support of paper and pencil so each
step can be written down and checked for accuracy, and for computers to
employ specialized discrete-time hardware to process symbolic
structures. But it is
another matter altogether to casually assume that real formal symbol
structures are actually processed in a discretized version of real time
by human brains. The problem is that if we study language as a facet of
actual physical human beings (rather than as a particular instance of a
discrete-time `Platonic heaven') then its processes and its
products must have some location and extent in real time and real space.
Truly natural language must be visible and accessible to
scientific research methods that investigate events in space and time,
that is, real events in real time. If
there is temporally discrete behavior of the human brain (which is
certainly quite possible), then we must surely study this empirical
phenomenon directly by gathering our data in continuous time in
order to discover just where temporal discreteness can be observed and
how the discrete-time performance is achieved.
Simply assuming there is a sharp a priori divide between
language as a uniformly serial-time structure and speech as a real-time
event (as claimed in Chomsky's competence/performance
distinction) is a very risky assumption to make.
And, in our view, there is a fair amount of evidence that it is
simply false.
The
second reason for rejecting the view that language is essentially
formal is that it seems clear that, from a biological viewpoint,
language is fundamentally and essentially a spoken medium, not a
written one. Contrary to the practice of linguistic theory, it is the
written and orthographic versions of language that are derivative from
speech. It is especially
in written language that symbol-like characteristics such as
near-discreteness, timelessness and closed inventories of symbol
tokens are most pronounced. Yet written (and edited)
language is based upon historically recent, culture-dependent methods
(dating back only a few thousand years), using cognitive processes that
are partly dependent on literacy, logic and mathematical
generalization. Even today a large part of the human population is not
literate, and only a minute fraction could appreciate the meaning of a
diagram of the structure of a sentence or a syllable even if one spent
some time explaining these images to them.
These
seem to us to be real problems for the traditional view, ones
that cannot simply be brushed off with assertions like `We don't
really know much about how the brain works anyway.'
For example, if every human utterance exists only by being
built from clearly discrete building blocks, that is, if linguistic
expressions are always discrete structures assembled from linguistic
atoms (rather like the words and letters on this page are composed from
inventories of very limited size), then why isn't it always similarly
transparent to speakers (as well as to linguists) what the data
structures actually are for any utterance in any language?
Of
course, we agree that in many specific cases it is intuitively
very clear to speakers what the components are. For an English sentence
like `Don kicked that', the morphemes are transparently just the
set Don, kick, -ed and that.
And each of the phonemes that spell the cognitive
representation of these morphemes seems to be a simple, atom-like sound
unit [dankikt...].
Furthermore, at the syntactic level as well, there is a fairly clear
discrete phrase structure that represents [Don] as a unit in
parallel with [kicked that] (serving as subject and predicate
respectively), and we intuitively appreciate that the bond between [kick]
and [ed] is much tighter than the bond between [kicked]
and [that].
But
every linguist also knows that very often it is not obvious how
many parts there are in a stretch of speech nor exactly what the parts
are. Thus, it's not clear how many morphemes there are in such a simple
word as `strawberry', or how many phonological parts there are
in `chive'. The
piece `-berry' looks like an obvious morpheme, but what about `straw-'?
It doesn't seem to be related semantically to `wheat straw' or `soda
straw' nor to
contribute any specific meaning except as a kind of diacritic on `-berry'
(to keep `strawberry' distinct from `raspberry' and
`elderberry'). But if `straw-'
is only a diacritic, then why chop off `-berry' in the first place?
Perhaps `strawberry' should be monomorphemic
(and listed independently in the cognitive lexicon) despite the
fact that `berry' also occurs both as a single word and as part of
other words with related meanings.
But this solution ignores the generalization that the `berry'
in `strawberry' looks like, and means roughly the same as, the
isolated word `berry'. So should this redundancy be avoided? Well,
it depends on the analyst's theoretical assumptions about
grammars, of course.
And
is `chive' made of 3 cognitive sound units, or 4 or 5 (since the
initial consonant and the medial vowel have either two distinct parts
or smooth gliding motions both acoustically and articulatorily)? Such
problem cases occur frequently in every language.
It is often very difficult to parse phrases, isolate the
segmental phonemes, the morphemes and other units from spoken or
written language. In these
challenging but very typical cases the morphological analysis is not
obvious at all. Instead, linguists must bring some theoretical
assumptions to bear in order to justify one analysis over others. Two
of the very first assumptions seem to be the Symbolic Language
hypothesis (SL) and the Atomic Inventory hypothesis.
1. The Symbolic Language Hypothesis (SL):
Every utterance in every language is constructed entirely of discrete,
timeless symbols. Complex units are composed
of combinations of simpler units in a series of levels.
2. The Atomic Inventory Hypothesis (AI):
All languages select their phonological (and other primitive) symbols
from a universal discrete set, a set that is not very large.
All
significant differences within the sound systems of any language as
well as all differences between languages are supposed to be
representable in this alphabet of symbols. Languages, it is claimed,
can never differ in the continuous-valued implementation of these
minimal symbols. If they
could, then languages would not be entirely symbolic.
(Notice that English orthography, like most other alphabetic
orthographies, explicitly embodies a related assumption since all the
words in correct sentences are constructed from a set of 26 letters and
the words are supposed to be selected from a standard dictionary of
vocabulary items. Much
more generally, linguists assume these properties are true of spoken
language as well.) For
spoken language the discreteness of phonological atoms, that is, of the
segmental distinctive feature vectors, guarantees the discreteness of
all other linguistic units spelled from them.
Of
course, it is possible that the SL Hypothesis is correct, but what if
it turns out to be partly false? What if words and other apparent
linguistic units are sometimes only nondiscretely different from
each other (unlike printed letters)? And what if linguistic units like
words and phonemes are not always timeless static objects, but turn out
to be necessarily, essentially, events in time? If (a) genuine category
nondiscreteness exists, or if (b) there exist any linguistic structures
that are essentially temporal (as opposed to merely `implementationally
temporal'), then the bold symbolic assumption would be seriously
compromised. Certainly the
SL hypothesis would not seem to be so obviously correct that one would
feel justified in clinging to it as a reliable guide whenever
descriptive problems get murky.
Instead,
we propose that the first steps should be taken toward a new discipline
of linguistics, which we might call `Embodied Linguistics'
(even while risking scorn for such a trendy term!). Step one
must be to naturalize language and fit it into a human body, that is,
first of all, to cast it into the realm of space
and time.
The first step in doing this is to change our focus of attention from
the study of linguistic knowledge (normally conceptualized as
static and symbolic) toward the study of linguistic behavior and
performance, since behavior and performances exist in time.
We take this step, not because of assumptions about learning
(as the behaviorists did; Bloomfield, 1926), but simply because we
need temporal information, that is, timing data, to discover how
the whole system really works. The
cognitive system is assumed to run in time and can only be understood
that way. The SL assumption deprives us of temporal data about human
language and cognition. Only
through the study of specific language performances under many
conditions can accurate understanding of the cognitive form of language
arise. Indeed, we will
present evidence in this essay both of the nondiscreteness of
particular linguistic patterns and of one
language-specific pattern that is essentially temporal.
These results clearly violate the Symbolic Language hypothesis
and the Atomic Inventory hypothesis and support a view of language as
something that is essentially embodied. Let's turn to the evidence.
Speech timing research in general.
Research
on speech production and perception has shown from the earliest era in
the mid-1950s that manipulation of aspects of speech timing could
influence listeners' perceptual experiences.
These aspects include the slope of energy onset in fricatives,
the duration of acoustic intervals corresponding to consonant closures
and vowels, the slope of formant frequencies and so on (see reviews in
Lehiste, 1970; Klatt, 1976). For
example, chopping off the very front of the [ʃ]
in `shore' with a waveform editor, so that the noise onset
is very abrupt rather than gradual, creates the perceptual experience
of an affricate, as in `chore'. Shortening the duration of a
final [s] in `hiss' (keeping the offset of noise intensity
gradual by editing from the middle of the [s]) can create the
perceptual experience of `his' for English speakers. (Klatt,
1976, reviewed many of these perceptual phenomena.) Of course, for
languages with long and short vowels, like Hungarian [tør] tör
`break' vs. [tø:r] tőr
`dagger', if you lengthen a short one, it sounds like a long
vowel and if you shorten a long one, it sounds like the word with a
[short] vowel (for Hungarian listeners).
Clearly listeners in these cases are using some kind of
measurement of onset slope and fricative or vowel duration extracted on
the fly to make these judgments. Similarly,
speakers are controlling these properties during their production.
Do these phenomena pose a challenge to the traditional theory?
How could the traditionalists maintain that serial order is all
that is linguistically relevant about the time axis?
Their answer is that there is both linguistic Competence,
the cognitive symbolic system, and Performance, the system of
psychological and physical constraints.
Traditionalists
argue that sensitivity to these temporal patterns, even if they are
exploited by listeners in speech perception, is not evidence
about the language itself (Jakobson, Fant, & Halle, 1952; Halle
& Stevens, 1980; Chomsky & Halle, 1968, pp. and see Lisker
& Abramson, 1971). Thus,
in its symbolic form, the Hungarian words tör vs. tőr,
for example, might differ in either the number of segments (that is, 1
/ø/ vs. 2 /ø/s) or possibly in whether a single vowel
segment has the feature [+long] or not (that is, [ø, -long]
vs. [ø, +long]). The actual implementation in time is
interpreted to be a consequence of universal processes of
interpretation of the symbol external to the linguistic description per
se. This is a reasonable
story. However, notice that
this view still rests on an important theoretical claim about the
phonetic implementation, one that may prove vulnerable.
The traditional account of speech timing is theoretically
tenable only given the assumption that the phonetic implementation
processes are universal. Only if phonetic implementation of discrete
phonetic symbols works the same way for all languages could it be also
true that linguistic utterances are composed entirely of symbols and
differ from each other only in symbol-sized steps.
Since it is assumed languages cannot differ from each other in
their phonetic implementation methods, the phonetic implementation must
be universal.
The
phonology is supposed to embody the language-specific properties of
speech, while the phonetic inventory and its implementation are
universal. It has to be, if the phonetic space is to include all the
phonetic capabilities of the human species.
The fact that every language depends on the same phonetic
inventory provides a key part of the Chomskyan account of why children
can learn language so quickly (since the universal phonetic
representation supports their ability to listen to and imitate their
mothers' speech). This universal space is defined by a specific
inventory of symbolic phonetic units that have serial order as their
only version of the time scale (Chomsky & Halle, 1968).
Now
what if it could be shown that what distinguishes a pair of words in
some language (or a class of words that share a feature contrast) is intrinsically
a temporal property, rather than something intrinsically
nontemporal that happens to be implemented in time?
We have seen that the simple fact of differing in duration is
not sufficient evidence since implementation rules might account for
the duration effect. But imagine a case where two sound classes differ
from each other in some particular durational relationship, such that the
ratio of the duration of some acoustic segment (or articulatory segment)
to that of some adjacent acoustic segment has to be, let us suppose, near some
particular value.
If such a case could be found and if it could also be shown, in
addition, that the temporal ratio shows no indication of being an
implementational universal, then such a case would be strong evidence
that temporal constraints can be intrinsic to specific languages. A
static, durationless symbol would serve as a very poor model to
describe such a situation. Most importantly, it would call into
question the conventional isolation of time from linguistic cognition
and would imply that the phonological grammars of particular languages
incorporate some temporal properties.
Section II. Germanic Voicing (or Tensity) Contrast: Temporal Aspects.
One
of the best examples of such a language-specific temporal pattern used
for phonological purposes is the contrast in English among stops and
fricatives between those transcribed with /b, d, g, z, / and
those transcribed with /p, t, k, s, /.
That is, we focus on a `feature' or a class of contrasts in the
coda position of a syllable (that is, near the end of a syllable after
the main vowel). In English, pairs of words like `lab-lap,
build-built' and `rabid-rapid' contrast in this feature, as
do German Bunde-bunte (club-Plur, colorful-Nom., Sing.),
Swedish bred-brett (broad-Basic, broad-Neuter) and
Icelandic baka-bakka (to bake, burden-Acc). This
type of pattern was probably a characteristic of the ancient Germanic
proto-language of 2-3 thousand years BP and has been inherited in
somewhat different form by most modern Germanic languages.
The contrast is between pairs of stops and fricatives, like
/z-s, b-p, d-t/, etc. at the end of a syllable or between syllables.
Because of the importance of several different articulatory and
acoustic correlates of this feature, there has been some dispute over
the years as to the proper characterization of the difference - whether
it is essentially a feature of `Voicing' (that is, glottal closure and
pulsing) or some complex of other factors usually summarized in the
term `Tensity', where /s, p, t, / etc. are [+tense] and
/z, b, d, / etc. are [-tense] or [lax].
It
is clear that one characteristic of this contrast is that it depends
significantly on timing to maintain the distinction.
In particular, in words with voiceless consonants, like
English `lap' and `rapid', the preceding vowel is shorter
and the stop closure is longer
than in the corresponding words with /b/ (e.g., `lab, rabid')
(Peterson & Lehiste, 1960; Lisker, 1985; Port, 1981).
Of course, since speakers typically talk at different speaking
rates, the absolute durations of vowels and consonants are highly
variable. It is not the case that absolute durational
values (e.g., in milliseconds) are employed by listeners (since in that
case they would both produce and perceive
more /p/s and /s/s at slow rates and more /b/s and /z/s at faster rates,
but they do not). Instead,
if all other properties are kept constant, the ratio of durations tends
to be much less variable than the absolute durations in ms, as
shown in Figure 1.
Perceptual
experiments with synthetically constructed speech or experimentally
manipulated natural speech confirm that it is the relative durations
that determine judgments between minimal pairs like /lab-lap/ and
/rabid-rapid/ whenever other cues to the `tensity' feature are
ambiguous (e.g., Lisker, 1985; Port & Dalby, 1982).
Port has called this relationship the `V/C ratio': the
duration of a vowel relative to the following obstruent constriction
duration. This ratio is
smaller for the [-voice]
(or [+tense]) consonants than for the corresponding [+voice] (or [-tense])
consonants. The V/C ratio
is relatively (though not completely) invariant across changes in
speaking rate, syllable stress and segmental context (Port, 1981, Port
& Dalby, 1982).
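The contrast between absolute and relative duration can be illustrated with a small sketch. The durations below are invented for illustration only (they are not measurements from any of the studies cited), and a change of speaking rate is idealized as a uniform stretching of both intervals, which is stronger than the `relatively invariant' pattern actually reported.

```python
# Hypothetical (vowel_ms, closure_ms) values at a moderate rate; not measured data.
BASE = {"lab": (160.0, 80.0), "lap": (110.0, 110.0)}

def vc_ratio(vowel_ms, closure_ms):
    """Vowel duration relative to the following constriction duration."""
    return vowel_ms / closure_ms

def at_rate(word, stretch):
    """Idealized rate change: stretch or compress both intervals equally."""
    vowel_ms, closure_ms = BASE[word]
    return vowel_ms * stretch, closure_ms * stretch

for stretch in (0.7, 1.0, 1.5):
    for word in BASE:
        v, c = at_rate(word, stretch)
        print(f"{word} x{stretch}: V={v:.0f} ms  C={c:.0f} ms  V/C={vc_ratio(v, c):.2f}")
```

With these made-up numbers the vowel of slowly spoken `lap' is longer in milliseconds than the vowel of `lab' at the moderate rate, so no fixed absolute criterion separates the two words, while the ratio keeps them apart at every rate.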

Figure 1. Stimuli and results from the
Port & Dalby (1982) study of consonant and vowel timing as cues for
voicing in English. The top panel shows sound spectrograms of two of
the synthetic stimuli employed. These show the shortest
(140 ms) and longest (260 ms) vowel durations for `dib'. For each vowel
duration step, nine
different silent medial-stop closure durations were used (= total of 40
stimuli). Subjects were asked to identify them as `dibber' or `dipper'.
The lower panels show the percent identification as `dibber' as a
function of medial stop closure duration (on the left) and
(on the right) as a function of consonant/vowel ratio (where the
stop duration of each stimulus is divided by its vowel duration).
Looking at the perceptual boundary
(50%-50% identification), note the much tighter clustering in
the right panel around a C/V ratio of .35.
In
several other Germanic languages, similar measurements of speech
production timing and perceptual experiments
using manipulations of V and C durations have shown that listeners pay
attention especially to the relative duration of a vowel and the
constriction duration of a following obstruent, that is, stop or
fricative (Port & Mitleb, 1983; Pind, 1995).
So apparently this durational pattern is important for the
production and perception of these Germanic languages.
Could the V/C Durational Ratio be a Temporal Universal?
The
first thing to check when a durational invariant is found is
whether this timing difference reflects a universal and unavoidable
articulatory correlate of a contrast in, say, glottal pulsing during a
consonant constriction. If
this turned out to be the case, it would support the standard view that
the phonology, specified in terms of patterns of universal phonetic
features, is the locus of all differences between languages. `The
phonetic capabilities of man', said Chomsky and Halle, are an inventory
fixed for, at least, historical time. This species-wide alphabet is
said to facilitate rapid language acquisition by infants and to assure
that speakers of different languages can learn each other's language
(given enough time). So
for this temporal pattern to show evidence of being universal, either
in association with some segmental feature or not, would fit the view
that it is only in the choice and deployment of static features that
languages may differ from one another.
Any temporal differences that might show up can only occur as
the result of the speech production or perception apparatus, which is,
by hypothesis, a universal.
Phonetic Implementation.
To deal with these phenomena for speech production within the
symbolic view of language, one might postulate an implementation system
(outside language) that takes a (universal) alphabet of phonetic
symbols as input and translates them into output gestures with
particular temporal constraints. Individual features will differ in
their effects on the temporal behavior of the output device.
In this case, the implementation rule would have to assure that
the ratio of the vowel duration to the following consonant constriction
is about 1.5 or greater for the +Voice case (so the vowel is quite a
bit longer than the consonant constriction), and closer to 1 for the
-Voice case (so the V and C are about the same duration).
Accounts along this line have been proposed for similar
phenomena by, for example, Chomsky and Halle (1968, Chapter 10), Halle
and Stevens (1980), Klatt (1976), Port (1981) and others.
To
see how such implementation systems might work, we will look more
closely at Klatt's (1976) model (also Port, 1981).
Klatt observed, first, that although many factors might
influence the duration of a vowel, the effects of each factor depend in
part on how many others apply at the same time.
For example, comparing the duration of the vowel in rabid
vs. rapid (where the voicing feature will make the vowel
shorter in rapid) with a similar difference in lab vs. lap,
the vowel might be 12% longer in rabid than rapid but 18%
longer in lab than lap.
The reason for the difference is that the vowel is overall
longer in both lab and lap than in rabid and rapid
due to the presence of an additional syllable in the rabid-rapid
pair. This nonlinearity
led Klatt to propose that each vowel might have a minimum duration
beyond which it could not be compressed. Then he applied constant ratio
(that is, linear) timing rules to the remainder.
Thus each vowel had an `inherent duration' (for
[æ], let's suppose, 100 ms), observed from monosyllabic words
with a final voiced stop (as in lab or lag) spoken in
isolation, and also a minimum duration, estimated as 65% of the
inherent duration. Then constant ratio rules would lengthen or shorten
the remaining interval (here 35 ms) between the inherent duration (here
100 ms) and the minimum duration. Thus,
to implement the vowel duration in, say, lacky, we take the 35
ms (inherent minus minimum) and multiply by 0.6 (to shorten for the
following voiceless stop) and then by 0.7 (to shorten for the second
syllable). Then we take the
resulting 14.7 ms (= 35*0.6*0.7) and add it to the minimum duration.
This gives a target duration for [æ] in lacky or
backing or rapid of 79.7 ms (± noise). By this method,
every segment either has its `inherent duration' or one that is
derived by such temporal implementation rules depending on the context.
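The arithmetic of this rule can be sketched in a few lines of Python. This is only an illustration of the calculation described above, not Klatt's own formulation: the function name klatt_duration and its parameter names are hypothetical, and the numbers (100 ms inherent duration, 65% minimum, factors 0.6 and 0.7) are the ones supposed in the text.

```python
from math import prod


def klatt_duration(inherent_ms, shortening_factors, min_fraction=0.65):
    """Target duration under a Klatt-style rule (illustrative sketch).

    Only the interval between the minimum and the inherent duration
    is scaled by the context factors; the minimum is incompressible.
    """
    minimum_ms = min_fraction * inherent_ms       # e.g. 65 ms for a 100 ms vowel
    compressible_ms = inherent_ms - minimum_ms    # e.g. 35 ms
    return minimum_ms + compressible_ms * prod(shortening_factors)


# [ae] in "lacky": 0.6 for the following voiceless stop,
# 0.7 for the additional syllable.
lacky_ms = klatt_duration(100.0, [0.6, 0.7])
print(round(lacky_ms, 1))  # 79.7
```

With no context factors the function returns the inherent duration itself, which corresponds to the isolated monosyllable with a final voiced stop.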
Leaving
aside the issue of how accurately this way of specifying the rules
works (of course, there are several other proposals for how to write
these rules, e.g., van Santen, 1996), there are many reasons why this
entire approach is implausible. The first problem is the question of
what use these durations in milliseconds might be? Who or what will be
able to use these numbers to actually achieve a target duration of 79.7
ms for this vowel? There
is no existing model for motor control that could employ such
specifications. We need
another theory of motor control to make use of these specifications
and attempt to generate a vowel of n milliseconds duration (see
Fowler, Rubin, Remez, & Turvey, 1981; Port, Cummins, & McAuley,
1995). Second, durations
in milliseconds seem fundamentally misguided since speakers talk at a
range of rates. So it seems that it should be relative durations that
are employed, not absolute durations (see Port, Cummins, & McAuley,
1995). Third, such a system has no apparent way for global timing
patterns (e.g., regular stress timing) to influence timing.
This model computes a duration for one segment at a time only.
Longer intervals get their duration just by adding up the individual
segments that comprise them. Finally,
fourth, if the duration pattern for the Germanic voicing contrast is
really the relative duration of a vowel to
the following consonant, this kind of model will have
difficulty. The vowel duration effect of the voicing is computed by a
context-based rule like the one above, while the stop closure is just
the inherent duration of the following consonant. It is difficult to
see how a rule-governed duration and an inherent duration could be
coordinated to assure a particular durational ratio since the ratio
itself plays no role in the rules.
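The second and fourth objections can be made concrete with a small sketch. It assumes, purely for illustration, a vowel of 150 ms and a closure of 100 ms (hypothetical values chosen to give the ratio of 1.5 mentioned earlier) and a uniform change of speaking rate, which is an idealization. The function names are likewise hypothetical.

```python
def vc_ratio(vowel_ms, closure_ms):
    """Ratio of vowel duration to the following consonant constriction."""
    return vowel_ms / closure_ms


def rescale(durations_ms, rate_factor):
    """Idealized uniform rate change: every interval shrinks by the same factor."""
    return [d / rate_factor for d in durations_ms]


# Hypothetical +Voice token at a moderate rate.
vowel_ms, closure_ms = 150.0, 100.0
print(vc_ratio(vowel_ms, closure_ms))    # 1.5

# The same token spoken twice as fast: the absolute durations change,
# but the relative duration does not.
fast_vowel_ms, fast_closure_ms = rescale([vowel_ms, closure_ms], 2.0)
print(fast_vowel_ms, fast_closure_ms)    # 75.0 50.0
print(vc_ratio(fast_vowel_ms, fast_closure_ms))  # 1.5
```

A rule system that specifies 150 ms and 100 ms separately has no term in which the value 1.5 appears, so nothing in it guarantees that the ratio survives a change of rate.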
Despite
these implausible features, it is difficult to prove the impossibility
of such an account. After all, if formal models can simulate a Turing
machine, they are very likely to be able to deal with such relational
temporal phenomena by some brute-force method. But an
implementational solution along this line is only interesting if
certain very specific constraints are applied to the class of
acceptable formal models, as Chomsky has frequently pointed out (1965).
And, if one can always add additional phonetic symbols to the universal
set and apply as many rules as one pleases, then it could be claimed
that only the Germanic language group happens to employ a particular
feature that is universally implemented in this temporal way (even
though no non-Germanic languages have been observed to exhibit it).
But to always permit postulation of new symbols for every new
temporal effect is surely not sound scientific method.
Yet,
short of proliferation of unique features, an implementation rule for
the Germanic voicing effect cannot be universal.
Most languages in the world (including, e.g., French, Spanish,
Arabic, and Swahili) do not exploit the relative duration of a
vowel to the following stop or fricative constriction as a correlate of
voicing (Chen, 1970; Port, Al-Ani, & Maeda, 1980).
We know from classroom experience that if you play English
stimuli varying in vowel and/or stop closure duration -- stimuli that
lead native English speakers along a continuum from rabid to rapid
-- those stimuli with varying V/C ratio will tend not to change
voicing category at all for French, Spanish or Chinese listeners. Their
voicing judgments are almost completely unaffected by V/C ratio. They
just pay attention to glottal pulsing during the constriction. Such
durational manipulations may affect the naturalness of the stimuli, but
do not make them sound more Voiced or less Voiced.
The
most plausible conclusion to be drawn from this situation is that the
Germanic languages share the property of manipulating V/C ratio for
distinguishing classes of words from each other.
English listeners, at least, make a categorical choice between
two values of a feature that might be described as `Voicing' (or as
`Tensity' or `Fortis/Lenis'). But there is nothing universal about this
property. It just happens to be a way that one family of closely
related languages controls speech production and speech perception to
distinguish vocabulary items. Thus,
we have an undeniable temporal pattern, one that requires some rather
specialized machinery to produce or perceive, yet which apparently must
be a learned property of the phonological grammar of specific languages
as a `feature' for contrasting sets of words. To call this
distributed temporal pattern a `symbol', that is, a static token,
is to make it impossible to see what it really is: an
intrinsically temporal pattern that is part of the specification of
words in this group of languages.
There
are other well-known timing effects as well that are related to the
prosodic patterns of languages. Kenneth
Pike first commented in 1945 that the languages of the world seem to
fall into two general types as far as their prosodic pattern is
concerned: stress-timed
languages and syllable-timed languages. In modern
linguistics, prosodic timing is most often viewed in terms of this
dichotomy. The taxonomy rests on the impression that languages seem to
have distinct rhythmic styles. Languages like English and Russian seem
to be stress-timed, since stressed syllables have a tendency to be
regularly spaced in time, whereas in French and Spanish, it seems that
each syllable is trotted out in a regular rhythm. This is based on
auditory impressions of regular time intervals, as, for example, by
tapping a finger on a table (Jones, 1918/1932; Pike, 1945; Abercrombie,
1967). The taxonomy
implies a hypothesis of isochrony, or equal spacing in time.
Impressionistically, equal time intervals seem to apply to the level of
stressed syllables in English and Russian. Thus, if we say, e.g., `He
EATS poTAtoes toDAY', the stressed syllables seem equally
spaced. One could tap a finger for each one. But if we say `He's
EATen the poTAtoes toDAY', it seems like the timing is
almost the same even though there are now two additional unstressed
syllables inserted between EAT- and -TA- (especially if
you tap your finger on each stress). On the other hand, in French, for
example, if we say `Je ne parle pas français', it seems
like a finger could be tapped for each of the 5 syllables, and that all
are about equally spaced. Pike,
seconded by Abercrombie, suggested that all languages might fall into
one timing pattern or the other.
When
careful instrumental measurements are used, however, it is typically
found that neither inter-stress intervals nor inter-syllable
intervals are actually produced isochronously (Classé, 1939;
Bolinger, 1965; Pointon, 1980; Wenk & Wioland, 1982; Tajima, 1998).
Of course, it should not be a surprise that perfect
isochrony is not found. That is a very challenging test to meet.
Naturally, lacking a realistic theory of timing and given such
a difficult test to meet, experimental studies have repeatedly failed
to support hypotheses of isochrony. The question is ``Just how
regular does the timing have to be to support such a
hypothesis?'' No
answer was available.
Furthermore,
investigation of the mora in Japanese during the 1980s called
into question the original simple taxonomy. A third type of timing
pattern was revealed. In Japanese, a mora can be a simple CV syllable,
like -ta-, but syllables with a long vowel, like too-
or a final consonant, like han-, count as two moras.
Thus a word like, say, Honda
has 3 moras, just like Fujita. And Tookyoo has 4
moras like Fujiyama. (In the Japanese syllabic writing system,
that is, in the hiragana and katakana scripts, each
mora is written with a single symbol.)
The traditional description of the mora found in Japanese
pedagogy holds that it is an isochronous unit, that is, that
all moras have equal duration (so that Honda and Fujita
should take the same amount of time to pronounce). Despite
experimental support for the mora as an isochronous temporal unit (Han,
1962; Port et al., 1980; Homma, 1981), it has remained somewhat
controversial among phoneticians (Beckman, 1982).
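The mora counts given above can be reproduced with a rough counting sketch. It assumes the romanization used in this paper (long vowels written double, as in Tookyoo) and encodes only the three cases mentioned in the text: each vowel letter, a syllable-final n, and the first half of a doubled consonant each count as one mora. The function name count_moras is hypothetical, and the sketch makes no attempt to handle other romanization conventions.

```python
VOWELS = set("aeiou")


def count_moras(romanized):
    """Count moras in a romanized Japanese word (rough sketch).

    One mora for each vowel letter, for a syllable-final 'n',
    and for the first half of a doubled (geminate) consonant.
    """
    word = romanized.lower()
    moras = 0
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in VOWELS:
            moras += 1                      # short vowel, or half of a long one
        elif ch == "n" and nxt not in VOWELS and nxt != "y":
            moras += 1                      # syllable-final (moraic) n
        elif ch == nxt:
            moras += 1                      # first half of a geminate consonant
    return moras


for w in ["Honda", "Fujita", "Tookyoo", "Fujiyama"]:
    print(w, count_moras(w))
# Honda 3, Fujita 3, Tookyoo 4, Fujiyama 4
```

The same function gives 1 through 7 for the HI word group (hi, hika, hikka, hikkaku, hikkakeru, hikkakenai, hikkakerareru) listed in Table 1.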
Disagreements
about the mora arose primarily from whether each individual mora is the
same duration as each other mora, so that any compensation for
inherently longer or shorter consonants and vowels is achieved within
the mora (Beckman, 1982),
or whether compensation may occur in neighboring moras. In the latter
case, the regularity of mora timing should be observable only by
looking at a window of several moras, e.g., a whole word or
phonological phrase (Port et al., 1980; Port et al., 1987; Han, 1994).

Figure 2. Five native Japanese speakers (Port et al., 1987) read words from 6 word groups (illustrated in Table 1). The word durations are pooled across subjects according to the number of moras. Each word group is labeled by its initial mora (so ka is the first mora in all tokens within that group). As the number of moras increases, the word duration increases linearly at both fast and slow tempos. These data show that word duration is almost completely determined by the number of moras, not the number of syllables. Other experiments show that if a mora is inherently short, then segments in the preceding and following syllables stretch to compensate.
Number of moras | Word | English gloss
HI group | |
1 | hi | sun
2 | hika | subcutaneous, hypodermal
3 | hikka | faux pas, misstatement
4 | hikkaku | to scratch, claw
5 | hikkakeru | to hang, hook
6 | hikkakenai | not to hang
7 | hikkakerareru | get hung
Table 1.
An example of number of moras for one group of words used in Port et
al. (1987). The other groups followed a similar pattern of stacking up
moras. The apostrophe indicates a pitch accent in the utterance.
In the Port et al. (1987) study, it was found that whole words with the same number of moras had nearly the same duration no matter what their segmental content or number of syllables (Figure 2 and Table 1). Increases in speaking rate shortened the average mora duration. Other experiments show that within a word, neighboring segments within the mora as well as adjacent moras expand and compress in compensation for their neighbors, as shown in Figure 2. Other experiments have shown that perceptual segmentation of speech based on mora units is also found in prelexical processing in Japanese (Otake, Hatano, Cutler & Mehler, 1993; Cutler & Otake, 1994), suggesting that the mora is highly salient as a unit in the production and perception of Japanese. So, if measurements are done just the right way, Japanese does show a type of isochrony even if it is neither syllable timing nor stress timing. More rigorous methods are surely needed for studying English and other languages. No one should have expected absolutely equal durations for any units in speech. Furthermore, English speakers can sometimes exhibit apparent syllable timing, for example in certain styles of babytalk. So a more flexible experimental and theoretical framework was needed to explore these issues without assuming that only one timing principle can apply to each language.
Speech Cycling Experiments.
The
idea of speech cycling evolved from research
methodologies employed in studies of limb motor control, which examine the
behavior of the limbs when two periodic patterns are interfering with
or interacting with each other (e.g., Haken, Kelso, & Bunz, 1985;
Kelso, 1995; Treffner & Turvey, 1993). The idea of speech cycling
is to study the timing of repetitive speech when interacting with, or
coupled with, an external auditory metronome (Cummins & Port, 1998;
Cummins, 1997; Tajima, 1998; Port, Tajima, & Cummins, 1999; Tajima
& Port, in press). Subjects
repeat a short piece of text in time with a computer controlled
metronome-like pattern. It turns out that there are strong constraints
on the method of performing these tasks, some of which are apparently
universal, but others of which characteristically differ from language
to language. If it is true
that speakers of different languages perform such tasks in ways that
differ between languages, this would be further evidence that the
actual `grammars' employed by native speakers include temporal
characteristics.
So,
for example, notice that if English speakers repeat a phrase like `Take
a pack of cards, Take a pack of cards, Take a pack of cards'
at a rate of
about 1 repetition per second, they are likely to pronounce it putting
a pitch accent on both take and cards.
The pattern of timing of the beats for these two pitch
accents is very likely to be one or the other of two rhythms -- either
a slow two beats (as in `1, 2; 1, 2; TAKE a pack of CARDS; TAKE a pack
of CARDS') or else a fast three-beat pattern (`1, 2, 3; 1, 2, 3; TAKE
a PACK of CARDS; TAKE a PACK of CARDS; TAKE a PACK of CARDS').
The 2- and 3-beat harmonic timing patterns are probably
universal (since these meters show up in most musical traditions), but
locating the onsets of pitch-accented syllables at the beats may be an
idiosyncratic property of English.

Figure 1.
This figure demonstrates the use of speech cycling in Cummins &
Port (1998). In this study, a corpus of 30 short phrases with identical
prosodic structure was used. Each phrase was of the form
`X for a Y', like `beg for a dime'.
Subjects repeated these types of phrases to a metronome pattern and the beat,
or vowel onset location, was found for each target syllable. A
succession of 14 pairs of alternating high (H) and low (L) tones is
used in each trial. The interval from the high to low tone was fixed at
700 ms (to keep speaking rate constant). The relative time of the low
tone to the next high tone varied randomly over the range .2 - .7
(observed phase = the duration in milliseconds between the H and L
tones divided by the duration from one H tone to the following one).
The subjects were asked to speak the phrases such that the first word
of each phrase (e.g., beg) lined up with the high tone, and the
last word (e.g., dime) lined up with the low tone. The simplest
hypothesis is that there are no rhythmic constraints on speech
production. If this were true, subjects should have produced the onset
of the final stress (dime) wherever the L tone occurred in the
phrase cycle (from .2-.7). Instead, the histograms are strongly
multimodal, with three clear modes near .33, .5, and .66, although subject
KA showed only two modes, at .3 and .5. Thus the subjects could not
accurately perform the task but were strongly biased to locate the
final syllable onset at harmonic fractions of the complete H-H
cycle (that is, at 1/2, 1/3 or 2/3).
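The phase measure defined in this caption is simple to compute. The following Python sketch applies the definition to onset times in milliseconds and finds the nearest of the three harmonic fractions. The function names and the example onset times are hypothetical, chosen only to illustrate the calculation, and are not data from Cummins & Port (1998).

```python
from fractions import Fraction

# The harmonic fractions of the H-H cycle named in the caption.
HARMONIC_PHASES = [Fraction(1, 3), Fraction(1, 2), Fraction(2, 3)]


def observed_phase(h_onset_ms, event_onset_ms, next_h_onset_ms):
    """Position of an event within the H-H cycle, as a fraction from 0 to 1."""
    return (event_onset_ms - h_onset_ms) / (next_h_onset_ms - h_onset_ms)


def nearest_harmonic(phase):
    """Return the harmonic fraction closest to an observed phase."""
    return min(HARMONIC_PHASES, key=lambda h: abs(float(h) - phase))


# Hypothetical cycle: H tones at 0 and 1400 ms, final stress onset at 700 ms.
print(observed_phase(0.0, 700.0, 1400.0))   # 0.5
print(nearest_harmonic(0.45))               # 1/2
print(nearest_harmonic(0.6))                # 2/3
```

Under the hypothesis of no rhythmic constraints, observed phases would be spread evenly over the target range rather than clustering near the values in HARMONIC_PHASES.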
By
picking analogous text material in two languages and urging subjects to
adopt specific rhythmic patterns, it is possible to compare the ease
and difficulty of various temporal patterns in two languages. For
example, Tajima & Port (in press) employed a three-beat
(waltz-like) speech cycling pattern to reveal distinct temporal
characteristics for English and Japanese. It was found that English
speakers show more `stress-timing' effects than do Japanese
speakers in similar situations. Thus
the characterization of at least English and Japanese as stress-timed
and mora-timed can be justified experimentally. Furthermore they found
evidence that individual English speakers can slip from timing that
favors equal timing for individual syllables to timing that is more
strongly concerned with the spacing of just the stressed syllables.
These
data strongly suggest that prosodic features of timing should be
included as part of a linguistic grammar, if by a grammar we
mean the idiosyncratic characteristics of each language that
differentiate it from others. One way in which languages differ, as
shown by speech cycling experiments, is in how syllables are identified
as `prominent' (cf. Tajima, Zawaydeh & Kitahara, 1998; Cutler
et al., 1986). Since there are temporal aspects that differ from
language to language, the notion of a grammar must be expanded to
include such properties. The alternative, to leave timing out of our
understanding of language, is to underestimate what is involved in
speaking a language competently.
Even
leaving aside the issue of language differences in timing, the very
naturalness of speech cycling tasks
(that is, the tendency, even without a metronome, to adopt a
rhythmical pattern in our speech, especially when the speech is
repetitive) reveals something fundamental about human language.
Like so many other human behaviors, speech is very naturally
and unavoidably entrained to periodic patterns. There are countless
examples of periodically structured speech: songs, auctioneering,
preaching, sports announcing, canonical babbling in infants, marching
chants, work songs, etc. (Port et al., 1999; Ejri, 1998). Even
an everyday speech task like reading a list of words aloud is enough to
strongly encourage regular speech periodicity (Leary & Muench,
1998). Like the speech cycling technique, these examples of everyday
speech remind us of the ubiquity of cyclic behavior: from breathing and
running to pacemaker cells in the heart. Therefore, is it surprising
that we find speech may exhibit timing similarities to other biological
and neurological functions? We
believe it is a good bet that many cognitive functions, both linguistic
and non-linguistic, will eventually be shown to be entrainable to rhythmic
patterns at some appropriate rate.
To
acknowledge similarities in the timing of speech to timing found in
other biological systems (animal and human) implies a description of
speech that is dynamical and changes in time[1].
Formal logic and symbolic phonological accounts of prosody (Chomsky
& Halle, 1968; Liberman & Prince, 1977; Selkirk, 1984; Hayes,
1995) provide us with no basis for interpreting the temporal
characteristics of stressed and unstressed syllables (van Gelder &
Port, 1995; Port et al., 1999). Yet, as we have seen, speech prosody
does affect the timing of language in many important ways. So once
again, the strong tendency of speech to entrain to periodicity like
other motor actions is evidence supporting the view that the linguistic
structure is intrinsically dynamic, not symbolic.
We have pointed out
that certain phenomena are counterevidence for the Symbolic Language
hypothesis. One kind of counterevidence is demonstration that temporal
phenomena are sometimes intrinsic to language, as we have demonstrated
in the previous sections. A second kind of critical evidence would be a
convincing demonstration of patterns that are linguistically distinct
and yet not discretely different: not different enough that they
can be reliably differentiated and yet not the same either. This is
a difficult set of criteria to fulfill, but in fact such a situation
has been demonstrated in a number of experiments for several languages.
Position of stop | German word | | English gloss
Initial | der Back | | mess table
 | der Pack | | pack, bundle
Medial & final | Plural | Singular |
 | Alben [alben] | Alb [alp] | elf
 | Alpen [alpen] | Alp [alp] | mountain pasture
Table 2.
Some examples illustrating the traditional data regarding
word-final devoicing in German. Using an atomic inventory
of static features, the following phonological rule is said to apply: [-sonorant] →
[-voice] / ____$, where $ is a syllable boundary.
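As a symbolic rewrite rule, the traditional statement can be sketched as a simple string substitution. The sketch below is a deliberate simplification: it treats only the last segment of a one-word transcription as syllable-final, the obstruent pairs in DEVOICE are an assumed partial inventory, and the function name is hypothetical. It represents the categorical rule as traditionally stated, not the measured pronunciations.

```python
# Assumed (partial) mapping from voiced obstruents to voiceless counterparts.
DEVOICE = {"b": "p", "d": "t", "g": "k", "v": "f", "z": "s"}


def devoice_final(transcription):
    """Apply [-sonorant] -> [-voice] to the final segment only (sketch)."""
    if transcription and transcription[-1] in DEVOICE:
        return transcription[:-1] + DEVOICE[transcription[-1]]
    return transcription


print(devoice_final("alb"))    # alp   (Alb, singular)
print(devoice_final("alben"))  # alben (Alben, plural: stop is not final)
print(devoice_final("alp"))    # alp   (Alp, already voiceless)
```

Because the rule maps both underlying forms onto the identical output string, a model of this kind leaves no room for any residual difference between the two words.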
The
best studied case is the near-neutralization of voicing in
syllable-final position in Standard German. Here, final voiced stops
and fricatives, as in Bund and bunt (`club', `colorful'),
are said to neutralize to the voiceless case as shown in Table 2. That
is, although Bunde and bunte show that the words contrast
in the voicing of the apical stop, the pronunciations of Bund and bunt
seem to be the same. Both sound like [bUnt].
But the difficulty is that they are not pronounced exactly
the same (Port, Dalby & O'Dell, 1987; Port & Crawford,
1989). These pairs of
words, with final stops and fricatives, actually are slightly different
as shown in Figure 4. If
they were the same, then in a listening task you would expect 50%
correct (pure guessing like English too and two);
if different, you expect 99% or better correct identification under
good listening conditions (like German Bunde and bunte).
Instead, they are different enough that listeners can guess
which word was spoken, but with only about 60-70% correct
performance (Port & Crawford, 1989)!
But such performance shows that the word pairs are neither
clearly the same nor clearly different. The voicing contrast is almost
neutralized in this context, but not quite. The differences can be
measured on sound spectrograms, but for any measurement one chooses
(vowel duration, stop closure duration, burst intensity, amount of
glottal pulsing during the closure, etc.), the two distributions
overlap considerably.
