Speech Timing and Linguistic Theory
by Robert F. Port and Adam P. Leary
Department of Linguistics, Indiana University
Bloomington, Indiana 47405
port@indiana.edu, adamlear@indiana.edu
July 05, 2000
It is argued that the traditional computational or symbolic view of cognition has little choice but to assume an apriori phonetic space of discrete, static and serially ordered atomic symbols, as assumed explicitly by Chomsky and Halle (1968) and others. Then several well-supported cases are presented in which these assumptions turn out to be false. First, there is a
durational pattern observed in the English voicing contrast in
syllable-coda position (e.g., lab/lap) where evidence shows that
the relative duration of two intervals (the vowel duration and
following stop or fricative duration) is a fundamental cue for the
value of the voicing feature. Since this durational ratio cannot
plausibly be assigned to a universal property of phonetic
implementation, the durational ratio must be described as a property of
English phonetics or phonology in violation of the Chomsky-Halle
assumption of universal, static phonetic features.
Second, the incomplete neutralization of the voicing contrast
in syllable-final position in German (where Bunde-bunte become Bund-bunt
with near-but-not-perfect neutralization to the voiceless category)
shows that the discreteness property may also fail to hold.
These results imply that language processing is achieved by a
system that may prefer discrete, symbol-like units, but does not
require them. Recent
developments in models of speech perception and production processes
dependent on dynamical systems, such as those of Grossberg and
colleagues, exhibit the appropriate characteristics to serve as a
psychology upon which a psychologically sound linguistic theory can be
constructed that does not require assuming that human language is an
instance of a mathematical or computational system.
Section
1: The problem of time in language.
No
one would deny that speech is produced in time, that is, that the
sentences, words, consonants and vowels of human language are always
extended in time when they are uttered.
Still, many - and probably most - linguists would argue that
temporal extension is not an intrinsic property of natural language and
that the temporal patterns of language (other than those representable
in terms of serial order) should not be expected to be relevant or
revealing about language itself. This
might surprise many nonlinguists, but
it
reflects a fundamental and far-reaching assumption about language and
even about general human cognition.
Linguists tend to assume that the temporal layout of speech is
a property that is imposed on language from the outside at the point
where the logically static and serially ordered structures of the
language itself are performed by the human body. That is, linguists
typically assume that linguistic competence contains only static
structures. The tree-shaped patterns graphed on the pages of linguistics journals and serial lines of printed text like the ones you are reading on this page are taken to be, in essential respects, good models of our actual cognitive representations. The cognitive form of language is said to have serially ordered discrete words composed from a small inventory of meaningless sound-related segments, just like a printed page. And a tree diagram is held to capture the same logic as the phrase structure of any sentence.
These
cognitive symbol strings may be `implemented' in time by the linguistic
`performance' system if and when linguistic structures happen to be
spoken. From the standard linguistic perspective it is the serial and
hierarchical structure of language that reveals its true form. One
might say that speech
is language plus the filtering effects of the performance system
that maps language into speech. From
the traditional point of view, speech performance is thus derivative.
It is merely one possible `output mode'-- it is just one of several
ways (along with writing) to get language from the mind out into the
body and the world. Speech,
on this view, just happens to impose time on the fundamentally
nontemporal structure of the language itself.
This
point of view is basic to all 20th century structuralist
views of language (de Saussure, 1916; Bloomfield, 1926; Hockett, 1954)
including the Chomskyan paradigm (Chomsky, 1965; Chomsky & Halle, 1968).
It assumes a fundamental distinction between the Formal and the
Physical. On one hand, there is a Formal (or mental) World - the Competence World, where the serial order of hierarchies of timeless symbols provides the data structures of natural language.
Formal operations apply to these data structures sequentially
just as they apply in a derivation in formal logic or mathematics.
Given the formal nature of the structures involved, any time
interval that might happen to be required for the operations to take
place is merely epiphenomenal. It is not directly relevant to the
formal operations themselves. And on the other hand, there is a
Physical World of brains and bodies belonging to speakers and listeners.
From the traditional perspective, the time-free structures of
language are actually ``implemented'' and processed in time (see
Scheutz, 1999 for further discussion of implementation). From the
perspective of traditional linguistics, such implementation processes
may hold some interest, but they are in no way the natural home of
human language and, in any case, are not directly relevant to it.
We
believe this point of view is deeply mistaken. Why? Although there are
many reasons (see Port & van Gelder, 1995), we will discuss just
two. The first is that the
dichotomy of competence and performance, of mind and body, of formal
and physical, creates a gulf that, once postulated, turns out to be
impossible to span using the methods of empirical science.
This is surely one reason why linguists are not generally eager
to study the methods or results of experimental psychology, speech
science, neuroscience or even physics (since these time-dependent
fields are so difficult to reconcile with language as a pure symbol
system). They seem irrelevant. And correspondingly, this is why
scientists from other disciplines frequently have difficulty
understanding what linguists are doing.
Disciplines like neuroscience and much of cognitive psychology
(though not all) lie across the formalist gulf from linguistics.
Thus far, no satisfactory way to bridge this conceptual gap has
been found. If one assumes
that real cognitive events do not take place in space and time
and that real physical events do, there is no obvious way to get
them together.
To
appreciate this problem, the first step is to review some familiar
properties of a symbolic
system.
Although linguists, for example, assume the uniformly symbolic nature of language at all levels, from phonetic segments to phonological units, morphemes, words, phrases and sentences, less attention has been paid to what properties a symbol token must exhibit.
In western science, symbols are employed in at least these
three distinct domains: for doing mathematical reasoning, in
computer hardware function and as a theory of cognition.
Thus, for various types of mathematical reasoning, logic uses tokens
like p and q, and mathematics
might use x and integers.
In formal reasoning (like doing logical proofs or long
division, or writing a computer program, etc), operations are performed
on symbolic structures as executed by trained human thinkers.
When the steps are difficult, and throughout the training of practitioners, steps in a formal reasoning process are typically supported by body-external props. That is, extended formal reasoning depends on external `scaffolding' (see Clark, 1997), such as writing physical symbol tokens on paper (and, very recently, using the support of programs running on digital or analog computers). In
computer hardware, formal methods are automated in general terms by the
use of symbol tokens coded into bits in a digital computer. The third
domain for the employment of symbolic theory lies in a particular view
of various cognitive operations involved in human language and human
reasoning (Chomsky, 1965; Fodor, 1975; Newell & Simon, 1972). The
symbol tokens in language (and probably in general cognition) are the
words and sound components of a particular language.
As
clarified by Haugeland (1985), in order to function as advertised, the symbol
tokens
in any symbolic system must
be,
first of all, digital,
that is, discretely distinct from each other and reliably recognizable
by the available computational or cognitive equipment. This is
essential in order for the computational mechanisms to manipulate the
symbols infallibly during processing.
Second, the symbols used must
be either
apriori
or composed from apriori components.
Some set of apriori units must be available at the time
of origin of a symbolic system from which all further symbol structures
are constructed. In the case of logic or mathematics, an initial set of
specific units is simply postulated (e.g., the integers, proposition p or
the initial symbol S). In computing, operations on bytes are executed in discrete time, but the units and primitive operations were engineered into the hardware itself, and are thus obviously apriori. Human language and everyday reasoning are more problematic since we don't know what their primitives are in advance. Indeed, the discovery of the list of innate primitive units is one of the primary targets of research in modern linguistics (e.g., Chomsky, 1965). (The discrete time assumption is almost never discussed.) Most
often linguists assume that the initial list of primitives includes at
least units like [Vowel], [Consonant], [Noun], [Past Tense], [Sentence]
and so forth.
The
third property of symbols, although one that Haugeland did not say much
about, is that they must be static.
Since symbolic or computational models always function in
discrete time, it must be the case that at each relevant discrete time
point (that is, at each tick of the discrete-time clock), all relevant
symbolic information is available. For example, if a rule is to apply
that converts apical stops into a flap, then there must be some time
point at which the features that figure in the rule, [+stop], [+voice],
[+apical] etc. are all fully represented and either are holding steady
or somehow are constrained to synchronize with each other while the
rule applies in a single step of discrete time. Thus, properties in a
symbolic system cannot unfold gradually in continuous time but must, at
the relevant clock tick, have some discrete symbolic value.
(Of course, there is nothing to prevent simulation of
continuous time with discrete sampling, but this is not what the
symbolic, or computational, hypothesis about language claims.)
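To make the static, discrete-time requirement concrete, here is a toy sketch of our own (not anything proposed in the literature): a symbolic rule inspects a complete feature bundle at a single clock tick and rewrites it in one step; nothing unfolds gradually in continuous time.

```python
# Toy illustration (ours): a symbolic flapping rule in discrete time.
# A segment is a feature bundle; the rule fires only when every feature
# it mentions is simultaneously present at the current "clock tick".

def flap_rule(segment):
    """Convert an apical stop to a flap in a single discrete step."""
    required = {"stop": True, "voice": True, "apical": True}
    if all(segment.get(f) == v for f, v in required.items()):
        # The entire bundle is rewritten at once; no gradual transition.
        return {"flap": True, "voice": True, "apical": True}
    return segment

d_segment = {"stop": True, "voice": True, "apical": True}  # a /d/-like bundle
print(flap_rule(d_segment))
```

The point of the sketch is only that every feature value must be discretely present and holding steady at the moment the rule applies, exactly as the symbolic hypothesis requires.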
Now,
how can symbolic units with these properties be naturalized and sought
in a human brain? Formal symbols assume some very nonbiological
properties. It is one thing for humans to manipulate such units in a
deliberative way leaning on the support of paper and pencil so each
step can be written down and checked for accuracy, and for computers to
employ specialized discrete-time hardware to process symbolic
structures. But it is
another matter altogether to casually assume that real formal symbol
structures are actually processed in a discretized version of real time
by human brains. The problem is that if we study language as a facet of actual physical human beings (rather than as a particular instance of a discrete-time `Platonic heaven') then its processes and its products must have some location and extent in real time and real space. Truly natural language must be visible and accessible to scientific research methods that investigate events in space and time, that is, real events in real time. If
there is temporally discrete behavior of the human brain (which is
certainly quite possible), then we must surely study this empirical
phenomenon directly by gathering our data in continuous time in
order to discover just where temporal discreteness can be observed and
how the discrete-time performance is achieved.
Simply assuming there is a sharp apriori divide between language as a uniformly serial-time structure and speech as a real-time event (as claimed in Chomsky's competence/performance distinction) is a very risky assumption to make.
And, in our view, there is a fair amount of evidence that it is
simply false.
The second reason for rejecting the view that language is essentially formal is that it seems clear that, from a biological viewpoint, language is fundamentally and essentially a spoken medium, not a written one. Contrary to the practice of linguistic theory, it is the written and orthographic versions of language that are derivative from speech. It is especially
in written language where the symbol-like characteristics like
near-discreteness, timelessness and closed inventories of symbol
tokens are most pronounced. Yet written (and edited)
language is based upon historically recent, culture-dependent methods
(dating back only a few thousand years), using cognitive processes that
are partly dependent on literacy, logic and mathematical
generalization. Even today fewer than half the human population is
literate and only a minute fraction could appreciate the meaning of a
diagram of the structure of a sentence or a syllable even if one spent
some time explaining these images to them.
These seem to us to be real problems for the traditional view, ones that cannot simply be brushed off with assertions like ``We don't really know much about how the brain works anyway.''
For example, if every human utterance exists only by being
built from clearly discrete building blocks, that is, if linguistic
expressions are always discrete structures assembled from linguistic
atoms (rather like the words and letters on this page are composed from
inventories of very limited size), then why isn't it always similarly
transparent to speakers (as well as to linguists) what the data
structures actually are for any utterance in any language?
Of
course, we agree that in many specific cases it is intuitively
very clear to speakers what the components are. For an English sentence
like `Don kicked that', the morphemes are transparently just the
set Don, kick, -ed and that.
And each of the phonemes that spell the cognitive
representation of these morphemes seems to be a simple, atom-like sound
unit [dankikt...].
Furthermore, at the syntactic level as well, there is a fairly clear
discrete phrase structure that represents [Don] as a unit in
parallel with [kicked that] (serving as subject and predicate
respectively), and we intuitively appreciate that the bond between, [kick]
and [ed] is much tighter than the bond between [kicked]
and [that].
But
every linguist also knows that very often it is not obvious how
many parts there are in a stretch of speech nor exactly what the parts
are. Thus, it's not clear how many morphemes there are in such a simple
word as `strawberry', or how many phonological parts there are
in `chive'. The
piece `-berry' looks like an obvious morpheme, but what about `straw-'?
It doesn't seem to be related semantically to `wheat straw' or `soda
straw' nor to
contribute any specific meaning except as a kind of diacritic on `-berry'
(to keep strawberry distinct from raspberry and
elderberry). But if straw
is only a diacritic, then why chop off berry in the first place?
Perhaps strawberry should be monomorphemic
(and listed independently in the cognitive lexicon) despite the
fact that berry also occurs both as a single word and as part of
other words with related meanings.
But, this solution ignores the generalization that the berry
in strawberry looks like, and means roughly the same as, the
isolated word berry. So should this redundancy be avoided? Well, it depends on the analyst's theoretical assumptions about grammars, of course.
And
is `chive' made of 3 cognitive sound units, or 4 or 5 (since the
initial consonant and the medial vowel have either two distinct parts
or smooth gliding motions both acoustically and articulatorily)? Such
problem cases occur frequently in every language.
It is often very difficult to parse phrases and to isolate the segmental phonemes, morphemes and other units from spoken or written language. In these
challenging but very typical cases the morphological analysis is not
obvious at all. Instead, linguists must bring some theoretical
assumptions to bear in order to justify one analysis over others. Two
of the very first assumptions seem to be the Symbolic Language
hypothesis (SL) and the Atomic Inventory hypothesis.
1.
The
Symbolic Language Hypothesis (SL):
Every utterance in every language is constructed entirely of discrete,
timeless symbols. Complex units are composed
of combinations of simpler units in a series of levels.
2.
The Atomic Inventory Hypothesis (AI):
All languages select their phonological (and other primitive) symbols
from a universal discrete set, a set that is not very large.
All
significant differences within the sound systems of any language as
well as all differences between languages are supposed to be
representable in this alphabet of symbols. Languages, it is claimed,
can never differ in the continuous-valued implementation of these
minimal symbols. If they
could, then languages would not be entirely symbolic.
(Notice that English orthography, like most other alphabetic
orthographies, explicitly embodies a related assumption since all the
words in correct sentences are constructed from a set of 26 letters and
the words are supposed to be selected from a standard dictionary of
vocabulary items. Much
more generally, linguists assume these properties are true of spoken
language as well.) For
spoken language the discreteness of phonological atoms, that is, of the
segmental distinctive feature vectors, guarantees the discreteness of
all other linguistic units spelled from them.
Of
course, it is possible that the SL Hypothesis is correct, but what if
it turns out to be partly false? What if words and other apparent
linguistic units are sometimes only nondiscretely different from
each other (unlike printed letters)? And what if linguistic units like
words and phonemes are not always timeless static objects, but turn out
to be necessarily, essentially, events in time? If (a) genuine category
nondiscreteness exists, or if (b) there exist any linguistic structures
that are essentially temporal (as opposed to merely `implementationally
temporal), then the bold symbolic assumption would be seriously
compromised. Certainly the SL hypothesis would not seem to be so obviously correct that one would feel justified in clinging to it as a reliable guide whenever descriptive problems get murky.
Instead, we propose that the first steps should be taken toward a new discipline of linguistics -- we might call it `Embodied Linguistics' (even while risking scorn for such a trendy term!). Step one
must be to naturalize language and fit it into a human body, that is,
first of all, to cast it into the realm of space
and time.
The first step in doing this is to change our focus of attention from
the study of linguistic knowledge (normally conceptualized as
static and symbolic) toward the study of linguistic behavior and
performance, since behavior and performances exist in time.
We take this step, not because of assumptions about learning (like the behaviorists did; Bloomfield, 1926), but simply because we need temporal information -- timing data -- to discover how the whole system really works. The
cognitive system is assumed to run in time and can only be understood
that way. The SL assumption deprives us of temporal data about human
language and cognition. Only
through the study of specific language performances under many
conditions can accurate understanding of the cognitive form of language
arise. Indeed, we will
present evidence in this essay of both the nondiscreteness of
particular linguistic patterns and also demonstrate one
language-specific pattern that is essentially temporal.
These results clearly violate the Symbolic Language hypothesis and the Atomic Inventory hypothesis and support a view of language as something that is essentially embodied. Let's turn to the evidence.
Speech
timing research in general.
Research on speech production and perception has shown, from the earliest era in the mid-50s, that manipulation of aspects of speech timing could influence listeners' perceptual experiences.
These aspects include the slope of energy onset in fricatives,
the duration of acoustic intervals corresponding to consonant closures
and vowels, the slope of formant frequencies and so on (see reviews in
Lehiste, 1970; Klatt, 1976). For example, chopping off the very front of the [ʃ] in `shore' with a waveform editor, so that the noise onset is very abrupt rather than gradual, creates the perceptual experience of an affricate, as in `chore'. Shortening the duration of a final [s] in `hiss' (keeping the offset of noise intensity gradual by editing from the middle of the [s]) can create the perceptual experience of `his' for English speakers. (Klatt, 1976, reviewed many of these perceptual phenomena.) Of course, for languages with long and short vowels, like Hungarian [tør] tör `break' vs. [tøːr] tőr `dagger', if you lengthen a short one, it sounds like a long vowel, and if you shorten a long one, it sounds like the word with a short vowel (for Hungarian listeners).
Clearly listeners in these cases are using some kind of
measurement of onset slope and fricative or vowel duration extracted on
the fly to make these judgments. Similarly,
speakers are controlling these properties during their production.
Do these phenomena pose a challenge to the traditional theory?
How could the traditionalists maintain that serial order is all
that is linguistically relevant about the time axis?
Their answer is that there is both linguistic Competence,
the cognitive symbolic system, and Performance, the system of
psychological and physical constraints.
Traditionalists argue that sensitivity to these temporal patterns, even if they are exploited by listeners in speech perception, is not evidence about the language itself (Jakobson, Fant, & Halle, 1952; Halle & Stevens, 1980; Chomsky & Halle, 1968; and see Lisker & Abramson, 1971). Thus, in its symbolic form, the Hungarian words tör vs. tőr, for example, might differ either in the number of segments (that is, one /ø/ vs. two /ø/s) or possibly in whether a single vowel segment has the feature [+long] or not (that is, [ø, -long] vs. [øː, +long]). The actual implementation in time is interpreted to be a consequence of universal processes of interpretation of the symbol, external to the linguistic description per se. This is a reasonable story. However, notice that
this view still rests on an important theoretical claim about the
phonetic implementation, one that may prove vulnerable.
The traditional account of speech timing is theoretically
tenable only given the assumption that the phonetic implementation
processes are universal. Only if phonetic implementation of discrete
phonetic symbols works the same way for all languages could it be also
true that linguistic utterances are composed entirely of symbols and
differ from each other only in symbol-sized steps.
Since it is assumed languages cannot differ from each other in
their phonetic implementation methods, the phonetic implementation must
be universal.
The
phonology is supposed to embody the language-specific properties of
speech, while the phonetic inventory and its implementation is
universal. It has to be, if the phonetic space is to include all the
phonetic capabilities of the human species.
The fact that every language depends on the same phonetic inventory provides a key part of the Chomskyan account of why children can learn language so quickly (since the universal phonetic representation supports their ability to listen to and imitate their mother's speech). This universal space is defined by a specific
inventory of symbolic phonetic units that have serial order as their
only version of the time scale (Chomsky & Halle, 1968).
Now
what if it could be shown that what distinguishes a pair of words in
some language (or a class of words that share a feature contrast) is intrinsically
a temporal property, rather than something intrinsically
nontemporal that happens to be implemented in time?
We have seen that the simple fact of differing in duration is
not sufficient evidence since implementation rules might account for
the duration effect. But, imagine a case where two sound classes differ
from each other in some particular durational relationship and the
duration of some acoustic segment (or articulatory segment) to some
adjacent acoustic segment has to be, let us suppose, near some
particular durational ratio.
If such a case could be found and if it could also be shown, in
addition, that the temporal ratio shows no indication of being an
implementational universal, then such a case would be strong evidence
that temporal constraints can be intrinsic to specific languages. A
static, durationless symbol would serve as a very poor model to
describe such a situation. Most importantly, it would call into
question the conventional isolation of time from linguistic cognition
and would imply that the phonological grammars of particular languages
incorporate some temporal properties.
Section
II. Germanic Voicing (or
Tensity) Contrast: Temporal Aspects.
One of the best examples of such a language-specific temporal pattern used for phonological purposes is the contrast in English among stops and fricatives between those transcribed with /b, d, g, z/, etc. and those transcribed with /p, t, k, s/, etc.
That is, we focus on a `feature' or a class of contrasts in the
coda position of a syllable (that is, near the end of a syllable after
the main vowel). In English, pairs of words like `lab-lap,
build-built' and `rabid-rapid' contrast in this feature, as
do German Bunde-bunte (`club'-Plur., `colorful'-Nom. Sing.), Swedish bred-brett (`broad'-Basic, `broad'-Neuter) and Icelandic baka-bakka (`to bake', `burden'-Acc.). This
type of pattern was probably a characteristic of the ancient Germanic
proto-language of 2-3 thousand years BP and has been inherited in
somewhat different form by most modern Germanic languages.
The contrast is between pairs of stops and fricatives, like
/z-s, b-p, d-t/, etc. at the end of a syllable or between syllables.
Because of the importance of several different articulatory and
acoustic correlates of this feature, there has been some dispute over
the years as to the proper characterization of the difference - whether
it is essentially a feature of `Voicing' (that is, glottal closure and
pulsing) or some complex of other factors usually summarized in the
term `Tensity', where /s, p, t/, etc. are [+tense] and /z, b, d/, etc. are [-tense] or [lax].
It
is clear that one characteristic of this contrast is that it depends
significantly on timing to maintain the distinction.
In particular, for words with voiceless consonants, the preceding vowel is shorter while the stop closure is longer in the /p/ words (e.g., `lap', `rapid') relative to the corresponding words with /b/ (e.g., `lab', `rabid') (Peterson & Lehiste, 1960; Lisker, 1985; Port, 1981).
Of course, since speakers typically talk at different speaking rates, the absolute durations of vowels and consonants are highly variable measures in ms. It is not the case that absolute durational values (e.g., in milliseconds) are employed by listeners (since in that case they would both produce and perceive more /p/s and /s/s at slow rates and more /b/s and /z/s at faster rates, but they do not). Instead,
if all other properties are kept constant, the ratio of durations tends
to be much less variable than, e.g., the absolute durations in ms, as
shown in Figure 1.
Perceptual
experiments with synthetically constructed speech or experimentally
manipulated natural speech confirm that it is the relative durations
that determine judgments between minimal pairs like /lab-lap/ and
/rabid-rapid/ whenever other cues to the `tensity' feature are
ambiguous (e.g., Lisker, 1985; Port & Dalby, 1982).
Port has called this relationship the ``V/C ratio'' -- the relative duration of a vowel to the following obstruent constriction duration. This ratio is smaller for the [-voice] (or [+tense]) consonants than for the corresponding [+voice] (or [-tense]) consonants. The V/C ratio is relatively (though not completely) invariant across changes in speaking rate, syllable stress and segmental context (Port, 1981; Port & Dalby, 1982).
Figure 1. Stimuli and results from the Port & Dalby (1982) study of consonant and vowel timing as cues for voicing in English. The top panel shows sound spectrograms of two of the synthetic stimuli employed. These show the shortest (140 ms) and longest (260 ms) vowel durations for dib. For each vowel duration step, nine different silent medial-stop closure durations were used (= total of 40 stimuli). Subjects were asked to identify them as `dibber' or `dipper'. The lower panels show the percent identification as dibber as a function of medial stop closure duration (on the left) and (on the right) as a function of consonant/vowel ratio (where the stop duration of each stimulus is divided by its vowel duration). Looking at the perceptual boundary (50%-50% identification), note the much tighter clustering in the right panel around a C/V ratio of .35.
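The relative-timing cue lends itself to a simple sketch. The function below is our illustration, not the authors' model: the 0.35 boundary is read off the C/V identification data in Figure 1, and the point of the toy is that the ratio, unlike raw millisecond values, survives a change of speaking rate.

```python
# Illustrative sketch (ours): labeling a coda obstruent from relative timing.
# The 0.35 closure/vowel (C/V) boundary is the 50%-50% identification point
# in the Port & Dalby (1982) dibber/dipper data shown in Figure 1.

def classify_coda_voicing(vowel_ms, closure_ms, cv_boundary=0.35):
    """Label a coda obstruent from the C/V duration ratio alone."""
    return "voiceless" if closure_ms / vowel_ms > cv_boundary else "voiced"

# Halving every duration (fast speech) leaves the ratio, and the label, intact,
# whereas any absolute-duration criterion in ms would flip its answer:
print(classify_coda_voicing(vowel_ms=200, closure_ms=50))  # C/V = 0.25
print(classify_coda_voicing(vowel_ms=100, closure_ms=25))  # C/V = 0.25
print(classify_coda_voicing(vowel_ms=140, closure_ms=70))  # C/V = 0.50
```

Real listeners, of course, weigh many cues at once; the sketch isolates only the durational-ratio component discussed in the text.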
In
several other Germanic languages, similar measurements of speech
production timing and perceptual experiments
using manipulations of V and C durations have shown that listeners pay
attention especially to the relative duration of a vowel and the
constriction duration of a following obstruent, that is, stop or
fricative (Port & Mitleb, 1983; Pind, 1995).
So apparently this durational pattern is important for the
production and perception of these Germanic languages.
Could
the V/C Durational Ratio be a Temporal Universal?
The first thing to do when a durational invariant is found is to check whether this timing difference reflects a universal and unavoidable articulatory correlate of a contrast in, say, glottal pulsing during a consonant constriction. If
this turned out to be the case, it would support the standard view that
the phonology, specified in terms of patterns of universal phonetic
features, is the locus of all differences between languages. `The
phonetic capabilities of man', said Chomsky and Halle, are an inventory
fixed for, at least, historical time. This species-wide alphabet is
said to facilitate rapid language acquisition by infants and to assure
that speakers of different languages can learn each other's language
(given enough time). So
for this temporal pattern to show evidence of being universal, either
in association with some segmental feature or not, would fit the view
that it is only in the choice and deployment of static features that
languages may differ from one another.
Any temporal differences that might show up can only occur as
the result of the speech production or perception apparatus, which is,
by hypothesis, a universal.
Phonetic
Implementation.
To deal with these phenomena for speech production within the
symbolic view of language, one might postulate an implementation system
(outside language) that takes a (universal) alphabet of phonetic
symbols as input and translates them into output gestures with
particular temporal constraints. Individual features will differ in
their effects on the temporal behavior of the output device.
In this case, the implementation rule would have to assure that the ratio of the vowel duration to the following consonant constriction is about 1.5 or greater for the +Voice case (so the vowel is quite a bit longer than the consonant constriction), and closer to 1 for the -Voice case (so the V and C are about the same duration).
Accounts along this line have been proposed for similar
phenomena by, for example, Chomsky and Halle (1968, Chapter 10), Halle
and Stevens (1980), Klatt (1976), Port (1981) and others.
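As a concrete (and purely illustrative) sketch, such an implementation rule amounts to a classifier over the two durations. The function name and the threshold of 1.25 are our own assumptions, chosen to split the roughly 1.5 (+Voice) and roughly 1.0 (-Voice) regimes described above:

```python
def voicing_from_ratio(vowel_ms, closure_ms, threshold=1.25):
    """Classify a coda obstruent's voicing feature from the V/C
    durational ratio alone (the threshold is a hypothetical value)."""
    ratio = vowel_ms / closure_ms
    return "+Voice" if ratio >= threshold else "-Voice"
```

On this sketch, a lab-like token with a 160 ms vowel and 90 ms closure classifies as +Voice, while a lap-like token with a 100 ms vowel and 110 ms closure classifies as -Voice.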
To see how such implementation systems might work, we will look more closely at Klatt's (1976) model (also Port, 1981).
Klatt observed, first, that although many factors might
influence the duration of a vowel, the effects of each factor depend in
part on how many others apply at the same time.
For example, comparing the duration of the vowel in rabid
vs. rapid (where the voicing feature will make the vowel
shorter in rapid) with a similar difference in lab vs. lap,
the vowel might be 12% longer in rabid than rapid but 18%
longer in lab than lap.
The reason for the difference is that the vowel is overall
longer in both lab and lap than in rabid and rapid
due to the presence of an additional syllable in the rabid-rapid
pair. This nonlinearity
led Klatt to propose that each vowel might have a minimum duration
beyond which it could not be compressed. Then he applied constant ratio
(that is, linear) timing rules to the remainder.
Thus each vowel had an `inherent duration' (for [æ], let's suppose, 100 ms, observed from monosyllabic words with a final voiced stop, as in lab or lag, spoken in isolation) and also a minimum duration, estimated as 65% of the inherent duration. Then constant ratio rules would lengthen or shorten the remaining interval (here 35 ms) between the inherent duration (here 100 ms) and the minimum duration. Thus, to implement the vowel duration in, say, lackey, we take the 35 ms (inherent minus minimum) and multiply by 0.6 (to shorten for the following voiceless stop) and then by 0.7 (to shorten for the second syllable). Then we take the resulting 14.7 ms (= 35 × 0.6 × 0.7) and add it to the minimum duration. This gives a target duration for [æ] in lackey or backing or rapid of 79.7 ms (± noise). By this method, every segment either has its `inherent duration' or one derived from it by such temporal implementation rules depending on the context.
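The arithmetic of this scheme can be sketched in a few lines. This is a minimal illustration of Klatt's proposal using the numbers from the example above; the function name and default minimum fraction are our own:

```python
def klatt_duration(inherent_ms, shortening_factors, min_fraction=0.65):
    """Klatt-style target duration: only the compressible part of the
    segment (inherent minus minimum) is scaled by the context factors;
    the minimum duration is never compressed."""
    minimum = inherent_ms * min_fraction      # e.g., 65 ms for [ae]
    compressible = inherent_ms - minimum      # e.g., 35 ms
    for factor in shortening_factors:
        compressible *= factor                # constant-ratio rules
    return minimum + compressible

# [ae] before a voiceless stop (x0.6) in a two-syllable word (x0.7):
target = klatt_duration(100, [0.6, 0.7])      # 65 + 35*0.6*0.7 = 79.7 ms
```

Note that with no applicable rules the function simply returns the inherent duration, matching the model's claim that every segment either keeps its inherent duration or has one derived by context rules.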
Leaving
aside the issue of how accurately this way of specifying the rules
works (of course, there are several other proposals for how to write
these rules, e.g., van Santen, 1996), there are many reasons why this
entire approach is implausible. The first problem is the question of what use these durations in milliseconds might be. Who or what will be able to use these numbers to actually achieve a target duration of 79.7 ms for this vowel? There is no existing model of motor control that could employ such specifications. We would need another theory of motor control to make use of these specs and attempt to generate a vowel of n milliseconds duration (see Fowler, Rubin, Remez, & Turvey, 1981; Port, Cummins, & McAuley, 1995). Second, durations
in milliseconds seem fundamentally misguided since speakers talk at a
range of rates. So it seems that it should be relative durations that
are employed, not absolute durations (see Port, Cummins, & McAuley,
1995). Third, such a system has no apparent way for global timing
patterns (e.g., regular stress timing) to influence timing.
This model computes a duration for one segment at a time only.
Longer intervals get their duration just by adding up the individual
segments that comprise it. Fourth and finally, if the duration pattern for the Germanic voicing contrast is
really the relative duration of a vowel to
the following consonant, this kind of model will have
difficulty. The vowel duration effect of the voicing is computed by a
context-based rule like the one above, while the stop closure is just
the inherent duration of the following consonant. It is difficult to
see how a rule-governed duration and an inherent duration could be
coordinated to assure a particular durational ratio since the ratio
itself plays no role in the rules.
Despite
these implausible features, it is difficult to prove the impossibility
of such an account. After all, if formal models can simulate a Turing
machine, they are very likely to be able to deal with such relational
temporal phenomena by some brute-force method. But an
implementational solution along this line is only interesting if
certain very specific constraints are applied to the class of
acceptable formal models, as Chomsky has frequently pointed out (1965).
And, if one can always add additional phonetic symbols to the universal set and apply as many rules as one pleases, then it could be claimed
that only the Germanic language group happens to employ a particular
feature that is universally implemented in this temporal way (even
though no non-Germanic languages have been observed to exhibit it).
But to always permit postulation of new symbols for every new
temporal effect is surely not sound scientific method.
Yet,
short of proliferation of unique features, an implementation rule for
the Germanic voicing effect cannot be universal.
Most languages in the world (e.g., French, Spanish, Arabic, Swahili) do not exploit the relative duration of a
vowel to the following stop or fricative constriction as correlates of
voicing (Chen, 1970; Port, Al-Ani, & Maeda, 1980).
We know from classroom experience that if you play English
stimuli varying in vowel and/or stop closure duration -- stimuli that
lead native English speakers along a continuum from rabid to rapid
-- those stimuli with varying V/C ratio will tend not to change
voicing category at all for French, Spanish or Chinese listeners. Their
voicing judgments are almost completely unaffected by V/C ratio. They
just pay attention to glottal pulsing during the constriction. Such
durational manipulations may affect the naturalness of the stimuli, but
do not make them sound more Voiced or less Voiced.
The
most plausible conclusion to be drawn from this situation is that the
Germanic languages share the property of manipulating V/C ratio for
distinguishing classes of words from each other.
English listeners, at least, make a categorical choice between
two values of a feature that might be described as `Voicing' (or as
`Tensity' or `Fortis/Lenis'). But there is nothing universal about this
property. It just happens to be a way that one family of closely
related languages controls speech production and speech perception to
distinguish vocabulary items. Thus,
we have an undeniable temporal pattern, one that requires some rather
specialized machinery to produce or perceive, yet which apparently must
be a learned property of the phonological grammar of specific languages
as a `feature' for contrasting sets of words. To call this distributed temporal pattern a `symbol', that is, a static token, is to make it impossible to see what it really is: an intrinsically temporal pattern that is part of the specification of
words in this group of languages.
There are other well-known timing effects related to the prosodic patterns of languages. Kenneth Pike first commented in 1945 that the languages of the world seem to fall into two general types as far as their prosodic pattern is concerned: stress-timed languages and syllable-timed languages. In modern
linguistics, prosodic timing is most often viewed in terms of this
dichotomy. The taxonomy rests on the impression that languages seem to
have distinct rhythmic styles. Languages like English and Russian seem
to be stress-timed, since stressed syllables have a tendency to be
regularly spaced in time, whereas in French and Spanish, it seems that
each syllable is trotted out in a regular rhythm. This is based on
auditory impressions of regular time intervals, as, for example, by
tapping a finger on a table (Jones, 1918/1932; Pike, 1945; Abercrombie, 1967). The taxonomy
implies a hypothesis of isochrony, or equal spacing in time.
Impressionistically, equal time intervals seem to apply to the level of
stressed syllables in English and Russian. Thus, if we say, e.g., `He EATS poTAtoes toDAY', the stressed syllables seem equally spaced. One could tap a finger for each one. But if we say `He's EATen the poTAtoes toDAY', it seems like the timing is almost the same even though there are now two additional unstressed syllables inserted between EAT- and -TA- (especially if you tap your finger on each stress). On the other hand, in French, for example, if we say `Je ne parle pas français', it seems like a finger could be tapped for each of the 5 syllables, and that all are about equally spaced. Pike,
seconded by Abercrombie, suggested that all languages might fall into one timing type or the other.
When
careful instrumental measurements are used, however, it is typically
found that neither inter-stressed intervals nor inter-syllable
intervals are actually produced isochronously (Classé, 1939;
Bolinger, 1965; Pointon, 1980; Wenk & Wioland, 1982; Tajima, 1998).
Of course, it should not be a surprise that perfect isochrony is not found; that is a very challenging test to meet. Naturally, lacking a realistic theory of timing and given so difficult a test, experimental studies have repeatedly failed to support hypotheses of isochrony. The question is: just how regular does the timing have to be to support such a hypothesis? No answer was available.
Furthermore,
investigation of the mora in Japanese during the 1980s called
into question the original simple taxonomy. A third type of timing
pattern was revealed. In Japanese, a mora can be a simple CV syllable,
like -ta-, but syllables with a long vowel, like too-
or a final consonant, like han-, count as two moras.
Thus a word like, say, Honda
has 3 moras, just like Fujita. And Tookyoo has 4
moras like Fujiyama. (In the Japanese syllabic writing system, that is, the hiragana and katakana scripts, each mora is written with a single symbol.)
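The mora count of a romanized word can be approximated with a simple rule of thumb: count each vowel, each moraic nasal, and the first half of each geminate consonant. The sketch below is our own illustration, not a claim about Japanese orthography, and it ignores many complications (palatalized onsets like kya are handled only crudely):

```python
VOWELS = set("aeiou")

def mora_count(romaji):
    """Approximate mora count for a romanized Japanese word:
    vowels, pre-consonantal or final 'n', and geminate consonants
    each contribute one mora."""
    count = 0
    for i, ch in enumerate(romaji):
        nxt = romaji[i + 1] if i + 1 < len(romaji) else ""
        if ch in VOWELS:
            count += 1            # every vowel nucleus is a mora
        elif ch == "n" and nxt not in VOWELS and nxt != "y":
            count += 1            # moraic nasal, as in han-
        elif ch == nxt:
            count += 1            # first half of a geminate, as in -kk-
    return count
```

On this rule, honda and fujita both count 3 moras, tookyoo and fujiyama both count 4, and hikkakerareru counts 7, matching the examples in the text.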
The traditional description of the mora found in Japanese pedagogy holds that it is an isochronous unit: all moras have equal duration (so that Honda and Fujita should take the same amount of time to pronounce). Despite
experimental support for the mora as an isochronous temporal unit (Han,
1962; Port et al., 1980; Homma, 1981), it has remained somewhat
controversial among phoneticians (Beckman, 1982).
Disagreements
about the mora arose primarily from whether each individual mora is the
same duration as each other mora -- so that any compensation for
inherently longer or shorter consonants and vowels is achieved within
the mora (Beckman, 1982)--
or whether compensation may occur in neighboring moras. In the latter
case, the regularity of mora timing should be observable only by
looking at a window of several moras, e.g., a whole word or
phonological phrase (Port et al., 1980; Port et al., 1987; Han, 1994).
Figure 2. Five native Japanese speakers (Port et al., 1987) read words from 6 word groups (one group is illustrated in Table 1). The word durations are pooled across subjects according to the number of moras. Each word group is labeled by its initial mora (so ka is the first mora in all tokens in that group). As the number of moras increases, the word duration increases linearly at both fast and slow tempos. These data show that word duration is almost completely determined by the number of moras, not the number of syllables. Other experiments show that if a mora is inherently short, then segments in the preceding and following syllables stretch to compensate.
HI group:

Number of moras | Word          | English gloss
1               | hi            | sun
2               | hika          | subcutaneous, hypodermal
3               | hikka         | faux pas, misstatement
4               | hikkaku       | to scratch, claw
5               | hikkakeru     | to hang, hook
6               | hikkakenai    | not to hang
7               | hikkakerareru | get hung
Table 1.
An example of number of moras for one group of words used in Port et
al. (1987). The other groups followed a similar pattern of stacking up
moras. The apostrophe indicates a pitch accent in the utterance.
In Port et al. (1987) it was found that whole words with the same number of moras had nearly the same duration no matter what their segmental content or number of syllables (Figure 2 and Table 1). Changes in speaking rate resulted in shortening of the average mora duration. Other experiments show that within a word, neighboring segments within the mora, as well as adjacent moras, expand and compress in compensation for their neighbors, as shown in Figure 2. Other experiments have shown that perceptual segmentation of speech based on mora units is also found in prelexical processing in Japanese (Otake, Hatano, Cutler & Mehler, 1993; Cutler & Otake, 1994), suggesting that the mora is highly salient as a unit in the production and perception of Japanese. So, if measurements are done just the right way, Japanese does show a type of isochrony even if it is neither syllable timing nor stress timing. More rigorous methods are surely needed for studying English and other languages. No one should have expected absolutely equal durations for any units in speech. Furthermore, English speakers can sometimes exhibit apparent syllable timing, for example in certain styles of babytalk. So a more flexible experimental and theoretical framework was needed to explore these issues without assuming that only one timing principle can apply to each language.
Speech Cycling Experiments.
The idea of speech cycling evolved from research methodologies employed in studies of limb motor control, which examine the behavior of the limbs when two periodic patterns interfere or interact with each other (e.g., Haken, Kelso, & Bunz, 1985; Kelso, 1995; Treffner & Turvey, 1993). The idea is to study the timing of repetitive speech when it interacts with, or is coupled with, an external auditory metronome (Cummins & Port, 1998; Cummins, 1997; Tajima, 1998; Port, Tajima, & Cummins, 1999; Tajima & Port, in press). Subjects
repeat a short piece of text in time with a computer controlled
metronome-like pattern. It turns out that there are strong constraints
on the method of performing these tasks, some of which are apparently
universal, but others of which characteristically differ from language
to language. If it is true
that speakers of different languages perform such tasks in ways that
differ between languages, this would be further evidence that the
actual `grammars' employed by native speakers include temporal
characteristics.
So, for example, notice that if English speakers repeat a phrase like `Take a pack of cards, Take a pack of cards, Take a pack of cards', at a rate of about 1 repetition per second, they are likely to pronounce it putting a pitch accent on both take and cards.
The pattern of timing of the beats for these two pitch accents is very likely to be one or the other of two rhythms -- either a slow two-beat pattern (as in `1, 2; 1, 2; TAKE a pack of CARDS; TAKE a pack of CARDS') or else a fast three-beat pattern (`1, 2, 3; 1, 2, 3; TAKE a PACK of CARDS; TAKE a PACK of CARDS; TAKE a PACK of CARDS').
The 2- and 3-beat harmonic timing patterns are probably
universal (since these meters show up in most musical traditions), but
locating the onsets of pitch-accented syllables at the beats may be an
idiosyncratic property of English.
Figure 1. This figure demonstrates the use of speech cycling in Cummins & Port (1998). In this study, a corpus of 30 short phrases with identical prosodic structure was used. Each phrase was of the form X for a Y, like beg for a dime. Subjects repeated these phrases to a metronome pattern, and the beat, or vowel onset location, was found for each target syllable. A succession of 14 pairs of alternating high (H) and low (L) tones was used in each trial. The interval from the high to the low tone was fixed at 700 ms (to keep speaking rate constant). The relative time of the low tone to the next high tone varied randomly over the range .2 to .7 (observed phase = the duration in milliseconds between the H and L tones divided by the duration from one H tone to the following one). The subjects were asked to speak the phrases such that the first word of each phrase (e.g., beg) lined up with the high tone, and the last word (e.g., dime) lined up with the low tone. The simplest hypothesis is that there are no rhythmic constraints on speech production. If this were true, subjects should have produced the onset of the final stress (dime) wherever the L tone occurred in the phrase cycle (from .2 to .7). Instead, the histograms are strongly multimodal, with three clear modes near .33, .5, and .66 (subject KA showed only two modes, at .3 and .5). Thus the subjects could not accurately perform the task but were strongly biased to locate the final syllable onset at harmonic fractions of the complete H-H cycle (that is, at 1/3, 1/2, or 2/3).
By
picking analogous text material in two languages and urging subjects to
adopt specific rhythmic patterns, it is possible to compare the ease
and difficulty of various temporal patterns in two languages. For
example, Tajima & Port (in press) employed a three-beat
(waltz-like) speech cycling pattern to reveal distinct temporal
characteristics for English and Japanese. It was found that English
speakers show more `stress-timing' effect than do Japanese speakers in similar situations. Thus the characterization of at least English and Japanese as stress-timed and mora-timed can be justified experimentally. Furthermore, they found
evidence that individual English speakers can slip from timing that
favors equal timing for individual syllables to timing that is more
strongly concerned with the spacing of just the stressed syllables.
These
data strongly suggest that prosodic features of timing should be
included as part of a linguistic grammar if, by a `grammar', we
mean the idiosyncratic characteristics of each language that
differentiate it from others. One way in which languages differ, as
shown by speech cycling experiments, is in how syllables are identified
as `prominent' (cf. Tajima, Zawaydeh & Kitahara, 1998; Cutler
et al. 1986). Since there are temporal aspects that differ from
language to language, the notion of a grammar must be expanded to
include such properties. The alternative, to leave timing out of our
understanding of language, is to underestimate what is involved in
speaking a language competently.
Even leaving aside the issue of language differences in timing, the very naturalness of speech cycling tasks (that is, the tendency, even without a metronome, to adopt a rhythmical pattern in our speech, especially when the speech is repetitive) reveals something fundamental about human language.
Like so many other human behaviors, speech is very naturally
and unavoidably entrained to periodic patterns. There are countless
examples of periodically structured speech: songs, auctioneering,
preaching, sports announcing, canonical babbling in infants, marching
chants, work songs, etc. (Port et al., 1999; Ejiri, 1998). Even an everyday speech task like reading a list of words aloud is enough to strongly encourage regular speech periodicity (Leary & Muench,
1998). Like the speech cycling technique, these examples of everyday
speech remind us of the ubiquity of cyclic behavior: from breathing and
running to pacemaker cells in the heart. Is it surprising, then, to find that speech may exhibit timing similarities to other biological and neurological functions? We believe it is a good bet that many cognitive functions, both linguistic and non-linguistic, will eventually be shown to be entrainable to rhythmic patterns at some appropriate rate.
To
acknowledge similarities in the timing of speech to timing found in
other biological systems (animal and human) implies a description of
speech that is dynamical and changes in time[1].
Formal logic and symbolic phonological accounts of prosody (Chomsky
& Halle, 1968; Liberman & Prince, 1977; Selkirk, 1984; Hayes,
1995) provide us with no basis for interpreting the temporal
characteristics of stressed and unstressed syllables (van Gelder &
Port, 1995; Port et al., 1999). Yet, as we have seen, speech prosody
does affect the timing of language in many important ways. So once
again, the strong tendency of speech to entrain to periodicity like
other motor actions is evidence supporting the view that the linguistic
structure is intrinsically dynamic, not symbolic.
We have pointed out that certain phenomena are counterevidence for the Symbolic Language hypothesis. One kind of counterevidence is the demonstration that temporal phenomena are sometimes intrinsic to language, as shown in the previous sections. A second kind of critical evidence would be a convincing demonstration of patterns that are linguistically distinct and yet not discretely different: not different enough that they can be reliably differentiated, and yet not the same either. This is
a difficult set of criteria to fulfill, but in fact such a situation
has been demonstrated in a number of experiments for several languages.
Position of stop | German word                          | English gloss
Initial          | der Back                             | mess table
                 | der Pack                             | pack, bundle
Medial & final   | Alben [alben] (pl.), Alb [alp] (sg.) | elf
                 | Alpen [alpen] (pl.), Alp [alp] (sg.) | mountain pasture
Table 2. Some examples illustrating the traditional data regarding word-final devoicing in German. Using an atomic inventory of static features, the following phonological rule is said to apply: [-sonorant] → [-voice] / ____ $, where $ is a syllable boundary.
The best studied case is the near-neutralization of voicing in syllable-final position in Standard German. Here, final voiced stops and fricatives, as in Bund and bunt (`club', `colorful'), are said to neutralize to the voiceless case as shown in Table 2. That is, although Bunde and bunte show that the words contrast in the voicing of the apical stop, the pronunciations of Bund and bunt seem to be the same. Both sound like [bUnt].
But the difficulty is that they are not pronounced exactly
the same (Port, Dalby & O'Dell, 1987; Port & Crawford,
1989). These pairs of
words, with final stops and fricatives, actually are slightly different
as shown in Figure 4. If
they were the same, then in a listening task you would expect 50%
correct (pure guessing like English too and two);
if different, you expect 99% or better correct identification under
good listening conditions (like German Bunde and bunte).
Instead, they are different enough that listeners can guess
correctly which word was spoken with only about 60-70% correct
performance (Port & Crawford, 1989)!
But such performance shows that the word pairs are neither
clearly the same nor clearly different. The voicing contrast is almost
neutralized in this context, but not quite. The differences can be
measured on sound spectrograms, but for any measurement one chooses
(vowel duration, stop closure duration, burst intensity, amount of
glottal pulsing during the closure, etc.), the two distributions
overlap considerably.
Figure 4. Schematic waveforms of several sample German word pairs by one of the speakers in Port & O'Dell (1985). The onset of the first vowel begins at 0 ms, extending for the length of the white rectangle; the smaller gray rectangle is the period of voicing visible during stop closure; the straight line is the voiceless portion of the stop closure; and the triangle represents the stop burst duration (the release of the stop). These results do not support the notion of a static, binary voicing feature ([±voice]). While the timing for the voiced and voiceless word pairs is similar, there is a tendency for the vowel before the underlyingly voiceless obstruent (e.g., the vowel in Alp) to be slightly shorter than the one before the voiced obstruent. There is also more voicing into the stop closure for the voiced stops, and a longer stop burst for the `underlyingly' voiceless stops than for the corresponding voiced ones.
One
might hope that this could be blamed on some speakers reliably
differentiating them and others just guessing.
But most listeners do better than chance.
So these minimal pairs lack an essential property of any symbol token (Manaster-Ramer, 1996; Port, 1996): they can be said to be neither discretely different nor identical. Yet this situation could not be a mere performance effect either, since all speakers of the language (or at least most) exhibit this rule-governed pattern in many dialects including Standard German. Nor
is this situation unique. A similar phenomenon can be found in most
varieties of American English where the /t/ and /d/ in butting
and budding are said to be `neutralized' to an apical
flap. But again statistically distinct distributions can be found and
listeners can typically guess which word was spoken with better than
chance accuracy. (Fox & Terbeek, 1976; Port, personal
communication).
Altogether, then, there is considerable evidence that several essential and unavoidable predictions of the SL hypothesis have clear exceptions. In our view, the SL hypothesis, along with the AP hypothesis, must be abandoned. Surely some aspects of language exhibit nearly symbol-like properties, but apparently this is not all there is to human languages. This means that linguistics cannot make the convenient assumptions of timelessness and digitality.
We believe this implies that linguistics must assume a rather
different kind of processing system underlies the grammars of skilled
speakers of a language. These production and perception mechanisms
will:
1. process stimulus phenomena and interact with the world in real time,
2. exhibit a tendency to behave periodically and to entrain to periodic patterns at the time scale of the mechanisms of body movement (let's say, with periods longer than 20 ms). That is, the physical systems of our bodies, and the neural system coupled to them in real time, cannot help being dynamical and behaving like oscillators under certain conditions, and
3. exhibit a strong tendency toward discrete categoricity in the production and perception of speech sounds, without being required to achieve it successfully all the time.
Of course, a satisfactory psychological theory for linguistics to build on does not yet exist. But there are some attempts at theories of speech perception and production that appear promising to us. They will be introduced in the next two sections.
Section V: Dynamical models of speech perception
Naturally, there have been a few who have insisted on biologizing, or embodying, linguistic behavior all along. There is work on the peripheral aspects of language in this mode: both models of speech perception and others of speech production. Some of the models of speech perception are now fairly sophisticated and plausible. Grossberg has been developing his ART model, Adaptive Resonance Theory, for many years. This system models pattern recognition based on partial matches and learns to recognize new patterns given repeated exposure to novel inputs. The model is able to continue learning at all times and to simultaneously acquire new categories when needed (Grossberg, 1980, 1986, 1995). In recent years Grossberg and his group have elaborated the model to directly address issues in the speech perception literature that have sat on the table for a decade with few attempts at direct modeling.
To
describe the ART model, we can look first at how it categorizes
sensory inputs. An archetypal case might be to recognize a letter or
other visual object. An input light pattern excites a particular set of
feature detectors in a field called F1. (These features that
appear in F1 were extracted earlier from repeated stimulation by
an unsupervised learning process.)
A neural network with learned weights connects this field of
features to an F2 field. F2 has mutually competitive
units specialized for each basic category that has also been learned by
the system from exposure to the environment.
Weights on the F1-F2 connections assure that one unit in F2
(where a unit can be either a single node or a group of nodes) will be
somewhat more excited than any of the others. As soon as this excitation exceeds a threshold, this unit begins to inhibit its competitors and sends a feedback signal back to the units of F1, exciting only the F1 units that have had a learned statistical relationship with this F2 unit during previous exposures. This feedback signal represents an attempt to match expectations to the actual input pattern in F1. If enough F1 units receive input from both the sensory system and top-down feedback, then a `resonance loop' is established; F1 excites F2, which excites F1, and so on: the system has reached a stable attractor. (If the match between expectations and sensory inputs is not good enough, then the first-guess F2 unit is inhibited as another cycle begins again using an alternative F2 hypothesis.) The perceptual experience of `seeing' an object or `hearing' a word spoken, according to Grossberg, can only occur when a resonance loop has been achieved.
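The category-search cycle just described can be sketched in a few lines. This is a deliberately minimal caricature of ART, not Grossberg's implementation: the weight matrix, the match rule, and the vigilance value are all simplifying assumptions of ours:

```python
import numpy as np

def art_search(f1_input, weights, vigilance=0.7):
    """Minimal ART-style search: try F2 units in order of bottom-up
    activation; a unit wins (resonates) only if its top-down
    expectation matches the F1 input well enough, otherwise it is
    inhibited and the next candidate is tried."""
    x = np.asarray(f1_input, dtype=float)
    activations = weights @ x                 # bottom-up F1 -> F2
    for j in np.argsort(-activations):        # best-first search
        expectation = weights[j]              # top-down F2 -> F1 feedback
        match = np.minimum(x, expectation).sum() / x.sum()
        if match >= vigilance:
            return j                          # resonance: stable attractor
    return None                               # no resonance: learn a new category
```

For example, with two learned categories whose expectation patterns are [1, 1, 0] and [0, 1, 1], an input of [0, 1, 1] resonates with the second unit; an input matching no unit above the vigilance threshold would trigger the learning of a new category.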
Grossberg's group has recently shown that with very little modification, this model can also recognize word-like patterns at F2 with auditory stimuli arriving over time, and can even employ temporal contrasts to distinguish words despite variation in rate (Boardman et al., 1999; Grossberg & Myers, in press). Since
all the linguistic units (phonemes, words, etc) are discovered by the
system on the basis of statistical analysis of inputs, discreteness and
context independence are not required properties. Fuzzy categories,
partially discrete categories and context dependent variants would
appear to be quite natural results of the acquisition processes
employed by this kind of system. The
most economical representation will not necessarily be the shortest or
the one requiring the fewest stored objects.
Masking Fields.
To apply ART to spoken language in general requires some
additional theoretical development since auditory inputs that arrive
over time are also hierarchically structured as language-specific sound
segments, morphemes, words, phrases, etc. The notion of a `masking field' is an interesting idea even though it has not yet been fully developed (Cohen & Grossberg, 1987; Grossberg, 1986). A
masking field is a considerable expansion of the F2 level in which
there are units corresponding to perceptual objects on several spatial
or temporal scales. Any word or phrase has other, shorter alternative
parsings. Thus, if we pronounce the word catalogue, we have also pronounced cat, cattle, a log, etc., as well as the individual phonetic segments /k, æ, ɾ, ə, l, ɑ, g/.
Why do we hear only the single word rather than the various other possible components that are also somehow there? Why do we `see' (that is, have an awareness of seeing) a letter E, when we might also, or instead, have `seen' one vertical bar and three horizontal bars? The
masking field allows all the alternative coherent parsings to be
activated but there is a hardware bias in favor of larger or longer
identification units so that they usually win out over shorter ones and
reach the state of resonance. This bias for the longest possible unit
that is compatible with all the evidence accounts for why we are aware
of only catalogue but not cattle, log, etc.
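This bias toward the longest compatible unit can be caricatured as a one-line competition over a toy lexicon. This is our illustration only: a real masking field runs this competition over graded activations, not a maximum over strings:

```python
def masking_winner(utterance, lexicon):
    """Among lexical units consistent with the start of the input,
    the largest one wins the resonance competition."""
    candidates = [w for w in lexicon if utterance.startswith(w)]
    return max(candidates, key=len) if candidates else None
```

Given the input "catalogue" and a lexicon containing cat, cattle, catalogue, a, and log, the longest compatible unit, catalogue, masks the shorter parses, just as in the perceptual account above.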
For language perception, a masking field of some kind might be imagined that employs a series of levels with learned category-like units, one level for each linguistically relevant layer: phonemes, morphemes, words, common phrases, etc. The specific units on each level are attractors that are (a) acquired through a learning process, (b) susceptible to change over time, (c) not always discretely different from one another, and (d) able to manifest themselves as temporal patterns (including especially periodic ones). Only time will tell if dynamical models of cognitive behavior of this style will prove capable of underpinning actual linguistic analysis of phonologies and syntax.
Section VI: Dynamical Theories of Motor Control
The
issue of motor control for speech has been strongly influenced by work
on general motor control especially of the limbs. Since
Bernstein (1967), there has been a strong tradition of dynamical models
of speech. There isn't space here to go into this issue very far,
but recent work in the dynamical tradition (Browman &
Goldstein, 1995; Saltzman & Munhall, 1989; Guenther, 1995) has shown
how many aspects of speech production are naturally interpretable in
terms of dynamical behavior of the neuro-physiological system that is
the speech apparatus.
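The flavor of such dynamical accounts can be conveyed with a toy gesture model, loosely in the spirit of task-dynamic approaches like Saltzman and Munhall (1989) but with none of their gestural machinery: the articulator variable, stiffness, and target values below are all invented for the example.

```python
# A toy point-attractor "gesture", loosely in the spirit of task-dynamic
# models (Saltzman & Munhall, 1989) but not their actual system; the
# articulator variable and parameter values are invented.

def gesture(x0, target, k=100.0, dt=0.001, steps=2000):
    """Euler-integrate x'' = -k (x - target) - b x' with critical damping."""
    b = 2.0 * k ** 0.5          # critical damping: no overshoot
    x, v = x0, 0.0
    for _ in range(steps):
        a = -k * (x - target) - b * v
        v += a * dt
        x += v * dt
    return x

# "Lip aperture" (arbitrary units) slides smoothly from 1.0 to the
# closure target 0.0, the way an articulatory gesture approaches its goal.
final = gesture(x0=1.0, target=0.0)
print(abs(final) < 1e-3)  # True: the attractor has pulled x to its target
```

The point of the sketch is that timing falls out of the dynamics: how long the movement takes is determined by the stiffness and damping of the system, not by a symbolic duration specification.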
These
dynamical models of motor control and perception of compositional
patterns are still in their infancy as far as being explicit systems
that account for speech and language. They scarcely begin to address
most issues that are of interest to linguists. Nevertheless they appear
able to plausibly account for such properties as:
(a) parsing at multiple scales simultaneously, yet delaying perceptual
decisions until necessary information is available, thereby
implementing hierarchical structure presented in time,
(b) the discreteness of categorical perception in many cases (Liberman
et al., 1967; Kuhl & Iverson, 1995),
(c) priming of words even when they are not consciously perceived,
(d) attraction of prominent syllable onsets to periodic patterns and
nested hierarchical structures like feet, phrases, etc. (Cummins &
Port, 1998; Tajima & Port, 1999),
(e) the incommensurability of speech sounds from language to language
(Logan, Lively and Pisoni, 1991; Port, Al-Ani and Maeda, 1980).
These provide a good start toward mechanisms capable, in principle, of dealing with the nearly discrete (and somewhat symbol-like) structures of language using mechanisms that are fundamentally continuous and dynamical and that operate in real time.
Section VII: Conclusions
We
began this review of the problem of timing and temporal patterns in
human speech by first exploring the theoretical constraints regarding
the issue of timing that stem from the ubiquitous assumption within
linguistics that language is, in fact, a formal symbolic system (and
not something that merely approximates a formal symbolic system in many
circumstances). This assumption, which seems so obvious to linguists as
to scarcely require any justification at all, turns out to have
devastating consequences for understanding how timing could play any
role in language.
The
Symbolic Language assumption (a) prevents timing from being visible as
a property of human languages (thereby short-circuiting research on the
many temporal aspects of phonetics and phonology), (b) forces the
highly implausible assumption that phonetics is based on an a priori
universal segmental inventory, and (c) prevents exploitation of data on
temporal phenomena (such as processing time, reaction time, response
latency, etc.), thereby undermining research in psycholinguistics.
Further, (d) it forces the postulation of a sharp boundary
between the formal, symbolic, discrete-time domain of language and
human cognition (competence) in contrast to the continuous,
fuzzy, real-time domain of human physiology (performance).
This gap has thus far proven unbridgeable and will probably remain so
as long as the assumption that language is nothing but a formal
symbolic system holds sway within linguistics and biases research
in the disciplines that find themselves dependent on aspects of
linguistic theory, like phonetics, phonology, psycholinguistics,
neurolinguistics, etc.
Of
course, despite such severe theoretical misdirection, phoneticians and
speech scientists have gone ahead and studied temporal phenomena in
languages anyway. They
have done so for many reasons, most often not to discover properties of
individual languages but to develop natural-sounding synthetic
speech (Klatt, 1976; van Santen, 1996), or practical speech
recognition, or to understand and treat disorders of speech production
or perception.
In
doing research on speech timing, they have found that many acoustic
features are manipulated by specific languages in the specification of
words, including features such as the rise time of energy onset and
the duration of acoustic segments (vowel duration, stop-closure
duration, fricative or nasal consonant duration, and so on). A variety
of specific rules for temporal implementation of phonological features
and segments have been developed to `account for' these patterns, even
though rules that specify durations in terms of absolute timing, like
milliseconds, are implausible for many reasons.
Still, specifically durational cues for particular phonological
features (like postvocalic stop and fricative voicing in many Germanic
languages) are now well documented.
In
addition, the most convincing cases of nondiscrete sound contrasts, or
semicontrasts, find much of their evidence in the time domain. The
incomplete neutralization due to the word-final devoicing rule in
German, like the incompletely neutralizing flapping rule for American
English /t/ and /d/, has consequences that are most clearly observed in
the time domain.
Furthermore,
this research has turned up evidence of some larger-scale temporal
patterns in human language, like the tendency toward regularity of
speech timing of stressed syllables in some languages (like English),
regularity of the timing of moras (in Japanese), and the possible
regularity of intersyllabic intervals (in Chinese or French, although
we have not seen good experimental evidence to back up the intuitive
feel of such a timing strategy).
So,
all of these phenomena make the traditional linguistic view that speech
sounds are nothing but serially ordered static segments (where the
segments are vectors of static distinctive features) highly unlikely.
Timing is apparently intrinsic to human language, not imposed
only during output processes. This
should be reassuring in many ways, because it supports a view of language
as real psychological behavior, as a cognitive skill, embodied in real
human brains, rather than as a static inventory of structures and
formal rules in some Platonic idealized space.
Of
course, the consequence of such a view is that the traditional
discrete-time symbol manipulation processes for symbol strings and
trees so familiar in linguistic thinking for the past 40 years, simply
cannot provide the mechanisms that will be required to understand the
production and perception of words, phrases and sentences.
An entirely new psychological theory will have to underlie all
linguistic thinking. What kind of mechanisms could these be?
In
fact, theoretical approaches to psychological mechanisms that are
compatible with a view of language that takes time and timing seriously
have been under development in various laboratories in recent years.
These
theories are unfamiliar, not only to most linguists, but even to many
psychologists. The most
sophisticated systems for language perception and production seem to us
to be those developed from the
ART
model by Steven Grossberg and his colleagues.
These models assume that perceptual information arrives in time
(rather than being presented statically all at once); they can delay
perceptual commitment, when necessary, until essential information
arrives; they employ temporal information of various kinds and learn
intrinsically temporal patterns; and they can learn discrete categories
from experience while apparently remaining capable of acquiring
nondiscrete units as well. These
models are complex and demand mathematical and programming
sophistication that is rare within linguistics and even experimental
psychology. Certainly,
they still require much further development. But, they do suggest the
feasibility of continuous time models capable of dealing with such
complex temporal structures as human language.
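The delay-of-commitment property just mentioned can be illustrated schematically. The sketch below is not ART's resonance dynamics; the competing word hypotheses, evidence values, and decision margin are all invented for illustration.

```python
# Schematic illustration of delayed perceptual commitment (not ART's
# actual resonance dynamics). Evidence for competing word hypotheses
# accumulates frame by frame, and the system commits only once one
# hypothesis leads by a fixed margin. All values here are invented.

def perceive(evidence_stream, margin=1.0):
    """Return (winning word, frame index at which commitment occurred)."""
    totals = {}
    for t, frame in enumerate(evidence_stream):
        for word, support in frame.items():
            totals[word] = totals.get(word, 0.0) + support
        ranked = sorted(totals.values(), reverse=True)
        # Commit only when the leader's advantage exceeds the margin.
        if len(ranked) > 1 and ranked[0] - ranked[1] >= margin:
            return max(totals, key=totals.get), t
    return None, len(evidence_stream)  # never enough evidence to commit

# Early frames are ambiguous between "bund" and "bunt"; only a later
# frame (say, a vowel-duration cue) resolves the near-neutralized contrast.
stream = [
    {"bund": 0.5, "bunt": 0.5},
    {"bund": 0.6, "bunt": 0.5},
    {"bund": 1.5, "bunt": 0.3},
]
word, when = perceive(stream)
print(word, when)  # commitment is withheld until the third frame
```

The essential point is that the decision variable lives in continuous evidence space and the decision time is an outcome of the input's temporal structure, not a fixed clock tick.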
We
hope that these phenomena and theoretical considerations will lead to
development of a theory of language that is genuinely capable of
embodiment in the human body functioning in a physical world.
Bibliography
Abercrombie,
David. (1967). Elements
of general phonetics. Chicago: Aldine Pub. Co.
Beckman,
Mary. (1982). Segment duration and the `mora' in Japanese. Phonetica, 39,
113-135.
Bernstein, N. (1967) Coordination and Regulation of Movements. (Pergamon Press, London).
Bloomfield,
Leonard. (1926). A set of
postulates for the science of language. Language, 2,
153-164.
Boardman,
I., Grossberg, S., Myers, C., and Cohen, M. (1999). Neural dynamics of
perceptual order and
context
effects for variable-rate speech syllables. Perception &
Psychophysics, in press.
Bolinger,
Dwight. (1965). Pitch accent and sentence rhythm. In Isamu Abe and
Tetsuya Kanekiyo (Eds.),
Forms
of English: Accent, Morpheme, Order
(pp. 139-180). Cambridge,
MA: Harvard University
Press.
Browman, Catherine and Louis Goldstein (1995) Dynamics and articulatory phonology. In R. Port and T. van Gelder (eds) Mind as Motion: Explorations in the Dynamics of Cognition. (MITP, Cambridge), pp. 175-194.
Chen, Matthew (1970) Vowel length variation as a function of the voicing of the consonant environment. Phonetica 22, 129-159.
Chomsky,
Noam, & Halle, Morris. (1968). The Sound Pattern of English.
New York: Harper & Row.
Chomsky,
Noam. (1965). Aspects of the Theory of Syntax. Cambridge,
MA: MIT Press.
Clark,
Andy (1997) Being There: Putting Body, Brain and World Together Again.
(Cambridge: MITP)
Classé,
Andre. (1939). The Rhythm of English Prose. Oxford: Basil
Blackwell.
Cohen, Michael A. and Steven Grossberg (1987) Masking fields: a massively parallel architecture for learning, recognizing and predicting multiple groupings of patterned data. Applied Optics 26, 1866-1891.
Couper-Kuhlen,
Elizabeth. (1993). English
Speech Rhythm. Pragmatics and Beyond. Amsterdam: John
Benjamins.
Cummins,
Fred. (1997). Rhythmic coordination in English speech: An
experimental study. Doctoral
dissertation,
Indiana University.
Cummins,
Fred. (1999). Synergetic Organization in Speech Rhythm. In
W. Tschacher & J.-P. Dauwalder
(Eds.), Dynamics,
Synergetics, Autonomous Agents: Nonlinear systems approaches to
Cognitive
Psychology
and Cognitive Science.
Studies of Nonlinear Phenomena in Life
Science Volume 8
(pp.
256-266). New Jersey : World Scientific.
Cummins,
Fred & Robert F. Port. (1998). Rhythmic constraints on stress
timing in English. Journal of
Phonetics, 26,
145-171.
Cutler,
A., Mehler, J., Norris, D.G., & Segui, J. (1986). The
syllable's differing role in the segmentation of
French
and English. Journal of Memory and Language, 25, 385-400.
Cutler,
A., & Otake, T. (1994). Mora or phoneme? Further evidence for
language-specific listening.
Journal
of Memory and Language, 33,
824-844.
Ejiri,
Keiko. (1998). Relationship between rhythmic behavior and canonical
babbling in infant vocal
development. Phonetica, 55,
226-237.
Fodor,
J. A. (1975). The
Language of Thought.
New York : T. Y. Crowell.
Fowler,
Carol A., Rubin, P., Remez, R., & Turvey, M. (1981). Implications
for speech production of a
general
theory of action. In B. Butterworth (Ed.), Language Production
(pp. 373-420). New York:
Academic
Press.
Fox,
R. and D. Terbeek (1977) Dental flaps, vowel duration and rule ordering
in American English. J. Phonetics 5, 27-34.
Grossberg,
Steven (1980) How does the brain build a cognitive code?
Psych Review 87, 1-51.
Grossberg,
Steven (1995) Neural dynamics of motion perception, recognition
learning and spatial attention. In Port and van Gelder (eds) Mind as
Motion: Explorations in the Dynamics of Cognition (Cambridge; MIT
Press), pp. 449-490.
Grossberg,
Steven (1986) The adaptive
self-organization of serial order in behavior: Speech, language, and
motor control. In E. C. Schwab and H. C. Nusbaum (eds.) Pattern
Recognition in Humans and Machines. Vol 1: Speech Recognition, pp
187-294 (Orlando: Academic Press)
Grossberg,
S. and Myers, C.W. (2000). The resonant dynamics of speech perception:
Interword integration
and
duration-dependent backward effects. Psychological Review, in
press.
Haken,
H., Kelso, J. A. S., & Bunz, H. (1985). A theoretical model of
phase transitions in human hand
movements. Biological
Cybernetics, 51, 347-356.
Halle,
Morris, & Stevens, Kenneth N. (1980). A note on laryngeal features. Quarterly
Progress Report,
Research
Lab of Electronics, MIT, 101, 198-213.
Han,
Mieko S. (1962). The
feature of duration in Japanese. Study of Sounds, 10,
65-75.
Haugeland,
John. (1985). Artificial Intelligence: The Very Idea. Cambridge,
MA: Bradford Books, MIT
Press.
Hayes,
Bruce. (1995). Metrical stress theory: Principles and case studies.
Chicago: University of Chicago
Press.
Hockett,
C. (1954). Two models of grammatical description. Word, 10,
210-231.
Homma, Y. (1981) Durational relationship between Japanese stops and
vowels. J.
Phonetics 9, 273-281.
Jakobson,
R., Fant, G., & Halle, M. (1952). Preliminaries
to Speech Analysis: The Distinctive Features and
Their
Correlates.
Cambridge, MA: MIT Press.
Jones,
D. (1932). An Outline of English Phonetics. Cambridge: Cambridge
University Press, 3rd edition,
1st
edition published 1918.
Kelso,
S. (1995). Dynamic Patterns: The Self-organization of Brain and
Behavior. Cambridge,
MA: MIT
Press.
Klatt,
D. (1976). Linguistic uses of segmental duration in English: Acoustic
and perceptual evidence.
Journal
of the Acoustical Society of America, 59,
1208-21.
Kuhl, Patricia & Paul Iverson (1995). Linguistic experience and the `perceptual magnet effect'. In W.
Strange
(Editor). Speech Perception and Linguistic Experience: Issues in
Cross-Language Research
(York
Press: Baltimore) pp.121-154
Lehiste, Ilse (1970) Suprasegmentals. Cambridge, MA: MIT Press.
Liberman,
A. M., F. Cooper, D. Shankweiler and M. Studdert-Kennedy (1967)
Perception of the speech code, Psychological Review 74,
431-461.
Liberman,
M., & Prince, A. (1977). On stress and linguistic rhythm. Linguistic
Inquiry, 8, 249-336.
Lisker,
Leigh, and Arthur Abramson. Distinctive features and laryngeal control. Language,
44:767-785,
1971.
Lisker,
Leigh. Rabid vs. rapid: a catalogue of cues. Haskins
Laboratories Status Report on Speech
Research,
1985.
Logan, J., S. Lively, and D. Pisoni (1991). Training the perception of
/r/ and /l/: First report. J.
Acous. Soc America 89, 874-886.
Manaster
Ramer, Alexis (1996) A letter from an incompletely neutral phonologist. J.
Phonetics 24, 477-489.
Newell,
Allen, & Herbert Simon. Computer science and empirical inquiry.
Communications of the ACM,
pages 113-126, 1975.
Otake,
T., G. Hatano, A. Cutler & J. Mehler. Mora
or syllable? Speech
segmentation in Japanese. Journal
of
Memory and Language,
32:358-378, 1993.
Peterson,
Gordon E., and Ilse
Lehiste. Duration of syllable nuclei in English. Journal of the
Acoustical
Society
of America,
32:693-703, 1960.
Pind,
J. Speaking rate, VOT and quantity: The search for higher-order
invariants for two Icelandic speech
cues. Perception
& Psychophysics, 57:291-304, 1995.
Pike,
Kenneth Lee. The
Intonation of American English.
University of Michigan Press, Ann Arbor, 1945.
Pointon,
G. Is Spanish really syllable-timed? Journal of Phonetics,
8:293-305, 1980.
Port,
Robert (1981) Linguistic timing factors in combination.
J. Acous. Soc. Amer. 69, 262-274.
Port, Robert (1996) The discreteness of phonetic elements and formal linguistics: response to A. Manaster Ramer. J. Phonetics 24, 491-511.
Port,
Robert F., Salman Al-Ani, and Shosaku Maeda. Temporal compensation and
universal phonetics.
Phonetica,
37:235-252,
1980.
Port,
Robert F., Fred Cummins, and J. Devin McAuley. Naive time, temporal
patterns and human audition.
In
Robert F. Port and Timothy van Gelder, editors, Mind as Motion:
Explorations in the Dynamics of Cognition. MIT Press, Cambridge,
MA, 1995.
Port,
Robert and Penny Crawford (1989) Pragmatic
effects on neutralization rules,
J. Phonetics 16, 257-282.
Port,
Robert, and Jonathan Dalby. C/V ratio as a cue for voicing in English. Journal
of the Acoustical
Society
of America,
69:262-74, 1982.
Port,
Robert F., Jonathan Dalby, and Michael O'Dell. Evidence for mora timing
in Japanese. Journal of the
Acoustical
Society of America,
81(5):1574-1585, May 1987.
Port,
Robert F., and Fares Mousa Mitleb. Segmental features and
implementation of English by Arabic
speakers. Journal
of Phonetics, 11:219-229, 1983.
Port,
Robert, and Michael O'Dell. Neutralization of syllable-final voicing in German. Journal
of Phonetics,
13:455-471,
1985.
Port,
Robert , Keiichi Tajima, & Fred Cummins. Speech and rhythmic
behavior. In Geert J.P. Savelsburgh,
Han
van der Maas, and Paul C.L. van Geert, (Editors), Non-linear
Developmental Processes, pp.
53-78. Elsevier, Amsterdam, 1999.
Port,
Robert & Timothy van Gelder, editors. Mind as motion:
Explorations in the dynamics of cognition.
Bradford
Books/MIT Press, 1995
Saltzman,
Elliot and Kevin Munhall (1989) A dynamical approach to gestural
patterning in speech production. Ecological
Psychology 1, 333-382.
de
Saussure, Ferdinand.(1916).
Cours de linguistique générale. C. Bally & A.
Sechehaye, Paris.
Scheutz,
Matthias (1999) The Missing Link : Implementation and
Realization of Computations in Computer and Cognitive Science.
Unpublished dissertation, Cognitive Science and Computer
Science, Indiana University.
Selkirk,
Elisabeth. The
syllable. In Harry van der Hulst and Norval Smith, editors, The
Structure of
Phonological
Representations, Part 2,
pp. 337-383. Foris Publications, Dordrecht, 1982.
Tajima,
Keiichi and Robert Port. Speech
rhythm in English and Japanese. In John Local, editor, Papers in
Laboratory
Phonology VI. Cambridge
University Press, Cambridge, 1999.
Tajima,
Keiichi, Bushra A. Zawaydeh, and Mafuyu Kitahara. A comparative study
of speech rhythm in
Arabic,
English, and Japanese. In Proceedings of the XIVth International
Congress of Phonetic
Sciences,
pages 285-288, San Francisco, CA, 1999.
Tajima,
Keiichi. Speech
Rhythm in English and Japanese: Experiments in Speech Cycling.
Doctoral
dissertation,
Indiana University, 1998.
Treffner, Paul and Michael T. Turvey (1993) Resonance constraints on rhythmic movement. J. Expl Psych: Human Perception and Performance 19, 1221-1237.
van Gelder, Timothy, & Robert Port. It's about time: Overview of the dynamical approach to cognition. In
Robert Port and Timothy van Gelder, editors, Mind as motion: Explorations in the dynamics of cognition, pages 1-43. Bradford Books/MIT Press, 1995.
van Santen, J.P.H. (1996). Segmental duration and speech timing. In Yoshinori Sagisaka, Nick Campbell, & Norio Higuchi (Editors), Computing prosody: Computational models for Processing Spontaneous Speech. Springer Verlag, New York.
Wenk, B. J. & Wioland, F. Is French really syllable-timed? Journal of Phonetics, 10:193-216, 1982.
[1] The term dynamic can be used in many ways. Here, we mean a system in which quantitative variables change over time according to explicit rules.