Some Background Material
Robert F. Port
1. Formal linguistics assumes a low-dimensional fixed phonetic segment inventory.
Halle, Morris (1954) The strategy of phonemics. Word 10, 197-209.
This is an early paper by Halle of some historical interest, in which some strikingly hopeful views about phonetics are voiced. He points out that most people assume ``we speak in a manner quite similar to the way in which we write'' -- that ``speech consists of sequences of discrete sounds which are tokens of a small number of basic types'' -- direct auditory analogs of graphical letters. However, acoustic analysis of speech sound has shown, he says, that speech is ``a continuous flow of sounds, an unbroken chain of movements.'' He hypothesizes (as he must) that the ``acoustical wave must contain clues which enable human beings to'' segment speech into discrete events. (I believed that myself for many years.) ``If we could state what these clues are, we could presumably build a machine to perform the same operation.'' Exactly. But unfortunately no such machine has been built in the succeeding half century. In 1954, it may have seemed a good bet that phoneticians and speech engineers would soon succeed. But in 2006, it is no longer reasonable to assume that a low-D description of the physical form of language that is speaker-independent and segmented into Cs and Vs will ever be discovered. We simply must give up the hope for an a priori phonetic description that is both physical and acoustic, yet also abstract and linguistic.
Haugeland, John (1985) Artificial Intelligence: The Very Idea. (Cambridge, MA: MIT Press).
Haugeland's goal was to clarify the
assumptions of formalist models of cognition. These basic
assumptions are shared across computing, logic, artificial intelligence
and formal models of language and cognition. Major proponents of this
view include Chomsky,
Newell and Simon,
Fodor, Pylyshyn, Pinker, etc. Computers are
carefully engineered so that they can be interpreted in terms of discrete states only. Programmers (unlike computer engineers) do not need to concern themselves with continuous time or continuous values of physical variables like voltage. Similarly, formal linguistics always assumes an a priori set of
letter-like phonetic tokens. Thus it must presume the
physical existence of a fixed alphabet of phonetic symbols for spelling
linguistic items, as well as a discrete computational apparatus for
token manipulation. All
linguistic theories build upon this assumption even though it is surely false, and the empirical claim of a universal phonetic alphabet has scarcely even been defended by phonologists. (I am sure there must have been attempts, but I'm not familiar with them.) Chomsky and Halle took it to be intuitively obvious that such a fixed and small set (of at most a few hundred) of universal phonetic tokens exists. So the
problem, I claim, is that no such low-dimensional description of
phonetics is possible (See Port and Leary, 2005). Therefore
formal linguistics is impossible.
2. Segments are not compatible with spectrographic representations.
IPA (1999) Handbook of the International Phonetic Association. Introduction (pp. 3-13, 27-30, 32-38). Cambridge: Cambridge University Press.
This is a thoughtful and wide-ranging chapter of the recent edition of the IPA Handbook.
Klatt, Dennis (1979) Speech perception: A model of acoustic phonetic analysis and lexical access. Journal of Phonetics 7, 279-312.
Klatt presents two generic models for recognizing speech in this important paper: LAFS and Scriber.
LAFS (``lexical access from spectra'') recognizes words specified in
raw acoustic form. No segment level at all! Given all the
variations due to
context, speaker, etc, this implies there will be many alternative stored forms for each speech chunk (whether word, phrase or what). Today, all effective speech recognition systems work basically this way, usually implemented as hidden Markov models on sequences of spectral shapes (specified in a rather large `alphabet' of spectral templates; see the sketch after this paragraph).
But Scriber works the other way (according to the insights of linguistics) since it converts spectra into abstract segmental, letter-like units, and then recognizes words and other speech chunks using the resulting transcriptions. Back in the 1980s, my Indiana University colleague
David Pisoni always liked this paper and thought that LAFS was a model
that deserved serious consideration. But I, as a linguist-phonetician,
thought that LAFS could not
possibly be correct. I remember being almost shocked that Dennis Klatt even suggested such a thing. ``Oh well,'' I thought, ``he is just an engineer. Psychological plausibility is not so important for engineers.'' This paper appeared in 1979, only 3 years after I got my PhD, and now, 30 years later, I
realize that the general approach of
LAFS was the right choice all
along! And it is, in fact, the only psychologically
plausible approach. The
Scriber approach, despite its
great intuitive appeal to any alphabet-literate person, is not
guaranteed to work.
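For readers unfamiliar with how LAFS-style recognition is engineered today, here is a minimal sketch (in Python, with invented toy numbers; this illustrates the general HMM idea only, not Klatt's LAFS or any real recognizer). Each word is modeled as a hidden Markov model over a small discrete `alphabet' of spectral-shape symbols, each candidate word is scored by its best state path (the Viterbi algorithm), and the best-scoring word wins. Note that no segmental transcription is ever computed.

    import numpy as np

    # Toy word models over a discrete 'alphabet' of 3 spectral-shape
    # symbols (0, 1, 2).  All probabilities are invented for illustration.
    WORDS = {
        "dee": {"start": np.array([1.0, 0.0]),
                "trans": np.array([[0.7, 0.3], [0.0, 1.0]]),
                "emit":  np.array([[0.8, 0.1, 0.1],     # state 0: consonant-like
                                   [0.1, 0.1, 0.8]])},  # state 1: /i/-like
        "doo": {"start": np.array([1.0, 0.0]),
                "trans": np.array([[0.7, 0.3], [0.0, 1.0]]),
                "emit":  np.array([[0.8, 0.1, 0.1],     # state 0: consonant-like
                                   [0.1, 0.8, 0.1]])},  # state 1: /u/-like
    }

    def viterbi_log_score(obs, hmm):
        """Log-probability of the best state path for a symbol sequence."""
        logt = np.log(hmm["trans"] + 1e-12)
        loge = np.log(hmm["emit"] + 1e-12)
        v = np.log(hmm["start"] + 1e-12) + loge[:, obs[0]]
        for sym in obs[1:]:
            v = np.max(v[:, None] + logt, axis=0) + loge[:, sym]
        return v.max()

    def recognize(obs):
        """Pick the word model that best explains the spectral sequence."""
        return max(WORDS, key=lambda w: viterbi_log_score(obs, WORDS[w]))

    print(recognize([0, 0, 2, 2]))  # -> dee
    print(recognize([0, 0, 1, 1]))  # -> doo

The point of the sketch is that the word models map directly from spectra to words; variation due to context or speaker is simply absorbed into the stored probability distributions.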
Huckvale, Mark (1997) 10 things engineers
have discovered about speech recognition. NATO ASI Workshop
on Speech Pattern Processing. pp. 1-5.
This paper has many surprises for
linguists. ``This paper is about what happens when you put
Performance first'' -- relative to Competence. Huckvale has many
humbling things to say to linguists.
Ladefoged, Peter (2004) Phonetics and phonology in
the last 50 years. Paper presented at `From Sound to Sense',
a conference at MIT.
This interesting paper on the history of
phonetics presents his concerns about the
gross mismatch between segmental
linguistic descriptions of
speech and the acoustic signal. He concludes rather disturbingly:
``phonologists have the problem of deciding whether they are describing
something that actually exists, or whether they are dealing with
epiphenomena, constructs that are just the result of making a
description.'' My conclusion, of course, is the latter, and I
think Ladefoged's was as well.
Ladefoged, Peter (2005) Features and parameters for different purposes. Paper presented to the Linguistic Society of America, 2005.
This paper presents Ladefoged's final views on phonetics and phonology before his passing in January, 2006. Using generally different arguments than mine, he concludes that a full description of speech requires far more degrees of freedom than Chomsky and Halle propose and that a grammar could not possibly be ``just something in some speaker's mind.'' Instead he proposes that the phonology of a language is ``a social institution,'' while ``the acts of speaking and listening all involve adjusting articulatory parameters not phonological features.'' On the other hand, he says, ``phonological features are best regarded as artifacts that linguists have devised in order to describe linguistic systems.'' Although he does not develop these ideas much beyond these statements, this appears to be essentially identical to my position: phonology is a social institution existing on a slow time scale and real-time language processing makes no use of abstract phonological descriptions at all.
3. Variation is endless when you look closely.
Labov, William (1963) The social motivation of a sound change. Word 19, 273-309.
Labov, William (1964) Phonological correlates of social stratification. American Anthropologist 66, 164-176.
The first paper is Labov's fascinating MA thesis and the second reviews the results of his dissertation on language variation on the Lower East Side of NYC.
Labov revealed vividly the subtlety in dialect variation. Of course, it is not that each individual speaks a different dialect, but in fact, we each speak in many styles differing in pronunciation detail. Our theories of language presume there is a linguistic state, something we could call `The English Language' or `Lower East Side English' -- in any case, some invariant Grammar.
But Labov's data show
us (in my view, not necessarily his) that no such
entities are definable in fact. Everything is always subject to
subtle phonetic variation. All we can know about a language
is distributional patterns in a high-D auditory/phonetic space.
Traditional theories of language presume that speaker and hearer share
some common Grammar. But the fact is they never ever
share the same grammar! And every word has an unlimited number of variants! So why do linguists imagine they could describe some language or other in discrete low-D terms?
Bybee, Joan (2002) Word frequency and context of use in the lexical diffusion of phonetically conditioned sound change. Language Variation and Change 14, 261-290.
Bybee and other variationists have explored the ways that the distribution of variants for any word is influenced by such properties as the frequency
of occurrence of each
individual word -- as well as many other factors.
If every word has a bazillion variants, what can it mean to say ``I
know Word X''? Apparently it must mean either (A) ``I know the
list of (essentially) all variants'' or, perhaps, (B) ``I am able to
evaluate utterances for their similarity to the
whole distribution of variants of Word X.'' This does not
resemble what generative phonologists are doing -- but they too think they are explicating what it means to know Word X.
4. Human memory, including memory for words, is far richer than we thought.
Palmeri, T. J., Goldinger, S. D., & Pisoni, D. B. (1993). Episodic encoding of voice attributes and recognition memory for spoken words. Journal of Experimental Psychology: Learning, Memory and Cognition 19, 309-328.
This paper presents powerful evidence from a `recognition memory' task that our memory for speech is not abstracted away from speakers' voices, as linguists and phoneticians tend to think (as claimed explicitly by Morris Halle, 1954 and 1985). We actually store lots of phonetic detail including speaker idiosyncrasies. This set of results troubled me for over a decade until I finally decided that I should try to consider that the data might actually mean what they seem to mean -- that an abstract phonological transcription of speech does not do the primary job of linguistic memory and does not limit what is stored in memory. This mental exercise is what led to the revolutionary theory being presented here. One implication of these results is that we store far more detailed information about speech than we linguists ever imagined. It also implies we store speech in a very concrete, auditory fashion -- not using abstract speaker-independent and context-free descriptors. Thus, a memory including specific examples (as well as prototypes and abstract specifications) is implicated. It turns out that lots of data from many areas of psychology have been implicating such representations for a long time.
Hintzman, Douglas (1986) ``Schema abstraction'' in a multiple-trace memory model. Psychological Review 93, 411-428.
This, the granddaddy of mathematical memory models, is called Minerva2, and is easier to understand than most since the math is simple linear algebra. You will see how a rich and detailed memory of concrete exemplars with no abstractions can perform better on training tokens yet still allow abstractions (and prototypes) to be computed on the fly whenever needed (a minimal sketch of the mechanism follows below). Many more recent models (e.g., by Nosofsky and Shiffrin) use different mathematics but retain a key role for the training exemplars to predict results.
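Here is a minimal sketch of the Minerva2 retrieval cycle (in Python; the feature dimensionality, noise level and trace count are invented for illustration). Memory is just a pile of stored exemplar vectors; a probe activates every trace in proportion to the cube of its similarity, and the activation-weighted sum of traces (the `echo') behaves like a prototype even though no prototype was ever stored.

    import numpy as np

    rng = np.random.default_rng(0)

    # A "prototype" word form as a vector of +/-1 features
    # (Hintzman used similar +/-1 feature vectors).
    prototype = rng.choice([-1, 1], size=50)

    def distort(vec, flip_prob=0.2):
        """Make a concrete exemplar: a noisy copy of the prototype."""
        flips = rng.random(vec.size) < flip_prob
        return np.where(flips, -vec, vec)

    # Memory is nothing but stored exemplars, one row per episode.
    memory = np.array([distort(prototype) for _ in range(100)])

    def echo(probe, memory):
        """Minerva2 retrieval: each trace responds in proportion to the
        cube of its similarity to the probe; the summed, weighted traces
        form the 'echo', an abstraction computed on the fly."""
        sims = memory @ probe / probe.size   # similarity in [-1, 1]
        acts = sims ** 3                     # cubing favors close traces
        intensity = acts.sum()               # familiarity signal
        content = acts @ memory              # blended, prototype-like trace
        return intensity, content

    intensity, content = echo(distort(prototype), memory)
    # The echo correlates strongly with the prototype even though the
    # prototype itself was never stored:
    print(np.corrcoef(content, prototype)[0, 1])

Cubing the similarities is Hintzman's device for letting close matches dominate the echo while keeping every stored trace in play.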
O'Reilly, Randall and Kenneth Norman (2002) Hippocampal and neocortical contributions to memory: Advances in the complementary learning systems framework. Trends in Cognitive Sciences 6, 505-510.
This paper presents a neural model for `episodic memory,' that is, memory for randomly cooccurring events (eg, the words you hear someone say to you, or the specific events that happened on the way to work this morning) -- a system that stores information on a single trial.
Gluck, M., M. Meeter and C. E. Myers (2003) Computational models of the hippocampal region: linking incremental learning and episodic memory. Trends in Cognitive Sciences 7, 269-276.
The point of these papers is that there are neurologically plausible mechanisms that can learn details of specific examples of complex, multimodal episodes on a single trial without repetitive training. These papers suggest that a surprising amount of rote material could be available in the memory of speaker-hearers. Remembering all examples forever is not plausible, but we apparently remember much richer detail -- including language detail -- than most of us imagined.
Any implications for the study of language? Of course.
Pisoni, D. B. (1997). Some thoughts on `normalization' in speech perception. In K. Johnson and J. Mullennix (Eds.), Talker Variability in Speech Processing (pp. 9-32). San Diego: Academic Press.
Lachs, Lorin, Kipp McMichael and D. B. Pisoni (2000) Speech perception and
memory: Evidence for detailed episodic encoding of phonetic
events. In J. Bowers and C. Marsolek (eds.)
Rethinking Implicit Memory
Oxford Univ Press).
Also in Research on Spoken Language Processing, 24, 149-167 (Psychology Dept, Indiana University).
Pisoni and colleagues draw the implications
for a theory of language of the fact that people use an
episode-like memory rather than an `abstractionist' model of speech
(like phones or
phonemes). Lachs et al. point out that the traditional
perspective on speech perception assumes that linguistic information
and contextual information are stored independently. Instead they
argue their data support a ``single complex memory system that retains
highly detailed, instance-specific information in a perceptual record
containing all of our experiences -- speech and otherwise.''
They are calling for a new kind of linguistics that lets go of the
assumption of discrete abstract units. Such a theory of language
is what I am now trying to initiate with this website.
Pierrehumbert, Janet (2001) Exemplar dynamics: Word frequency, lenition and contrast. In J. Bybee and P. Hopper (eds.) Frequency and the Emergence of Linguistic Structure (Amsterdam: John Benjamins).
She explains exemplar models (based on Hintzman) in her own way and addresses the question of how an exemplar model of memory might make predictions about speech production. She proposes that the perceived (and stored) distribution can be a direct model of production probabilities. So her algorithm for choice in speech production is simply this: the distribution of perceived tokens is treated as a probability distribution from which production targets are selected at random (a minimal sketch follows below). This is one of the first attempts by a linguist to consider the implications of detailed exemplar memory representations of speech.
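Here is a minimal sketch of that production loop (in Python; the phonetic dimension, the noise level and the lenition bias are invented for illustration, and her actual model adds frequency weighting and memory decay). Each production samples a stored exemplar as its target, each produced token is perceived and stored back into the cloud, and a small articulatory bias makes the whole distribution drift over time -- exemplar dynamics rather than a fixed grammar.

    import random

    # The exemplar cloud for one word along one phonetic dimension
    # (say, vowel duration in ms).  All numbers are invented.
    exemplar_cloud = [random.gauss(120.0, 10.0) for _ in range(200)]

    def produce():
        """Sample a stored exemplar as the production target, then add
        motor noise and a slight lenition (shortening) bias."""
        target = random.choice(exemplar_cloud)
        return target + random.gauss(0.0, 2.0) - 0.5

    def perceive_and_store(token):
        """Every token heard becomes a new exemplar in the cloud."""
        exemplar_cloud.append(token)

    for _ in range(1000):
        perceive_and_store(produce())

    # The mean of the cloud has drifted downward: gradual lenition.
    print(sum(exemplar_cloud) / len(exemplar_cloud))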
Johnson, Keith (2005, mspt) Decisions and mechanisms in exemplar-based phonology. UC Berkeley Phonology Lab Annual Report 2005. Johnson considers some issues that will trouble linguists about exemplar models of memory for language.
5. What modality are phonetic targets: articulatory, auditory or somatosensory?
Liberman, Alvin M., Franklin S. Cooper, Donald P. Shankweiler and Michael Studdert-Kennedy (1967) Perception of the speech code. Psychological Review 74,
431-461. Reproduced in J. Miller, R. Kent and B. Atal
(eds.) (1991) Papers in Speech
Communication: Speech Perception. (Acoustical Society of
America, Woodbury, New York).
A monumental review paper summarizing the Haskins Labs view of speech perception arguing for the `motor theory of speech perception'. This group thought that words must be specified in memory in low-dimensional, segmental and articulatory terms, so the di/du problem (the formant transitions cueing the /d/ are acoustically quite different before /i/ and /u/, yet we hear ``the same'' consonant) was major. But if auditory memory is very large, then speakers can just store context-sensitive formant trajectories and burst spectrum shapes. The fact that di and du begin with what we call ``the same sound'' may be something that is noticed by most speakers only after learning to write with an alphabet. At least this issue about what speech sounds ``sound the same'' needs more research.
Lindblom, Björn (2004) The organization of speech movements: Specification of units and modes of control. Paper presented at conference `From Sound to Sense' at MIT. A survey of speech production theories.
Diehl, Randy, Andrew Lotto and Lori Holt (2004) Speech perception. Annual Review of Psychology 55, 149-179. Presents the case that, no matter what, the basic representations of word targets must be auditory.
Galantucci, Fowler and Turvey (2006) The motor theory of speech perception reviewed. Psychonomic Bulletin and Review 13, 361-377.
Guenther, Frank (1995) Speech sound acquisition, coarticulation and rate effects in a neural network model of speech production. Psychological Review 102, 594-621.
Guenther, Frank and Joseph Perkell (2004) A neural model of speech production and supporting experiments. Paper presented at `From Sound to Sense', conference at MIT, June 2004. Available at http://www.rle.mit.edu/soundtosense/.
Speech sounds must have both convenient articulatory properties and practical auditory properties. Diehl, Lotto and Holt argue for a speech perception process using only general auditory capabilities (rather than unique specialized perceptual skills that are used for speech and nothing else). Clearly, my position is closer to theirs than to the traditional Haskins Labs view. Guenther's feedforward production system just produces gestures -- normally with no feedback. But the primary realtime guidance for them is based on somatosensory (ie, orosensory) feedback (from muscles, joints and skin surfaces). So the best answer to a question about realtime targets is: the targets are somatosensory, and neither gestural nor auditory (a toy sketch of this feedforward-plus-feedback control idea follows below). So I don't agree with Galantucci, Fowler and Turvey that ``perceiving speech is perceiving gestures'' (their claim number 2); it is just perceiving speech. But their claim 3 that ``the motor system is recruited for perceiving speech'' seems to be true and provides a coherent, reliable coding method for storing speech patterns. (See Stephen M. Wilson, 2004, Listening to speech activates motor areas involved in speech production. Nature Neurosci.)
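Here is a toy sketch of that control architecture (in Python; all gains and dynamics are invented, and this is only the bare feedforward-plus-feedback idea, not Guenther's actual model). A learned feedforward command gets the articulator near its somatosensory target, online feedback corrects the residual error, and the error signal also slowly retunes the feedforward command.

    # Somatosensory target and an imperfect learned feedforward command
    # (all values in arbitrary units, invented for illustration).
    somato_target = 1.0
    feedforward = 0.8
    gain = 0.5            # feedback gain
    position = 0.0        # articulator state

    for step in range(200):
        error = somato_target - position        # somatosensory error signal
        command = feedforward + gain * error    # feedforward plus feedback
        position += 0.3 * (command - position)  # sluggish articulator dynamics
        feedforward += 0.05 * error             # error slowly tunes feedforward

    # Both converge on the target: realtime control is somatosensory.
    print(round(position, 3), round(feedforward, 3))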
6. Short-term memory for words is auditory and motor, not abstractly `phonological.'
The issue regarding short-term memory (STM) is how words are stored for short periods of time -- under 20 sec or so. Baddeley uses the term `phonological loop' for the system that stores a small number of words for a few seconds. Linguists might assume that he means the same thing they mean when using the term ``phonological''. But his data show that this code is either an auditory code or an articulatory one -- but it is definitely not the kind of abstract, segmented, speaker-invariant code that linguists think of when they use this term. So the code is either motor or auditory (or both). Note that Baddeley claims the phonological loop has two parts -- a store or buffer (which exhibits confusion between words that sound similar even if articulated quite differently) and a (subvocal) motor representation used for rehearsal (which accounts for why Ss can remember fewer long words than short ones, ie, the `word-length effect'). A toy simulation of the word-length effect follows below.
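Here is a minimal sketch of the standard interpretation (in Python; the 2-second trace lifetime and the word durations are illustrative round numbers, not Baddeley's measurements). A trace fades roughly two seconds after its last rehearsal, and subvocal rehearsal cycles through the list at about the words' spoken durations, so the span is set by how many words fit through the rehearsal loop before the decay clock runs out.

    TRACE_LIFETIME = 2.0   # seconds a trace survives without a refresh

    def span(word_duration):
        """Largest list length whose rehearsal loop beats the decay clock:
        rehearsing n words takes roughly n * word_duration seconds."""
        n = 1
        while (n + 1) * word_duration <= TRACE_LIFETIME:
            n += 1
        return n

    print(span(0.3))   # short words: about 6 can be kept alive
    print(span(0.7))   # long words: only about 2 -- the word-length effect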
Baddeley, Alan D. (2002) Is working memory still working? European Psychologist 7, 85-97.
In this paper Baddeley extends his earlier model in a useful way by adding an `episodic buffer' -- a store for information from many modalities that is retained for a short time.
7. Why are letters so convincing despite all the evidence? Because of our education.
Olson, David R. (1993) How writing represents speech. Language and Communication 13, 1-17. This article is a revised version of Chapter 4 in his outstanding and highly recommended book: The World on Paper: The Conceptual and Cognitive Implications of Writing and Reading (Cambridge: Cambridge University Press, 1994). He points out that various human notions of what
language is are invariably based on whatever we represent with our
writing system. So the idea that our alphabetical writing is a
`cipher' (one-to-one replacement) from phonemes into graphemes has
reversed the true order! We linguists tend to believe our speech is made from phonemes primarily because we use letters to represent speech.
Faber, Alice (1992) Phonemic segmentation as epiphenomenon: Evidence from the history of alphabetic writing. In P. Downing, S. Lima and M. Noonan (eds.) The Linguistics of Literacy (Amsterdam: John Benjamins).
Faber anticipated the argument I am making here by over a decade, yet her paper has been largely ignored, as far as I can tell. She argues that languages seem to be composed from letter-like segments primarily because we have all been trained in alphabetic literacy.
These papers show quite clearly that a segmental description of speech (in terms, that is, of consonants and vowels) may be intuitively persuasive for us. But it is not universal -- indeed most humans, through our history, have not had such intuitions. It appears that some amount of alphabet training is necessary for people to hear speech in terms of Cs and Vs.
Ziegler, Johannes and Usha Goswami (2005) Reading acquisition, developmental dyslexia and skilled reading across languages: A psycholinguistic grain size theory. Psych Bulletin 131, 3-29.
This review paper strongly endorses the view that the alphabetical analysis of speech (using phonemes or letters) is a result of learning how to do reading and writing. It also demonstrates that the inconsistencies in the spelling of English greatly complicate the problem of learning to read. Children taught to read Finnish or Italian can read new words as well after one year of school as English-speaking children can do after 3 years of school. Of course, the latter's practical reading skill may still be better than that of a child learning to read Chinese after the same amount of schooling.
Studies of reading reveal clearly how an understanding
of language in terms of phonemes is almost entirely
a consequence of learning how to read with an alphabet. This
direction of causality agrees
with my general story that segmental descriptions of speech are shaped
and reinforced by extensive practice at reading and writing. But this result is in serious conflict with the conventional views of linguists and psychologists of language. The traditional view insists that learning a language and remembering vocabulary are skills that depend upon letter-like phonemes. Some set of letter-like units is said to be available in advance to all members of the species. On this view, all speakers, whether literate or not, should store language using a universal phonetic alphabet. But the data do not support any such predictions.
Of course, there is the question of how the alphabet could have been invented if phonemes lack this intuitive basis. But it should be kept in mind that the early Greek alphabet was the result of roughly 3000 years of efforts in the Middle East to engineer a culturally transmittable writing system that was easy to teach and use. And the alphabet was invented only once -- as has been pointed out many times. I would argue that those Greek `writing engineers' (standing on the shoulders of earlier Middle-Eastern writing engineers) actually invented the phoneme concept itself. The phoneme is the ideal, the conceptual archetype, of which all alphabetical orthographies are just imperfect realizations.
Robert F. Port, April 9, 2006.