Coordinative Structures for the Control of Speech Production

Robert Port, Indiana University
November 1, 2007

In the 70s and 80s a new approach to the study of speech production evolved that attempts to relate models of motor behavior in general to the control of the speech articulators (Fowler et al., 1978; Kelso & Tuller, 1983; etc.). The key concept in this approach is the "coordinative structure," which will be sketched out briefly here. To appreciate the concept, however, it is important to consider first the practical difficulties involved in the control of movement and in the use of feedback to achieve it.
Controlling many degrees of freedom. How could a "linguistic executive" deal with all the muscular degrees of freedom? The difficulty is that in controlling, for example, a human arm to reach and grasp something, there are a number of different joints, each of which can be set to a range of angles, as well as many different muscles controlling each joint. If the actor finds that his arm is too far to the left, he could adjust any of a number of different joint angles to correct it, and so must choose among a large number of different muscles. In speech exactly the same problem arises, since there are many ways to raise the tongue or to achieve the acoustic effect of lip rounding. It is implausible that the "speech executive" itself directly controls all those muscles on a moment-by-moment basis. Instead, we may imagine that the linguistic executive controls a much smaller number of parameters directly and permits a lower-level system to take care of the muscular details. In short, we need to imagine a system in which speech control directly manipulates only a small number of degrees of freedom.
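To make the redundancy concrete, here is a minimal sketch of my own (the link lengths and the target point are invented for illustration) of a planar arm with three joints reaching for a single point in the plane. Because the arm has more joint angles than the task has spatial dimensions, a whole family of different joint configurations places the fingertip on the same target.

    import math

    # Hypothetical link lengths (metres) and reaching target; numbers are invented.
    L1, L2, L3 = 0.30, 0.25, 0.15
    TARGET = (0.45, 0.20)

    def fingertip(a1, a2, a3):
        """Forward kinematics: relative joint angles (radians) -> fingertip (x, y)."""
        x = L1 * math.cos(a1) + L2 * math.cos(a1 + a2) + L3 * math.cos(a1 + a2 + a3)
        y = L1 * math.sin(a1) + L2 * math.sin(a1 + a2) + L3 * math.sin(a1 + a2 + a3)
        return x, y

    def two_link_ik(px, py, l1, l2):
        """Planar two-link inverse kinematics (one of its two solutions), or None."""
        c = (px * px + py * py - l1 * l1 - l2 * l2) / (2 * l1 * l2)
        if abs(c) > 1.0:
            return None                      # point out of reach for this sub-chain
        q2 = math.acos(c)
        q1 = math.atan2(py, px) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
        return q1, q2

    # Sweep the shoulder angle: for each value, the remaining two links can often
    # still reach the target, giving many distinct solutions for one goal.
    solutions = []
    for deg in range(-20, 71, 10):
        a1 = math.radians(deg)
        elbow = (L1 * math.cos(a1), L1 * math.sin(a1))
        ik = two_link_ik(TARGET[0] - elbow[0], TARGET[1] - elbow[1], L2, L3)
        if ik is None:
            continue
        abs2, rel3 = ik                      # absolute angle of link 2, relative angle of link 3
        a2, a3 = abs2 - a1, rel3
        x, y = fingertip(a1, a2, a3)         # sanity check against the target
        solutions.append((deg, round(math.degrees(a2)), round(math.degrees(a3)),
                          round(x, 3), round(y, 3)))

    print(f"{len(solutions)} different joint configurations reach the same target {TARGET}:")
    for s in solutions:
        print("  shoulder %4d deg, elbow %4d deg, wrist %4d deg -> fingertip (%.3f, %.3f)" % s)

Each printed line is an equally good "answer" to the same reaching goal, which is exactly the embarrassment of riches an executive controlling every muscle would face.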
The ambiguity of acoustic feedback. Not only is there a practical problem in directly controlling the muscles during speech production; even if acoustic feedback were available to guide the motion of the tongue, there is still a big problem. The feedback will not, in general, be sufficient to tell the speaker which muscles to move. A servosystem such as a thermostat, of course, uses feedback from sensory information about temperature ("too cold" vs. "not too cold") to control the motor action of turning a switch on or off, but using sensory information in most human action is much more difficult. The reason is that, while the thermostat has one degree of freedom of sensory information and one degree of freedom of action, human bodies have so many degrees of freedom that control by this method quickly becomes implausible. If the F2, let us say, is too high for the vowel intended, which muscles should be adjusted: the tongue body? the degree of rounding or protrusion of the lips? the larynx position? The acoustic feedback itself is generally insufficient to tell the motor control system which articulator or combination of articulators is out of position. To find the appropriate correction, the system would have to try out a vast number of combinations of gestures. So we would have to postulate another very complex layer of processing and interpretation in order to employ acoustic feedback in speech production. It appears, then, that speakers need a control structure for speech that will operate pretty much by itself, using a minimum of closed-loop control.
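The point can be made with a deliberately crude toy model; nothing below is a real acoustic model of the vocal tract, and the sensitivity numbers are invented. If F2 depends on several articulatory parameters at once, then the single error signal "F2 is 80 Hz too high" is compatible with many different corrections, so the feedback cannot by itself pick out the articulator to adjust.

    # Toy model only: suppose F2 depends (linearly) on three articulatory parameters.
    # Hypothetical sensitivities: change in F2 (Hz) per unit change of each parameter.
    SENSITIVITY = {
        "tongue body backing": -400.0,   # backing the tongue lowers F2
        "lip rounding":        -300.0,   # rounding/protrusion lowers F2
        "larynx lowering":     -150.0,   # lengthening the tract lowers F2
    }

    f2_error = +80.0   # produced F2 is 80 Hz above the intended value

    print("Single-articulator corrections that each cancel the same F2 error:")
    for articulator, slope in SENSITIVITY.items():
        correction = -f2_error / slope
        print(f"  adjust {articulator:22s} by {correction:+.2f} units")

    # ...and infinitely many mixtures of the three do the job as well, e.g.
    # half of one correction plus half of another.
    print("  or: half tongue backing + half lip rounding (and countless others)")

The acoustic error alone gives no basis for choosing among these corrections; something else must already know how to distribute the work.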
Considerations such as these lead one to look for a model of speech production (and of other types of skilled action) that is as "open loop" as possible, since this would require much less feedback about the action as a whole. We should imagine a system that can take care of itself in achieving particular motor acts when summoned to achieve them. Such a system should require:
  1.  a small number of parameters of control (degrees of freedom of input),
  2.  using only internal (spinal or brainstem-based) feedback information about muscle and joint positions (but not auditory or visual feedback that would come via the cerebral cortex), and
  3.  constraints on a very large number of muscles to get some particular job done over a wide range of contextual conditions.
The Coordinative Structure (CS)
The model employed for dealing with such a production problem is called a synergism or coordinative structure. Although it is a kind of "software product" and represents skill acquisition, a hardware model illustrating some of the right properties would be a heavily damped mass-spring system. The spring has a neutral length and will lengthen or shorten, as necessary, to return to it if displaced in either direction. To give the system some inertia, we should imagine a mass attached to the spring. If the system were not damped, it would tend to oscillate, so to supply the damping we may imagine the spring pushing the mass through a highly viscous medium.
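As a concrete picture of this hardware analogy, here is a minimal numerical sketch of my own (the constants are arbitrary) of a heavily damped mass-spring system: wherever the mass starts, it settles at the spring's rest length, and resetting the rest length is enough to move the equilibrium, with no trajectory planned anywhere.

    # Heavily damped mass-spring: m*x'' = -k*(x - rest_length) - b*x'.
    def settle(x0, rest_length, m=1.0, k=10.0, b=12.0, dt=0.001, t_end=10.0):
        """Integrate the damped spring with simple Euler steps; return the final position."""
        x, v = x0, 0.0
        for _ in range(int(t_end / dt)):
            a = (-k * (x - rest_length) - b * v) / m
            v += a * dt
            x += v * dt
        return x

    # Same rest length, three different starting positions: same end point.
    for start in (-1.0, 0.0, 2.5):
        print(f"start {start:+.1f} -> settles at {settle(start, rest_length=1.0):+.3f}")

    # Resetting the rest length (the CS-style "command") moves the equilibrium.
    print(f"new rest length 0.3 -> settles at {settle(2.5, rest_length=0.3):+.3f}")

The equifinality on display here (same end state from any start) is the property the coordinative structure borrows from the spring.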
Now imagine a hinged arm in such a medium controlled by not one but six different muscles that attach to the arm at different places and at different angles. A coordinative structure for controlling this might be a constraint that says "the angle of the arm should be phi." A tantalizing property of this approach is that there is no place outside the structure where the lengths of the six muscles necessary to achieve this angle are specified. The external control says "achieve angle phi," but how to achieve that with this jumble of muscles is specified only within the coordinative structure itself. Furthermore, since the `command' is simply to reach a certain target angle, it is invariant across the various contexts in which it might be issued - that is, the command is the same whether the arm will actually move to the right or to the left. Nor should one imagine that there is some formula by which the nervous system (or the scientist) could calculate the correct excitation of the muscles, or the forces they should generate, or the durations of the various time intervals that will result in the achievement of phi. The CS could simply reset the resting lengths of the set of muscles so that the effort to reach that configuration is immediately undertaken. Thus, even though it is not clear what variety there might be in the control parameters of coordinative structures, this very primitive model shows at least how decentralization of control could be achieved by treating a complex of muscles as a spring whose rest length can be adjusted by external command.
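Here is one way such a structure could be sketched in code - a toy of my own construction, not a model from the literature. The only externally visible parameter is the target angle phi; the six muscle-like springs, their stiffnesses, and their rest lengths live entirely inside the structure, which resets those rest lengths itself whenever a new angle is commanded.

    import math

    class SixMuscleJoint:
        """A hinged joint pulled by six muscle-like springs; all values are invented."""
        STIFFNESS  = [200.0, 150.0, 300.0, 100.0, 250.0, 180.0]    # spring constants (N/m)
        MOMENT_ARM = [+0.04, +0.03, -0.05, -0.02, +0.06, -0.03]    # metres; sign = pull direction
        SLACK      = [0.10] * 6                                    # muscle lengths at theta = 0

        def __init__(self):
            self.theta = 0.0                 # joint angle (radians)
            self.omega = 0.0                 # angular velocity
            self.rest  = list(self.SLACK)    # current rest lengths (internal to the CS)

        def command(self, phi):
            """The ONLY external degree of freedom: 'achieve angle phi'."""
            # The structure resets every muscle's rest length so that the torques
            # balance exactly when theta == phi; no individual muscle value is
            # ever specified from outside.
            self.rest = [s - r * phi for s, r in zip(self.SLACK, self.MOMENT_ARM)]

        def step(self, dt=0.001, inertia=0.01, damping=0.6):
            """One Euler step of the heavily damped joint dynamics."""
            torque = 0.0
            for k, r, s, rest in zip(self.STIFFNESS, self.MOMENT_ARM, self.SLACK, self.rest):
                length = s - r * self.theta          # muscle shortens as the joint rotates
                torque += k * r * (length - rest)    # spring-like restoring torque
            torque -= damping * self.omega
            self.omega += (torque / inertia) * dt
            self.theta += self.omega * dt

    joint = SixMuscleJoint()
    joint.command(math.radians(30))          # external command: "angle should be 30 degrees"
    for _ in range(5000):
        joint.step()
    print(f"joint settled at {math.degrees(joint.theta):.1f} degrees")

Because the command is just a target angle, the same call works whatever the joint's starting position, echoing the context-invariance of the command noted above.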
There are a number of general properties of this style of control.
1) Functional definition, not anatomical. The mass-spring model given just above is misleading in that it implies that the CS is defined by a particular joint and its controlling muscles. Actually, although a CS should behave roughly like the muscle group above, the muscles are organized on a temporary basis for the achievement of a particular skilled activity. That is, the same muscles play a role in a very large number of different CS's, each specialized to achieve a particular kind of act and called upon when needed. Thus, we may imagine the muscles of the vocal tract to be organized into one system of CS's when chewing, another when drinking a fluid, another when pronouncing an English [p], and yet another when producing a French [p] (assuming competence in both languages).
2) Multiple muscles, multiple joints. The above model is also misleading in that it deals only with a single joint, but most coordinative structures must adjust several component gestures that may have trading relationships with each other. Thus, to achieve lip closure one needs a "sufficient combination" of upper lip lowering, lower lip raising, and jaw raising. This is much more complex than a combination of muscles setting the position of one joint, but it is the more general case. To combine joint motions correctly on the fly, the CS employs sensory information fed back from, e.g., muscle spindles in the involved muscles and joints, so that appropriate combinations can be achieved. Notice that this kind of feedback is (a) unambiguous about what correction is needed (at least after the system has been trained up with much practice) and (b) very fast relative to visual or auditory feedback, which requires cortical processing (e.g., 5-15 ms vs. 200-300 ms).
    One consequence of these trading relationships is that individual muscles can trade off. That is, in successive repetitions of a skilled gesture, one may find that the work is being done by varying combinations of muscle activity. Notice that this `motor equivalence' implies that when one looks at motor activity below the level of the CS, the complexity and variability increase. So the simplest description of the skilled gesture may turn out to be the one provided by the `goals' of the CS. Looking for more detail on a finer time scale should result in a description of increasing complexity.
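A toy simulation (again my own invention, with made-up units and weights) can display this motor equivalence: the only goal passed to the routine below is "reduce the lip aperture to zero," and from trial to trial the closure is achieved by different mixtures of upper-lip, lower-lip, and jaw movement, including a trial in which the jaw is blocked altogether.

    import random

    random.seed(1)

    def close_lips(jaw_blocked=False, n_steps=200, gain=0.2):
        """Drive total lip aperture to zero using fast 'proprioceptive' feedback."""
        aperture = 10.0                          # mm of opening to remove (the only goal)
        contrib = {"upper lip": 0.0, "lower lip": 0.0, "jaw": 0.0}
        # per-trial differences in how willing each articulator is to move
        weights = {name: random.uniform(0.5, 1.5) for name in contrib}
        if jaw_blocked:
            weights["jaw"] = 0.0                 # bite block: the jaw cannot help
        total_w = sum(weights.values())
        for _ in range(n_steps):
            error = aperture                     # remaining opening
            for name, w in weights.items():
                move = gain * error * w / total_w
                contrib[name] += move
                aperture -= move
        return aperture, contrib

    for trial in range(3):
        left, c = close_lips(jaw_blocked=(trial == 2))
        mix = ", ".join(f"{k} {v:.1f} mm" for k, v in c.items())
        print(f"trial {trial}: residual opening {left:.2f} mm  ({mix})")

The goal-level description ("aperture goes to zero") stays constant and simple across trials, while the muscle-level description varies - exactly the pattern described above.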
3) Cyclic structure. Another general property of such a control system is its inherent tendency to behave periodically. It seems that cyclic events are a natural form for control. One primary reason, perhaps, is that a spring-like system is a simple way to build dynamics into the system itself. Another reason may be that periodic gestures can easily be nested within each other into hierarchical structures - a well-known property of linguistic structures. Thus we may imagine speech motor behavior, very schematically, as a sequence of oscillator-like breath-group cycles, within which are a string of oscillator-like syllable cycles, within which are highly damped half-cycles for consonant gestures.
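Purely as a schematic (the periods and proportions below are invented, not measurements), the nesting can be written down directly: a breath-group cycle contains a run of syllable cycles, and each syllable opens with a brief, heavily damped consonant gesture.

    # Schematic nesting of cycles; all durations are illustrative only.
    BREATH_PERIOD = 2.4       # seconds per breath group
    SYLL_PERIOD   = 0.4       # seconds per syllable cycle
    CONS_FRACTION = 0.2       # consonant gesture occupies the first 20% of a syllable

    def timeline(n_breath_groups=1):
        """Return (time, label) pairs for the nested cyclic structure."""
        events = []
        for b in range(n_breath_groups):
            t_breath = b * BREATH_PERIOD
            events.append((t_breath, "breath-group onset"))
            for s in range(int(BREATH_PERIOD / SYLL_PERIOD)):
                t_syll = t_breath + s * SYLL_PERIOD
                events.append((t_syll, "  syllable onset"))
                events.append((t_syll, "    consonant gesture begins"))
                events.append((t_syll + CONS_FRACTION * SYLL_PERIOD,
                               "    consonant gesture ends (vowel continues)"))
        return events

    for t, label in timeline(1):
        print(f"{t:5.2f} s  {label}")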
4) Topological invariance is the fourth property of the CS. It refers to the apparent tendency of a CS to specify the relative time within a gesture cycle at which another event occurs. That is, each CS may be thought of as producing a damped cyclic event occurring within the context of another periodic event at the next higher (that is, temporally longer) level. Thus, in bipedal locomotion we may ask when the onset of leg raising, R, occurs relative to the duration of a complete step cycle, the interval AB. Topological invariance means that it will occur at a fixed fraction of the duration of the whole step. Thus the ratio AR/AB (that is, the phase angle of the whole step cycle at which R occurs) will be constant despite major changes in the overall duration of AB, that is, despite changes in the pace of locomotion. In speech, the counterpart of the step cycle AB is believed to be the vowel gesture (or syllable gesture), and the analogue of leg raising might be the initiation of a syllable-final consonant gesture (Fowler, 1983). Such a structural property allows combinations of CSs to be invariant over changes in the overall rate of the act.
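Numerically, topological invariance just says that the phase AR/AB is held fixed while the absolute durations change with rate; a few lines make the point (the 0.62 phase value below is invented for illustration, not a measured quantity).

    RELATIVE_PHASE = 0.62          # hypothetical: R occurs 62% of the way through AB

    for step_duration in (1.20, 0.90, 0.60, 0.45):    # seconds per step, slow -> fast
        ar = RELATIVE_PHASE * step_duration            # absolute time of R within the cycle
        print(f"AB = {step_duration:.2f} s  ->  AR = {ar:.3f} s,  AR/AB = {ar / step_duration:.2f}")

The absolute interval AR shrinks as the pace increases, but the ratio AR/AB never changes: that ratio is what the CS is claimed to control.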

    How might such organizational units be brought to bear on the problem of speech? And why should linguists care about such structures if they are mere low-level motor strategies? Is there any reason to suppose that language itself is constrained by this style of control? I think there is good reason to think so. If we imagine speech as controlled by such a system of decentralized subsystems rather than by a single central executive that issues "commands to muscles," then we can begin to deal directly with the role of linguistics in speech production. After all, the CSs for speech will have much in common with phonetic features (but perhaps not the more abstract phonological features). Without taking responsibility for the skills involved in speech production, linguistics must continually foist off the motor problem onto someone else.


References

Kent, Ray (1983) The segmental organization of speech. In Peter MacNeilage (ed.) Speech Production (Springer-Verlag; New York), pp. 57-89.

Fowler, Carol, Philip Rubin, Robert Remez & Michael Turvey (1978) Implications for speech production of a general theory of action. In B. Butterworth (ed.)

Kelso, J. Scott, Betty Tuller & Katharine Harris (1983) A "dynamic pattern" perspective on the control and coordination of movement. In Peter MacNeilage (ed.) Speech Production (Springer-Verlag; New York), pp. 137-173.

Fowler, Carol (1983) Converging sources of evidence on spoken and perceived rhythms of speech: Cyclic production of vowels in sequences of monosyllabic feet. Journal of Experimental Psychology: General.

Kelso, J. Scott & Betty Tuller (1983) A dynamical basis for action systems. In M. S. Gazzaniga (ed.) Handbook of Cognitive Neuroscience (Plenum; New York). 

Studdert-Kennedy, Michael (1983) Perceiving phonetic events. In W. H. Warren & R. E. Shaw (eds.) Persistence and Change: Proceedings of the First International Conference on Event Perception (Erlbaum Associates; Hillsdale, NJ).