Audio I/O In VR
B582 VR Hardware Presentation
Ying Feng, Feb. 2000
3. Voice In VR
3.1 Mechanics of Speech
3.2 Voice Recognition
3.3 Voice Synthesis
3.4 Pros and Cons

3.1 Mechanics of Speech
How humans utter sounds in speech:
-
Control of air generated by the lungs,
-
Flowing through the vocal tract,
-
Vibrating over the vocal cords,
-
Filtered by facial muscle activity,
-
And released out of the mouth and nose.
Speech waveform:
-
The brain directs the lungs to exhale enough air, coordinates the contraction
of the muscles that modify the airflow, processes the visual and auditory
inputs that affect speech, and sets the rhythm of speech.
-
The brain can signal the vocal cords (in the larynx) to vibrate or resonate,
which forms the glottal waveform.
-
The airflow can be restricted and altered by the positions of the facial muscles
around the oral cavity, which imposes many combinations of variations on the waveform.
-
Other organs, such as the ear, refine speech by providing sensory feedback
while the person is speaking.

3.2 Voice Recognition
Voice recognition is the technology by which sounds, words or phrases
spoken by humans are converted into electrical signals, and these signals
are transformed into coding patterns to which meaning has been assigned.
Common approaches to voice recognition:
-
Template matching:
-
Digitize the voice input into memory, then match it against stored digitized voice samples.
-
The simplest approach, with the highest accuracy (98%).
-
The most limited: speaker-dependent and requires training.
-
Feature analysis:
-
Process the voice input using "Fourier transforms" or "linear predictive
coding", then find characteristic similarities between the expected inputs and
the digitized voice input.
-
More flexibility, can handle accents, varying speed of delivery, pitch,
volume, and inflection; speaker-independent, no training needed.
-
Much harder to implement and less accurate (90-95%).
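The template-matching approach above can be sketched in a few lines. This is a minimal illustration, not a real recognizer: the `TEMPLATES` table, the per-frame energy values, and the function names are all hypothetical, and the sketch assumes every utterance has already been cut to the same number of frames (practical systems also have to align utterances in time).

```python
import math

def normalize(features):
    """Remove the mean and scale to unit length so loudness does not matter."""
    mean = sum(features) / len(features)
    centered = [x - mean for x in features]
    norm = math.sqrt(sum(x * x for x in centered)) or 1.0
    return [x / norm for x in centered]

def distance(a, b):
    """Euclidean distance between two equal-length feature sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(utterance, templates):
    """Return the label of the stored template closest to the utterance."""
    probe = normalize(utterance)
    return min(
        templates,
        key=lambda label: distance(probe, normalize(templates[label])),
    )

# Hypothetical per-frame energies "recorded" by one speaker during training.
TEMPLATES = {
    "yes": [0.1, 0.8, 0.9, 0.4, 0.1, 0.0],
    "no": [0.2, 0.3, 0.7, 0.9, 0.8, 0.3],
}

# A new utterance from the same speaker, slightly different from the template.
print(recognize([0.0, 0.7, 1.0, 0.5, 0.2, 0.1], TEMPLATES))
```

Because the stored samples come from one speaker, an utterance from a different voice may land far from every template, which is why this approach is speaker-dependent and needs training.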
For more details, refer to Jim Baumann's article Voice Recognition.
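As a rough illustration of the feature-analysis approach listed above, the sketch below applies a Fourier transform to one frame of digitized input and extracts a single feature, the strongest frequency. The sample rate, frame length, test tones, and function names are assumptions made for the example; a real recognizer would extract many features per frame and compare them with the expected inputs.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude of each DFT bin from DC up to the Nyquist frequency (naive O(n^2))."""
    n = len(samples)
    return [
        abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def dominant_frequency(samples, sample_rate):
    """Frequency in Hz of the strongest bin, ignoring the DC bin."""
    mags = dft_magnitudes(samples)
    peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])
    return peak_bin * sample_rate / len(samples)

SAMPLE_RATE = 8000  # Hz, assumed for the example
FRAME_LEN = 200     # 25 ms frame, so each bin is 40 Hz wide

# Synthetic "voice" frame: a strong 440 Hz tone plus a weaker 1200 Hz tone.
frame = [
    math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 1200 * t / SAMPLE_RATE)
    for t in range(FRAME_LEN)
]

print(dominant_frequency(frame, SAMPLE_RATE))
```

Features of this kind describe the sound itself rather than one stored recording, which is what lets the approach tolerate different speakers, at the cost of a harder matching problem.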

3.3 Voice Synthesis
Three basic techniques (in order of increasing complexity):
-
Waveform encoding:
-
Record phrases or words into memory and play them back.
-
Requires large storage space, but little additional hardware.
-
Useful for applications where only a small vocabulary is needed.
-
Analog formant frequency synthesis:
-
Sum the outputs of bandpass filters to simulate the various acoustic filters
formed by the oral cavity.
-
Allows a wider variety of sounds with less data storage.
-
Can produce unnatural, even unrecognizable, speech from text input.
-
Digital vocal tract modeling of speech:
-
Map the actions of the human vocal tract to equations, using LPC (linear
predictive coding) or PARCOR (partial autocorrelation) coefficients.
-
Greatly reduces memory requirements.
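The formant-synthesis idea listed above (summing bandpass filters that stand in for the resonances of the oral cavity) can be sketched digitally. Note that this is a software simulation, not the analog circuit the technique's name refers to. The formant frequencies are approximate textbook-style values for an "ah"-like vowel, and the bandwidths, pitch, sample rate, and function names are assumptions chosen for the example.

```python
import math

def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Two-pole bandpass resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, < 1
    theta = 2 * math.pi * center_hz / sample_rate        # pole angle
    a1, a2 = 2 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

def impulse_train(pitch_hz, n_samples, sample_rate):
    """Crude stand-in for the glottal waveform: one pulse per pitch period."""
    period = round(sample_rate / pitch_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def synthesize_vowel(formants, pitch_hz=100, n_samples=800, sample_rate=8000):
    """Feed the pulse train through one resonator per formant and sum the outputs."""
    source = impulse_train(pitch_hz, n_samples, sample_rate)
    bands = [resonator(source, f, bw, sample_rate) for f, bw in formants]
    mixed = [sum(values) for values in zip(*bands)]
    peak = max(abs(v) for v in mixed) or 1.0
    return [v / peak for v in mixed]  # normalized to the range [-1, 1]

# (center frequency Hz, bandwidth Hz) per formant; assumed, approximate values.
AH_FORMANTS = [(730, 80), (1090, 90), (2440, 120)]

samples = synthesize_vowel(AH_FORMANTS)
print(len(samples), "samples from", len(AH_FORMANTS), "formant settings")
```

A whole vowel is described here by three pairs of numbers instead of a stored recording, which is where the storage saving comes from; the crude source and fixed filters also hint at why the result can sound unnatural.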
For more details, refer to Glen Lee's article Voice Synthesis.
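To make the LPC and PARCOR terms from the vocal tract modeling technique more concrete, here is a minimal sketch using the autocorrelation method with the Levinson-Durbin recursion, one common way to compute both sets of coefficients. It is not taken from the referenced article. The test signal (the impulse response of a single assumed resonance), the function names, and the sign convention for the PARCOR values are assumptions; conventions differ between texts.

```python
import math

def autocorrelation(samples, max_lag):
    """r[lag] = sum over t of samples[t] * samples[t + lag]."""
    n = len(samples)
    return [
        sum(samples[t] * samples[t + lag] for t in range(n - lag))
        for lag in range(max_lag + 1)
    ]

def levinson_durbin(r, order):
    """Return (lpc, parcor, error) for the predictor
    x[n] ~ lpc[0]*x[n-1] + lpc[1]*x[n-2] + ..."""
    lpc = [0.0] * order
    parcor = []
    error = r[0]
    for i in range(order):
        # Reflection (PARCOR) coefficient for this stage.
        acc = r[i + 1] - sum(lpc[j] * r[i - j] for j in range(i))
        k = acc / error
        parcor.append(k)
        updated = lpc[:]
        updated[i] = k
        for j in range(i):
            updated[j] = lpc[j] - k * lpc[i - 1 - j]
        lpc = updated
        error *= 1.0 - k * k  # remaining prediction error energy
    return lpc, parcor, error

# Assumed test signal: 400 samples of a decaying 500 Hz resonance at 8 kHz.
A1 = 2 * 0.9 * math.cos(2 * math.pi * 500 / 8000)
A2 = -0.81
signal, y1, y2 = [], 0.0, 0.0
for i in range(400):
    y = (1.0 if i == 0 else 0.0) + A1 * y1 + A2 * y2
    signal.append(y)
    y1, y2 = y, y1

lpc, parcor, error = levinson_durbin(autocorrelation(signal, 2), 2)
print("LPC:", lpc)
print("PARCOR:", parcor)
```

In this toy case two coefficients are enough to describe all 400 samples of the resonance, which illustrates how modeling the vocal tract with equations reduces memory storage compared with storing the waveform itself.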

3.4 Pros and Cons
Importance of voice recognition and synthesis to VR:
-
Speech is the most rapid form of communication, far faster than typing.
-
Voice provides a fairly natural and intuitive way to control the simulation
while allowing the user's hands to remain free.
-
Users would gain the greatest feeling of immersion if they could use their
most common form of communication, the voice.
Difficulties in voice synthesis:
-
The interaction of all the possible combinations and variations of the
acoustic filters in the oral cavity complicates the task of synthesizing speech.
-
The tasks that appear very natural for the brain are very hard to capture
algorithmically.
Limitations of voice recognition as an input device:
-
Sometimes voice input decreases efficiency, e.g. when giving a command to
draw a line.
-
Because speech is imprecise, computers may have difficulty understanding
words in different contexts, or words that sound alike, especially in
continuous speech, e.g.
"You have an e-mail" may be understood as "You have any male."
-
Most systems require training, where the user has to repeat a word many
times in various ways to let the computer recognize the different patterns
of the sound.
-
Most systems are limited to a vocabulary in the hundreds of words.
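The "an e-mail" versus "any male" confusion mentioned above can be made concrete by comparing the two phrases sound by sound. The sketch below uses a standard edit distance; the phoneme transcriptions are hand-written approximations in an ARPAbet-like notation, assumed for illustration and not produced by any recognizer.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]

# Approximate phoneme sequences (assumed transcriptions).
AN_EMAIL = "Y UW HH AE V AH N IY M EY L".split()  # "you have an e-mail"
ANY_MALE = "Y UW HH AE V EH N IY M EY L".split()  # "you have any male"

WORDS_1 = ["you", "have", "an", "e-mail"]
WORDS_2 = ["you", "have", "any", "male"]

print("phoneme distance:", edit_distance(AN_EMAIL, ANY_MALE))
print("word distance:", edit_distance(WORDS_1, WORDS_2))
```

Under these transcriptions the two phrases differ by a single vowel out of eleven sounds, even though half of the words differ. In continuous speech, where word boundaries are not marked, that leaves very little acoustic evidence to choose between them without context.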
If you have comments or suggestions,
email me at yfeng@cs.indiana.edu