ASR as Bayesian inference

The noisy channel model
- The noisy sentence we "observe" is the result of passing an underlying (source) sentence through a noisy channel.
- Given an observed sentence, we want to infer the most likely underlying sentence that generated it.
An observed sentence is a sequence of observations: `O = o_1, o_2, ..., o_t`.
An underlying (source) sentence is a sequence of words: `W = w_1, w_2, ..., w_n`.
The problem: `hat W = argmax_{W in L} P(W|O)`, where `L` is the set of all sentences in the language
Bayes' Law: `P(x|y) = (P(y|x)P(x))/(P(y))`
The problem rewritten (`P(O)` is the same for every candidate `W`, so it does not change the argmax and can be dropped):
`hat W = argmax_{W in L} P(W|O) = argmax_{W in L} (P(O|W) P(W)) / (P(O)) = argmax_{W in L} P(O|W) P(W)`
The language model: `P(W)`
The acoustic model: `P(O|W)`
Decoding (search): finding the `W` that maximizes `P(O|W) P(W)`
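
A minimal sketch of the decision rule, assuming a tiny hypothetical candidate set and made-up probabilities (real systems get `P(O|W)` and `P(W)` from trained models and search a far larger space). Scores are summed in log space to avoid underflow; the names `acoustic_likelihood`, `language_prior`, and `decode` are illustrative only.

```python
import math

# Hypothetical toy values, not taken from any real model.
# P(O|W): how well each candidate sentence explains the observations.
acoustic_likelihood = {
    "recognize speech": 1e-5,
    "wreck a nice beach": 2e-5,
}
# P(W): how plausible each candidate sentence is a priori.
language_prior = {
    "recognize speech": 1e-3,
    "wreck a nice beach": 1e-6,
}

def decode(candidates):
    """Return the candidate W that maximizes log P(O|W) + log P(W)."""
    def score(w):
        return math.log(acoustic_likelihood[w]) + math.log(language_prior[w])
    return max(candidates, key=score)

best = decode(list(acoustic_likelihood))
print(best)
```

Here the acoustic model slightly prefers the second candidate, but the language model prior outweighs that preference, so the first candidate wins.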

Feature extraction / signal processing

Acoustic waveform sampled into a set of overlapping frames of equal length (10, 15, or 20 ms)
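
The framing step can be sketched as below. The 20 ms frame length is one of the values listed above; the 10 ms shift between frame starts and the 16 kHz sample rate are assumptions made for illustration, and `frame_signal` is a hypothetical helper name.

```python
def frame_signal(samples, sample_rate, frame_ms=20, shift_ms=10):
    """Split samples into overlapping frames of equal length.

    frame_ms is the frame length; shift_ms (an assumed value) is the
    distance between the starts of consecutive frames.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames

signal = list(range(1600))  # 100 ms of dummy "audio" at 16 kHz
frames = frame_signal(signal, sample_rate=16000)
print(len(frames), len(frames[0]))
```

With these settings each frame holds 320 samples and shares half of them with the next frame.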
Each frame transformed to a vector of spectral features (around 39 in the final representation), starting with a discrete Fourier transform
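
To illustrate the Fourier step, here is a naive discrete Fourier transform of one frame, written directly from the definition. It is a sketch for clarity only: practical systems use an FFT and usually apply a window function first, neither of which is shown. The 64-sample sine frame and the name `dft_magnitudes` are made up for the example.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitude spectrum of a real frame, bins 0 .. n/2 (naive O(n^2) DFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
        mags.append(abs(total))
    return mags

n = 64
# A pure sine completing exactly 4 cycles within the frame.
frame = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
spectrum = dft_magnitudes(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
print(peak_bin)
```

The energy of the sine concentrates in the bin matching its frequency, which is what makes the spectrum a useful description of the frame.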
Frequencies mapped on a scale more representative of human audition: the mel scale
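
The notes do not give a formula for the mel scale; the sketch below assumes one commonly used approximation, `mel(f) = 2595 log10(1 + f/700)`. The function names are illustrative.

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (assumed common approximation)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (m / 2595.0) - 1.0)

low_step = hz_to_mel(2000) - hz_to_mel(1000)
high_step = hz_to_mel(8000) - hz_to_mel(7000)
print(low_step > high_step)
```

The same 1000 Hz step covers fewer mels at high frequencies than at low ones, mirroring the ear's coarser resolution at high frequencies.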
A final transformation, the cepstrum: the spectrum of the log of the normalized spectrum, which effectively separates source and filter in the signal
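
A simplified sketch of the cepstrum idea, assuming the plain "real cepstrum" (inverse DFT of the log magnitude spectrum). It omits the mel filterbank and normalization, and MFCC pipelines typically use a discrete cosine transform for the last step, so treat it as an illustration of the principle rather than the exact computation. `dft`, `real_cepstrum`, and the `floor` parameter are hypothetical names.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_cepstrum(frame, floor=1e-10):
    """Inverse DFT of the log magnitude spectrum of a real frame."""
    # floor guards against log(0) for empty spectral bins.
    log_mag = [math.log(max(abs(c), floor)) for c in dft(frame)]
    n = len(log_mag)
    # For real input the result is real; .real discards rounding residue.
    return [sum(log_mag[k] * cmath.exp(2j * cmath.pi * k * q / n)
                for k in range(n)).real / n
            for q in range(n)]

# A scaled impulse has a flat spectrum, so its cepstrum is a single spike.
impulse = [math.e] + [0.0] * 15
cepstrum = real_cepstrum(impulse)
print(round(cepstrum[0], 6))
```

Taking the log turns the product of source and filter spectra into a sum, which is why the two end up in different regions of the cepstrum.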
The resulting feature vector: cepstral coefficients, energy features, and change (delta) features for both the cepstral coefficients and the energy: MFCCs (mel frequency cepstral coefficients)
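
One way to arrive at around 39 features, assuming the common layout of 12 cepstral coefficients plus 1 energy value per frame, extended with first-order and second-order change features (13 x 3 = 39). The notes do not spell out this breakdown, and the central-difference formula and function names below are illustrative choices.

```python
def deltas(vectors):
    """Change features by central difference: d[t] = (v[t+1] - v[t-1]) / 2.

    The first and last frames reuse themselves as the missing neighbor.
    """
    last = len(vectors) - 1
    out = []
    for t in range(len(vectors)):
        prev_v = vectors[max(t - 1, 0)]
        next_v = vectors[min(t + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_v, next_v)])
    return out

def add_change_features(static):
    """Append first-order and second-order change features to each frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# Five dummy frames of 13 static features (12 cepstral + 1 energy, assumed).
static = [[float(t)] * 13 for t in range(5)]
features = add_change_features(static)
print(len(features[0]))
```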