Parts-based Models and Local Features for Automatic Speech Recognition

This page describes my PhD thesis, Parts-based Models and Local Features for Automatic Speech Recognition, which was completed in May 2009. The work was done at MIT in the Spoken Language Systems (SLS) group within the Computer Science and Artificial Intelligence Laboratory (CSAIL). My thesis advisor was Jim Glass.

For people with some familiarity with speech recognition and spectrograms, the "visual synopsis" below gives a quick overview of the main ideas presented in the thesis. The standard abstract follows it. I'd be happy to take any comments or questions.

Full document

K. Schutte, Parts-based Models and Local Features for Automatic Speech Recognition. MIT Department of Electrical Engineering and Computer Science, June 2009.
Download PDF

Visual synopsis

The high-level approach

We seek to model phonetic units as deformable templates of spectro-temporally localized cues rather than modeling them as sequences of fixed spectral profiles.

Parts-based models

Following work in machine vision, we use "parts-based" models (PBMs) to implement these ideas. Phonetic units are encoded with graphical models: the "parts" (individual phonetic cues) are nodes, and the dependencies between them are edges.
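As a rough illustration of the node-and-edge idea (a minimal sketch, not the thesis implementation), a model can be stored as a set of per-part score maps plus a list of edges, and a candidate placement of the parts can be scored as the sum of local detector scores minus a deformation cost on each edge. The `Spring` class, the `config_score` function, the part names, and the quadratic cost are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class Spring:
    """Hypothetical edge between two parts (not a structure taken from the thesis)."""
    parent: str
    child: str
    offset: Tuple[float, float]     # preferred (freq, time) displacement of child from parent
    stiffness: Tuple[float, float]  # penalty weight per axis; a quadratic cost is assumed


def config_score(
    local_scores: Dict[str, Sequence[Sequence[float]]],
    springs: List[Spring],
    locations: Dict[str, Tuple[int, int]],
) -> float:
    """Score one placement of all parts.

    local_scores[name][f][t] is how well part `name` matches at (freq f, time t).
    locations[name] is the (f, t) position chosen for that part.
    """
    total = 0.0
    # Node terms: how well each part matches where it was placed.
    for name, (f, t) in locations.items():
        total += local_scores[name][f][t]
    # Edge terms: penalize stretching each spring away from its preferred offset.
    for s in springs:
        pf, pt = locations[s.parent]
        cf, ct = locations[s.child]
        df = (cf - pf) - s.offset[0]
        dt = (ct - pt) - s.offset[1]
        total -= s.stiffness[0] * df ** 2 + s.stiffness[1] * dt ** 2
    return total
```

Recognition would then amount to searching for the placement with the highest score under each candidate model; this sketch only evaluates a single given placement.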

An early example of parts-based modeling in vision: detecting a face as a collection of individual parts (eyes, nose, etc.). The springs represent "deformable" relative locations between parts. [Fischler and Elschlager, IEEE Trans. on Comp., 1973]

Applying the parts-based approach to speech, the parts represent distinct phonetic cues. Shown above is an example for the spoken letter "B" (diphone /b iy/).

(a) An example "B" spectrogram with known phonetic cues highlighted.

(b) Encoding these cues in a PBM with simple "time-frequency patch" detectors for each cue. The T-F location where each part occurs is flexible: thus connections are shown as springs.
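To make the "patch detector plus spring" idea concrete, here is a small sketch under stated assumptions: the spectrogram is a NumPy array indexed as (frequency, time), each detector is a plain correlation of a patch with the spectrogram, and the spring is a quadratic penalty with a single stiffness value. The function names `patch_score_map` and `best_two_part_placement` are hypothetical, and the exhaustive search is only practical for tiny inputs.

```python
import numpy as np


def patch_score_map(spec: np.ndarray, patch: np.ndarray) -> np.ndarray:
    """Correlate `patch` with `spec` (freq x time) at every valid position."""
    # windows has shape (F - pf + 1, T - pt + 1, pf, pt)
    windows = np.lib.stride_tricks.sliding_window_view(spec, patch.shape)
    return np.einsum("ijkl,kl->ij", windows, patch)


def best_two_part_placement(map_a, map_b, offset, stiffness):
    """Exhaustively place two parts, trading detector score against spring stretch.

    offset is the preferred (freq, time) displacement of part B from part A.
    Returns (best_score, location_a, location_b).
    """
    best = (-np.inf, None, None)
    for fa, ta in np.ndindex(map_a.shape):
        for fb, tb in np.ndindex(map_b.shape):
            df = (fb - fa) - offset[0]
            dt = (tb - ta) - offset[1]
            score = map_a[fa, ta] + map_b[fb, tb] - stiffness * (df ** 2 + dt ** 2)
            if score > best[0]:
                best = (score, (fa, ta), (fb, tb))
    return best
```

With a stiff spring the second part is pulled toward its preferred offset even if a stronger detector response exists elsewhere; with a loose spring the detector responses dominate.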

"Speech schematic" models

We present a variant of PBMs of speech, a "speech schematic" model, which directly encodes a "cartoon" version of a typical spectrogram. We argue that these models could offer a better generative model of speech than the typical HMM/MFCC models.

Shown above is a schematic model of the letter "H". Major onsets, offsets, and formant transitions are modeled with simple edge-detector parts. Edges between parts encode the constraints of how these parts are configured relative to each other.
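A simple edge-detector part for onsets and offsets might look like the sketch below. It assumes a (frequency, time) log-energy spectrogram and compares the mean energy in a few frames after each frame boundary with the mean in the same number of frames before it. The function name `temporal_edge_maps` and the `width` parameter are assumptions for illustration; formant transitions would need oriented detectors, which this sketch does not cover.

```python
import numpy as np


def temporal_edge_maps(spec: np.ndarray, width: int = 2):
    """Return (boundaries, onset, offset) for a (freq x time) spectrogram.

    boundaries[k] = b means the edge sits between frames b-1 and b.
    onset is large where energy rises across the boundary, offset where it falls.
    """
    n_frames = spec.shape[1]
    # csum[:, k] holds the sum of spec[:, :k], so window sums are differences.
    csum = np.cumsum(np.pad(spec, ((0, 0), (1, 0))), axis=1)
    boundaries = np.arange(width, n_frames - width + 1)
    after = (csum[:, boundaries + width] - csum[:, boundaries]) / width
    before = (csum[:, boundaries] - csum[:, boundaries - width]) / width
    diff = after - before
    return boundaries, np.maximum(diff, 0.0), np.maximum(-diff, 0.0)
```

The resulting onset and offset maps could serve as the local scores for edge parts, with the model's edges then constraining where those parts may sit relative to one another.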


Abstract

While automatic speech recognition (ASR) systems have steadily improved and are now in widespread use, their accuracy continues to lag behind human performance, particularly in adverse conditions. This thesis revisits the basic acoustic modeling assumptions common to most ASR systems and argues that improvements to the underlying model of speech are required to address these shortcomings.

A number of problems with the standard method of hidden Markov models (HMMs) and features derived from fixed, frame-based spectra (e.g. MFCCs) are discussed. Based on these problems, a set of desirable properties of an improved acoustic model is proposed, and we present a "parts-based" framework as an alternative. The parts-based model (PBM), based on previous work in machine vision, uses graphical models to represent speech with a deformable template of spectro-temporally localized "parts", as opposed to modeling speech as a sequence of fixed spectral profiles. We discuss the proposed model's relationship to HMMs and segment-based recognizers, and describe how both can be viewed as special cases of the PBM.

Two variations of PBMs are described in detail. The first represents each phonetic unit with a set of time-frequency (T-F) "patches" which act as filters over a spectrogram. The model structure encodes the patches' relative T-F positions. The second variation, referred to as a "speech schematic" model, more directly encodes the information in a spectrogram by using simple edge detectors and focusing more on modeling the constraints between parts.

We demonstrate the proposed models on various isolated recognition tasks and show the benefits over baseline systems, particularly in noisy conditions and when only limited training data is available. We discuss efficient implementation of the models and describe how they can be combined to build larger recognition systems. It is argued that the flexible templates used in parts-based modeling may provide a better generative model of speech than typical HMMs.

© 2012 Ken Schutte