Research Project Title
Audiovisual Speech Recognition: Data Collection and Feature Extraction in Automotive Environment
Principal Investigators
Mark Hasegawa-Johnson
Thomas Huang
Stephen Levinson
Unit # 19
Project Overview
This project experiments with audiovisual speech recognition using a multisensory visor-mounted array composed of two microphones and a video camera. We will acquire data in realistic environments, develop and apply robust audiovisual feature extraction algorithms, and test the resulting features by training and testing small-vocabulary speech recognition models.
Audio-video recordings of speech will be acquired under five realistic noise conditions: engine idling, windows closed at 35 mph, windows open at 35 mph, windows closed at 65 mph, and windows open at 65 mph. These data will then be used to develop and apply algorithms for robust audiovisual feature extraction. In particular, the graduate research assistants on this project will focus on two problems: (1) accurate visual tracking of the face and extraction of lip features; and (2) extraction of an accurate audio speech recognition feature stream from the two-microphone array. The extracted audiovisual features will be used to train and test four small-vocabulary speech recognizers: two binaural (two-microphone) audiovisual recognizers with different recognition architectures, one binaural audio-only recognizer, and one monaural audio-only recognizer.
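The noise conditions and recognizer configurations listed above imply a test grid of five conditions by four recognizers. The following is a minimal sketch of how that grid might be enumerated; the label strings and the function name evaluation_grid are hypothetical and are not taken from the project itself.

```python
from itertools import product

# Hypothetical labels for the five recording conditions named in the overview.
NOISE_CONDITIONS = [
    "idle",
    "35mph_windows_closed",
    "35mph_windows_open",
    "65mph_windows_closed",
    "65mph_windows_open",
]

# Hypothetical labels for the four recognizers named in the overview.
RECOGNIZERS = [
    "binaural_audiovisual_arch_a",
    "binaural_audiovisual_arch_b",
    "binaural_audio_only",
    "monaural_audio_only",
]


def evaluation_grid():
    """Return every (recognizer, noise condition) pair to be evaluated."""
    return list(product(RECOGNIZERS, NOISE_CONDITIONS))


if __name__ == "__main__":
    for recognizer, condition in evaluation_grid():
        print(recognizer, condition)
```

Enumerating the pairs explicitly makes it easy to check that every recognizer is scored under every condition.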
The objective of this research is to demonstrate that the word error rate (WER) of a binaural audiovisual recognizer is much lower than that of a monaural audio-only recognizer under typical automotive test conditions.
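For reference, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the recognizer output and the reference transcript, divided by the number of reference words. The sketch below illustrates that standard definition; the function name word_error_rate and the whitespace tokenization are assumptions made for illustration, not the project's scoring tool.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Assumes both transcripts are whitespace-separated word strings.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.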