Danfeng Li
Graduate Research Assistant

Research Assistant

Research Project Title
Audiovisual Speech Recognition: Data Collection and Feature Extraction in Automotive Environment

Principal Investigators
Mark Hasegawa-Johnson
Thomas Huang
Stephen Levinson

Unit # 19
Project Overview

This project experiments with audiovisual speech recognition using a multisensory visor-mounted array composed of two microphones and a video camera. We will acquire data in realistic environments, develop and apply robust audiovisual feature extraction algorithms, and test the resulting features by training and testing small-vocabulary speech recognition models.

Audio-video recordings of speech will be acquired in five realistic noise conditions: engine idling, windows closed at 35 mph, windows open at 35 mph, windows closed at 65 mph, and windows open at 65 mph. These recordings will then be used to develop and apply algorithms for robust audiovisual feature extraction. In particular, graduate research assistants working on this research will focus on two problems: (1) accurate visual tracking of the face and extraction of lip features, and (2) extraction of an accurate audio speech recognition feature stream from the two-microphone array. Extracted audiovisual features will be used to train and test four small-vocabulary speech recognizers: two binaural (two-microphone) audiovisual speech recognizers (with different recognition architectures), one binaural audio-only recognizer, and one monaural audio-only recognizer.

The objective of this research is to demonstrate that the word error rate (WER) of a binaural audiovisual recognizer is substantially lower than that of a monaural audio-only recognizer under typical automotive test conditions.
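
For readers unfamiliar with the metric, WER is conventionally computed as the minimum number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that standard definition, not the project's scoring tool; the function name word_error_rate and the example transcripts are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Illustrative helper (hypothetical name); transcripts are split on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up transcripts: one substitution in a three-word reference.
    print(word_error_rate("call home now", "call phone now"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference. Comparing two recognizers under this definition means scoring both against the same reference transcripts in each noise condition.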