Scene understanding using audio-visual fusion
PhD thesis
Apply online for this PhD position
How do we recognize objects that are both seen and heard? In this thesis we propose to study the fusion of auditory and visual information, gathered with microphones and cameras, in order to build a spatio-temporal audiovisual (AV) map of a scene. The goal is to determine the number of AV sources present in the scene and to select the sources that correspond to humans rather than to artifacts. Several cues can be exploited. For example, speech sounds are usually simultaneous with visual motion (such as head and lip movements). Furthermore, if a pair of microphones is used, the azimuth of an auditory source can be estimated from the interaural time difference (ITD), and this azimuth will typically be aligned with the spatial direction of the AV source. Finally, if a pair of cameras is used, the distance to the source can be estimated from binocular disparity. Exploiting these cues requires a geometric characterization of the sensors, e.g., a binocular camera pair associated with a binaural pair of microphones, as well as a statistical model of the incoming signals.
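The two geometric cues mentioned above can be illustrated with a minimal sketch. It assumes a far-field source, two microphones in free field (no head shadowing), and a rectified stereo camera pair; the function names, the speed-of-sound constant, and the numeric values are illustrative assumptions, not part of the thesis proposal.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air (assumption)


def azimuth_from_itd(itd_s, mic_spacing_m, c=SPEED_OF_SOUND):
    """Far-field azimuth in radians from an interaural time difference.

    Uses the free-field model sin(theta) = c * ITD / d, where d is the
    microphone spacing. The ratio is clamped to [-1, 1] so that noisy
    ITD measurements do not produce a math domain error.
    """
    ratio = c * itd_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))
    return math.asin(ratio)


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters for a rectified stereo pair: Z = f * B / disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical measurements: 0.25 ms ITD with microphones 0.2 m apart,
# and a 32-pixel disparity with an 800-pixel focal length and 0.12 m baseline.
azimuth_deg = math.degrees(azimuth_from_itd(0.00025, 0.2))
depth_m = depth_from_disparity(32.0, 800.0, 0.12)
```

A real binaural setup mounted on a head would need a model that accounts for the head between the microphones, and a real camera pair would need calibration before disparities can be converted to distances.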
The thesis will concentrate on the development of machine learning methods for fusing the two sensory modalities, for extracting as much information as possible from each AV source, for recognizing human sources as opposed to artifacts, and for characterizing the status of each recognized human: speaking or emitting non-speech sounds, involved in a dialog with another human, etc.
The overall approach will possibly rely on unsupervised methods, such as non-linear dimensionality reduction (Laplacian embedding, manifold learning), combined with mixture models for learning and classification.
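As a rough illustration of the mixture-model part only (the dimensionality-reduction step is not shown), the sketch below fits a one-dimensional Gaussian mixture with expectation-maximization to a handful of azimuth observations and assigns new observations to the most likely source. The data values, the function names, and the choice of initializing the means at the extremes of the data are assumptions made for the example; the methods actually developed in the thesis may differ substantially.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, init_means, n_iter=100, min_var=1e-6):
    """Fit a 1D Gaussian mixture with EM; returns (weights, means, variances)."""
    n, k = len(data), len(init_means)
    means = list(init_means)
    overall_mean = sum(data) / n
    overall_var = max(min_var, sum((x - overall_mean) ** 2 for x in data) / n)
    variances = [overall_var] * k  # start broad so every component sees all data
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation
        resp = []
        for x in data:
            p = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p] if total > 0.0 else [1.0 / k] * k)
        # M-step: re-estimate weights, means, and variances
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(min_var, var_j)
            weights[j] = nj / n
    return weights, means, variances


def assign(x, weights, means, variances):
    """Index of the most likely component for x, using log-densities."""
    scores = [
        math.log(w) - 0.5 * math.log(2.0 * math.pi * v) - (x - m) ** 2 / (2.0 * v)
        for w, m, v in zip(weights, means, variances)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up azimuth observations (degrees) from two hypothetical sources.
azimuths_deg = [-31.0, -29.5, -30.2, -30.8, -29.0, 19.0, 20.5, 21.0, 20.2, 19.6]
weights, means, variances = fit_gmm_1d(azimuths_deg, [min(azimuths_deg), max(azimuths_deg)])
```

In this toy setting each mixture component plays the role of one AV source; in practice the observations would be multi-dimensional and the number of components would itself have to be estimated.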
Relevant publications:
Vasil Khalidov, Florence Forbes, Miles Hansard, Elise Arnaud, Radu P. Horaud. Detection and Localization of 3D Audio-Visual Objects Using Unsupervised Clustering. ACM/IEEE International Conference on Multimodal Interfaces (ICMI’08), October 2008.
Elise Arnaud, Heidi Christensen, Yan-Chen Lu, Jon Barker, Vasil Khalidov, Miles Hansard, Bertrand Holveck, Hervé Mathieu, Ramya Narasimha, Elise Taillant, Florence Forbes, Radu P. Horaud. The CAVA corpus: synchronised stereoscopic and binaural datasets with head movements. ACM/IEEE International Conference on Multimodal Interfaces (ICMI’08), October 2008.
Eligibility: We seek a candidate holding a master's degree in computer science, with expertise in signal and image processing, statistics and probability theory, and excellent programming skills.
Start date: 1 October 2009
Contact person: Radu Patrice HORAUD
Deadline: 4 May 2009

