CAMIL: Computational Audio-Motor Integration through Learning

Experimental Setup and Data Acquisition

Hardware & Experimental Setup

Data were recorded using the Sennheiser MKE 2002 dummy-head, linked to a computer via a Behringer ADA8000 Ultragain Pro-8 digital external sound card. The head was mounted onto the University of Coimbra's audiovisual robot head POPEYE, which has four rotational degrees of freedom: a pan motion, a tilt motion, and two additional degrees of freedom for eye vergence. This device was specifically designed to achieve precise and reproducible movements with a very good signal-to-noise ratio. The emitter, a loud-speaker, was placed approximately 2.7 meters in front of the robot. The loud-speaker's input and the microphones' outputs were handled by two synchronized sound cards in order to record and play simultaneously. The experiment was carried out in real-world conditions, i.e., a room with reverberations and background noise.

Motor States & Direct Kinematics

The emitter remained fixed during the whole experiment while the robot was placed in different (pan, tilt) motor states. Each recording was annotated with its ground-truth motor state (α,β). The ground-truth 3D position (xs,ys,zs) of the sound source in the robot frame for each recording can therefore be retrieved using the following direct kinematics formula:

This model needs two parameters: the distance from the tilt axis to the microphones' midpoint, r = 0.22 m, and the distance from this midpoint to the emitter, d = 2.70 m.

Audio Recordings: Version 0.1

This dataset was used in the paper Learning the Direction of a Sound Source Using Head Motions and Spectral Features [1].

The robot was placed in 16,200 motor states: 180 pan rotations α in the range [-180°,180°] and 90 tilt rotations β in the range [-90°,90°] (one recording every 2°). At each motor state, one static and two dynamic binaural recordings of one second were made for each of two artificial sounds emitted by the speaker, a reference sound and a random-spectrum sound, totalling 6 binaural recordings per motor state. The names and locations of the emitted and recorded sounds in our dataset are summarized in the following table:

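A pan movement is a head rotation rightwards at constant speed dα/dt=9°/s, centered on the current motor state. A tilt movement is a head rotation downwards at constant speed dβ/dt=9°/s, centered on the current motor state. Since each recording lasts one second, a dynamic recording covers a 9° sweep, from 4.5° before to 4.5° after the annotated motor state. Emitted sounds were generated using: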

where l(t) is the loud-speaker's membrane displacement as a function of time t, K is a real positive constant corresponding to the global volume, F={fi}i=1..N is a fixed set of N frequency channels, and {ωi}i=1..N ∈ (0,1]^N and {φi}i=1..N ∈ [0,2π]^N are the weights and phases associated with each frequency channel. In practice, a set of N=600 frequency channels F={50,150,250...5950} was used. The reference sound was defined by fixed values of ωi and φi and was the same during the whole experiment. A different random-spectrum sound was emitted at each motor state by drawing ωi and φi from a uniform distribution.
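The definitions above are consistent with a weighted sum of sinusoids, l(t) = K Σi ωi sin(2π fi t + φi), where fi denotes the i-th frequency in F. The Python sketch below synthesizes a random-spectrum sound under that assumption. The functional form, the interpretation of the frequencies in Hz, the sample rate, and the names synthesize and random_spectrum are all hypothetical and are not taken from the dataset description. The frequency list is supplied by the caller; the usage lines build the listed pattern 50, 150, ..., 5950 with a 100 Hz spacing, which gives 60 values.

```python
import math
import random


def random_spectrum(n_channels, rng):
    """Draw one weight in (0, 1] and one phase in [0, 2*pi] per channel."""
    # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1].
    weights = [1.0 - rng.random() for _ in range(n_channels)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_channels)]
    return weights, phases


def synthesize(freqs_hz, weights, phases, gain=1.0,
               duration_s=1.0, sample_rate_hz=16000):
    """Sample gain * sum_i w_i * sin(2*pi*f_i*t + phi_i) on a regular grid.

    ASSUMPTION: the sum-of-sinusoids form and the default sample rate
    are illustrative choices, not values stated in the source text.
    """
    n_samples = int(round(duration_s * sample_rate_hz))
    signal = []
    for n in range(n_samples):
        t = n / sample_rate_hz
        value = sum(w * math.sin(2.0 * math.pi * f * t + p)
                    for f, w, p in zip(freqs_hz, weights, phases))
        signal.append(gain * value)
    return signal


# Usage: a short random-spectrum excerpt on the listed frequency pattern.
rng = random.Random(0)                      # fixed seed for repeatability
freqs = list(range(50, 6000, 100))          # 50, 150, ..., 5950 (assumed Hz)
weights, phases = random_spectrum(len(freqs), rng)
excerpt = synthesize(freqs, weights, phases, gain=0.5,
                     duration_s=0.01, sample_rate_hz=8000)
```

Drawing the weights and phases once and reusing them would correspond to the fixed reference sound, while redrawing them for every motor state would correspond to the random-spectrum sounds.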

Audio Recordings: Version 1.1

This dataset was used in the papers A Latently Constrained Mixture Model for Audio Source Separation and Localization [2], 2D Sound-Source Localization on the Binaural Manifold [3], and Variational EM for Binaural Sound-Source Separation and Localization.

The robot was placed in 10,800 motor states: 180 pan rotations α in the range [-180°,180°] and 60 tilt rotations β in the range [-60°,60°] (one recording every 2°). At each motor state, three static binaural recordings were made, each while the speaker emitted a different sound. Sound 1 corresponds to one second of white noise and can be used for training (section 3 in [2] or section 4 in [3]). Sound 2 corresponds to a random utterance from the TIMIT speech dataset and can be used for testing. Sound 0 corresponds to a silent speaker ("room silence") and can be used to determine the acoustic level threshold during tests (section 4 in [2] or section 5 in [3]).
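The motor-state grid described above can be enumerated with a few lines of Python. This sketch assumes that the upper bound of each angular range is excluded, which is one way to obtain exactly 180 pan values and 60 tilt values with a 2° step; which endpoint is actually excluded is not stated here. The names motor_states and SOUNDS are hypothetical.

```python
from itertools import product

STEP_DEG = 2  # one recording every 2 degrees


def motor_states(pan_range=(-180, 180), tilt_range=(-60, 60), step=STEP_DEG):
    """Return all (pan, tilt) pairs in degrees.

    ASSUMPTION: ranges are half-open, i.e. the upper bound is excluded.
    """
    pans = range(pan_range[0], pan_range[1], step)
    tilts = range(tilt_range[0], tilt_range[1], step)
    return list(product(pans, tilts))


# Sound indices as described in the text.
SOUNDS = {0: "room silence", 1: "white noise", 2: "TIMIT utterance"}

states = motor_states()
n_recordings = len(states) * len(SOUNDS)
print(len(states), n_recordings)  # prints: 10800 32400
```

With one static recording per sound at each of the 10,800 motor states, this gives 32,400 binaural recordings in total.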