IEEE Workshop on Three-Dimensional Cinematography (3DCINE'06) Thursday, June 22, New York City (in conjunction with CVPR)

NEW! The final program of the workshop is now online. See below. Registration is now open on the CVPR web site. See you in NYC next June!

General overview

Recent developments in computer vision and computer graphics, especially in such areas as multiple-view geometry and image-based rendering, are now making it possible to generate three-dimensional models of dynamic scenes from multiple cameras at video frame rates. We call this process "three-dimensional cinematography", since it extends traditional cinematography from two dimensions (images) to three dimensions (solid objects which can be rendered with photo-realistic textures from arbitrary viewpoints).

Many working prototypes have already demonstrated this capability in various forms and under different names: virtualized reality, free-viewpoint video, 3D video, etc. Those efforts are spread across multiple disciplines and research communities, and there is a need for a regular workshop, with its roots in computer vision, that would bring all the researchers together.

Recent advances have now clearly shown the promise of 3D cinematography systems allowing multiple-camera capture, processing, transmission and rendering of 3D models of real scenes in real time. Yet, many problems remain to be solved before such systems can be transposed from blue screen studios to the real world. The remaining problems are both theoretical and practical.

Those are difficult and fascinating questions which will no doubt generate new research directions both in computer vision and in graphics for many years to come.

The goal of this workshop is therefore to bring together researchers in computer vision and graphics who are building three-dimensional cinematography systems to present their solutions to those problems, show their latest results, share insights on current issues and new research directions, and discuss real world applications of their work.

Final program

3DCINE is a one-day workshop with four sessions covering the areas of large camera arrays; video-based rendering; 3D reconstruction of dynamic scenes; and commercial applications of 3D cinematography. Each session typically consists of two regular papers and two invited papers by leading researchers or practitioners. We hope that this first workshop will lead the way to a regular event associated with CVPR on a yearly basis.

The program of the workshop is as follows.

Takeo Kanade and P. J. Narayanan: Historical Perspective of the 4D Virtualized Reality Project (invited paper).

Dynamic events such as a sports event, a ballet performance, or a lecture are of great interest. Recording them digitally so that they can be experienced at a different place and time requires 4D capture: three dimensions for their geometry and appearance, plus the fourth dimension of time. Cameras are suitable for this task as they are non-intrusive, universal, and inexpensive. Computer vision techniques have advanced sufficiently to make such 4D capture possible. In this paper, the authors present a historical perspective on the Virtualized Reality (TM) system developed in the mid-1990s at CMU for the 4D capture of dynamic events.

Ismail Oner Sebe, Suya You, Ulrich Neumann: Model-Driven Video-Based Rendering for Vehicles (regular paper).

Adrian Hilton: Multiple camera studio production of actor performance (invited talk).

This talk presents a comparison of model-based and model-free reconstruction of people from multiple camera views in a studio environment, together with a new approach to the capture and representation of people. Shape and appearance of the reconstructed model are optimised simultaneously based on multiple view silhouette, stereo and feature correspondence. A priori knowledge of surface structure is introduced as regularisation constraints. Model-based reconstruction assumes a known generic humanoid model a priori, which is fitted to the multi-view observations to produce a structured representation for animation. Model-free reconstruction makes no a priori assumptions about scene geometry, allowing the reconstruction of complex dynamic scenes. Results are presented for reconstruction of sequences of people from multiple views. The model-based approach produces a consistent structured representation, which is robust in the presence of visual ambiguities. This overcomes limitations of existing visual-hull and stereo techniques. Model-free reconstruction allows high-quality novel view synthesis with accurate reproduction of the detailed dynamics of hair and loose clothing. Multiple view optimisation achieves a visual quality comparable to the captured video, without visual artefacts due to misalignment of images.

Georgios Litos, Xenophon Zabulis, Georgios Triantafyllidis: Synchronous Image Acquisition based on Network Synchronization (regular paper).

Hansung Kim, Ryuuki Sakamoto, Itaru Kitahara, Kiyoshi Kogure: Cinematized Reality - Cinematographic 3D Video System for Daily Life Using Multiple Outer/Inner Cameras (regular paper).

Abhijit Ogale and Yannis Aloimonos: The way of the future. Segmentation into surfaces: a compositional approach (invited talk).

The success of 3D cinematography relies mainly on our ability to create three-dimensional models of the world from images. The current state of the art has reached a plateau because it depends on the accuracy of the matching between images (optic flow or correspondence). The authors believe that a more fruitful approach to 3D models is one that is preceded by an explicit step of surface segmentation, i.e. finding the different surfaces in the scene and their boundaries. They describe a compositional solution that utilizes all available cues: intensity/edges, color, stereo and motion. The solution proceeds by first finding illusory contours. Since a segmentation consists of closed regions or segments, one can inscribe a circle (or closed curve) inside any segment. This is equivalent to a swirling vector field inside the segment. The authors provide a number of constraints on these swirling fields that give rise to a dynamical system described by the Fokker-Planck equations. Segmentation then amounts to building this swirling field. The new theory will be demonstrated by a number of examples.

Zhigang Zhu and Hao Tang: Content-Based Dynamic 3D Mosaics (regular paper).

Takanori Senoh, Terumasa Aoki, Hiroshi Yasuda and Takuyo Kogure: Space-Sampling Method for 3D Cinemas (regular paper).

Takeshi Naemura: Light Field Live With Thousands of Lenslets (invited talk).

This talk introduces a 3D live video system named LIFLET, which stands for Light Field Live with Thousands of Lenslets. It can capture thousands of different views of a 3D dynamic scene, and synthesize free-viewpoint images of the scene interactively. Integral photography optics plays an important role in realizing photo-realistic 3D computer graphics. Conversely, computer graphics techniques solve some optical problems in 3D displays. LIFLET is a result of combining the merits of both fields of research. The goal of this project is to provide a 3D live video system, which could introduce new digital media. It is suitable for various applications, including 3D broadcasting, 3D photometric archiving, and 3D content creation for movies or games. This is not a 3D display technology, but a real-time image-based rendering system that is applicable to living things and to complex reflection and refraction in the real world. For this purpose, one needs to capture multiple views of a 3D dynamic scene simultaneously, and the straightforward way is to use an array of cameras. One can synthesize free-viewpoint images from the captured multiple views by the ray-space method or light field rendering. Unfortunately, developing a large camera-array system with a great number of cameras is neither easy nor cheap. To date, the number of cameras in camera-array systems is at most a hundred or so. In order to solve this problem, the author proposes the use of thousands of lenslets. Since lenslets are typically smaller than cameras, such a system is able to acquire denser information about a 3D scene. Moreover, it is not so difficult to develop a large lens-array system. As the talk emphasizes, it is important that the synthesized image be free of any distortion and maintain correct parallax. LIFLET can remove some optical distortions and utilize a view-dependent depth map, which is effective for enhancing the resolution of free-viewpoint images.

Dejun Wang: Towards robust and physically plausible shaded stereoscopic segmentation for piecewise constant albedo (regular paper).

Masayuki Tanimoto: Free Viewpoint Television for 3D Scene Reproduction and Creation (invited paper).

FTV captures all visual information of 3D scenes at video frame rates. Captured information is processed in the ray-space for 3D scene reproduction and creation. The author proposed the concept of FTV and verified its feasibility with the world's first real-time system. The author has been developing FTV based on the ray-space representation. In the ray-space representation, one ray in the 3D real space is represented by one point in the ray-space. The ray-space is a virtual space. However, it is directly connected to the real space. The ray-space is generated easily by collecting multi-view images while taking the camera parameters into account. The author uses two types of ray-space for FTV. One is the orthogonal ray-space, where a ray is expressed by the intersection of the ray and the reference plane and its direction. The other is the spherical ray-space, where the reference plane is set to be normal to the ray. The orthogonal ray-space is used for FTV with a linear camera arrangement, whereas the spherical ray-space is used for FTV with a circular camera arrangement. Reproduced and created 3D scenes are displayed on the SeeLINDER, a 360-degree ray-producing display that allows multiple viewers to see the 3D scene. The display consists of a cylindrical parallax barrier and one-dimensional light source arrays. Semiconductor light sources such as LEDs are aligned in a vertical line for the one-dimensional light source arrays. The cylindrical parallax barrier rotates fast, and the light source arrays rotate slowly in the opposite direction. FTV was proposed to MPEG in 2002 and considered the most challenging scenario in 3DAV (3 Dimensional Audio and Video). MPEG established the framework of MVC (multi-view video coding) in 2004. Eleven organizations submitted 8 proposals in response to the "Call for Proposals on Multi-View Video Coding" in 2006. They have shown that specific MVC technology outperforms the AVC reference solution significantly in terms of PSNR and subjective quality. Several core experiments are being conducted to evaluate this technology in detail.
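
As an illustration of the orthogonal ray-space described in this abstract, the following minimal Python sketch maps a 3D ray to a single point given by its intersection with a reference plane and its direction. The choice of z = 0 as the reference plane, the angle conventions (theta, phi) and the function name are assumptions made for this example; they are not taken from the author's system.

```python
import math

def ray_to_orthogonal_ray_space(origin, direction):
    """Map a 3D ray to one ray-space point (x, y, theta, phi).

    Assumptions for this sketch: the reference plane is z = 0,
    theta is the horizontal angle and phi the vertical angle of the
    ray direction, both measured relative to the z axis.
    """
    ox, oy, oz = origin
    dx, dy, dz = direction
    if dz == 0:
        raise ValueError("ray is parallel to the reference plane z = 0")
    # Parameter t at which origin + t * direction meets the plane z = 0.
    t = -oz / dz
    x = ox + t * dx
    y = oy + t * dy
    # The direction is described by two angles; its length does not matter.
    theta = math.atan2(dx, dz)
    phi = math.atan2(dy, math.hypot(dx, dz))
    return x, y, theta, phi

# Two descriptions of the same ray (different starting point on the line,
# differently scaled direction) map to the same ray-space point.
a = ray_to_orthogonal_ray_space((0.0, 0.0, -1.0), (1.0, 0.0, 1.0))
b = ray_to_orthogonal_ray_space((2.0, 0.0, 1.0), (2.0, 0.0, 2.0))
```

Under these assumptions, `a` and `b` agree up to floating-point rounding, which reflects the statement that one ray in real space corresponds to one point in the ray-space.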