NEW! The final program of the workshop is now online. See below.
Registration is now open on the CVPR web site. See you in NYC next June!
General overview
Recent developments in computer vision and computer graphics, especially in
such areas as multiple-view geometry and image-based rendering, are now
making it possible to generate three-dimensional models of dynamic scenes from
multiple cameras at video frame rates. We call this process "three-dimensional
cinematography", since it extends traditional cinematography from two dimensions
(images) to three dimensions (solid objects which can be rendered with photo-realistic
textures from arbitrary viewpoints).
Many working prototypes have already demonstrated this capability in various
forms and under different names - virtualized reality, free-viewpoint video,
3D video, etc. These efforts are being pursued across multiple disciplines
and research communities, and there is a need for a regular workshop, rooted
in computer vision, that would bring all of these researchers together.
Recent advances have now clearly shown the promise of 3D cinematography systems
allowing multiple-camera capture, processing, transmission and rendering of
3D models of real scenes in real time. Yet, many problems remain to be solved
before such systems can be transposed from blue screen studios to the real world.
Problems are both theoretical and practical:
- Scaling issues. How many cameras should be used to fully
sample a complex, dynamic scene with multiple actors? How accurately should
the cameras be synchronized and calibrated?
- Representation issues. How can the geometry and texture
of a scene be separately extracted and represented independently of the original
viewpoints, and at which levels of detail?
- Modeling issues. Which information can be reliably extracted
from video streams to allow temporally consistent 3D reconstruction? How
precisely can the geometry and texture of a scene be recovered in real-world
situations? How can prior knowledge about the scene geometry and appearance
be used?
- Implementation issues. How can the massive amounts of multiple
video and geometry streams generated by 3D cinematography be stored, processed
and rendered efficiently? How can such processing be performed on-line at
video frame rate, without sacrificing quality?
- Human factors and aesthetic issues. How will end users
navigate within 3D cinematographic scenes? Will 3D cinematography produce
exact copies of the real world? Or will it evolve into a more elaborate,
yet to be discovered, art form?
Those are difficult and fascinating questions which will no doubt generate
new research directions both in computer vision and in graphics for many years
to come.
The goal of this workshop is therefore to bring together researchers
in computer vision and graphics who are building three-dimensional cinematography
systems to present their solutions to those problems, show their latest
results, share insights on current issues and new research directions, and discuss
real world applications of their work.
Final program
3DCINE is a one-day workshop with four sessions covering the areas of large
camera arrays; video-based rendering; 3D reconstruction of dynamic scenes; and
commercial applications of 3D cinematography. Each session typically consists
of two regular papers and two invited papers by leading researchers
or practitioners. We hope that this first workshop will lead the way to a regular
event associated with CVPR on a yearly basis.
The program of the workshop is as follows.
Takeo Kanade and P. J. Narayanan: Historical Perspective of the 4D Virtualized Reality Project (invited paper).
Dynamic events such as a sports event, a ballet performance, or a
lecture are of great interest. Recording them digitally for
experiencing in a spatiotemporally distant setting requires 4D
capture: three dimensions for their geometry/appearance over the
fourth dimension of time. Cameras are suitable for this task as
they are non-intrusive, universal, and inexpensive. Computer
vision techniques have advanced sufficiently to make such 4D
capture possible. In this paper, the authors present a historical
perspective on the Virtualized Reality (TM) system developed in
the mid-1990s at CMU for the 4D capture of dynamic events.
Ismail Oner Sebe, Suya You, Ulrich Neumann: Model-Driven Video-Based Rendering for Vehicles (regular paper).
Adrian Hilton: Multiple camera studio production of actor performance (invited talk).
This talk presents a comparison of model-based and model-free
reconstruction of people from multiple camera views in a studio
environment together with a new approach to the capture and
representation of people. Shape and appearance of the
reconstructed model are optimised simultaneously based on multiple
view silhouette, stereo and feature correspondence. A priori
knowledge of surface structure is introduced as regularisation
constraints. Model-based reconstruction assumes a known generic
humanoid model a priori, which is fitted to the multi-view
observations to produce a structured representation for animation.
Model-free reconstruction makes no a priori assumptions about scene
geometry, allowing the reconstruction of complex dynamic scenes.
Results are presented for reconstruction of sequences of people
from multiple views. The model-based approach produces a
consistent structured representation, which is robust in the
presence of visual ambiguities. This overcomes limitations of
existing visual-hull and stereo techniques. Model-free
reconstruction allows high-quality novel view-synthesis with
accurate reproduction of the detailed dynamics for hair and loose
clothing. Multiple-view optimisation achieves a visual quality
comparable to the captured video, without visual artefacts due to
misalignment of images.
Georgios Litos, Xenophon Zabulis, Georgios Triantafyllidis: Synchronous Image Acquisition based on Network Synchronization (regular paper).
Hansung Kim, Ryuuki Sakamoto, Itaru Kitahara, Kiyoshi Kogure: Cinematized Reality - Cinematographic 3D Video System for Daily Life Using Multiple Outer/Inner Cameras (regular paper).
Abhijit Ogale and Yannis Aloimonos: The way of the future. Segmentation into surfaces: a compositional approach (invited talk).
The success of 3D cinematography relies mainly on our ability to
create three-dimensional models of the world from images. The current
state of the art has reached a plateau because it depends on the
accuracy of the matching between images (optic flow or
correspondence). The authors believe that a more fruitful approach
to 3D models is one that is preceded by an explicit step of
surface segmentation, i.e. finding the different surfaces in the
scene and their boundaries. They describe a compositional solution
that utilizes all available cues, intensity/edges, color, stereo
and motion. The solution proceeds by first finding illusory
contours. Since a segmentation consists of closed regions or
segments, one can inscribe a circle (or closed curve) inside any
segment. This is equivalent to a swirling vector field inside the
segment. The authors provide a number of constraints on these
swirling fields that give rise to a dynamical system described by
the Fokker-Planck equations. Segmentation then amounts to building
this swirling field. The new theory will be demonstrated by a
number of examples.
Zhigang Zhu and Hao Tang: Content-Based Dynamic 3D Mosaics (regular paper).
Takanori Senoh, Terumasa Aoki, Hiroshi Yasuda and Takuyo Kogure: Space-Sampling Method for 3D Cinemas (regular paper).
Takeshi Naemura: Light Field Live With Thousands of Lenslets (invited talk).
This talk introduces a 3D live video system named LIFLET which
stands for Light Field Live with Thousands of Lenslets. It can
capture thousands of different views of a 3D dynamic scene, and
synthesize free-viewpoint images of the scene interactively. The
integral photography optics plays an important role in realizing
photo-realistic 3D computer graphics. On the other hand, computer
graphics techniques solve some optical problems in 3D displays.
LIFLET is a result of combining the merits of both fields of
research.
The goal of this project is to provide a 3D live video system,
which could introduce new digital media. It is suitable for
various applications, including 3D broadcasting, 3D photometric
archiving, and 3D content creation for movies or games. This is
not a 3D display technology, but a real-time image-based
rendering system that is applicable to living things and to complex
reflection and refraction in the real world.
For this purpose, one needs to capture multiple views of a 3D
dynamic scene simultaneously, and it is straightforward to use an
array of cameras. One can synthesize free-viewpoint images from
the captured multiple views by the ray-space method or light field
rendering. Unfortunately, developing a large camera-array system
with a great number of cameras is neither easy nor cheap.
To date, the number of cameras in camera-array systems is
at most a hundred or so.
In order to solve this problem, the author proposes the use of
thousands of lenslets. Since lenslets are typically smaller than
cameras, such a system is able to acquire denser information of a
3D scene. Moreover, it is not so difficult to develop a large
lens-array system. As the talk emphasizes, it is important that
the synthesized image should be free of any distortion and
maintain correct parallax. LIFLET can remove some optical
distortions and utilize a view-dependent depth map, which is
effective for enhancing the resolution of free-viewpoint images.
Dejun Wang: Towards robust and physically plausible shaded stereoscopic segmentation for piecewise constant albedo (regular paper).
Masayuki Tanimoto: Free Viewpoint Television for 3D Scene Reproduction and Creation (invited paper).
FTV captures all visual information of 3D scenes at video frame
rates. Captured information is processed in the ray-space for 3D
scene reproduction and creation. The author proposed the concept
of FTV and verified its feasibility by the world's first real-time
system. The author has been developing FTV based on the ray-space
representation. In the ray-space representation, one ray in the 3D
real space is represented by one point in the ray space. The
ray-space is a virtual space. However, it is directly connected to
the real space. The ray-space is generated easily by collecting
multi-view images with the consideration of camera parameters. The
author uses two types of ray-space for FTV. One is the orthogonal
ray-space, where a ray is expressed by the intersection of the ray
and the reference plane and its direction. The other is the
spherical ray-space, where the reference plane is set to be normal
to the ray. The orthogonal ray-space is used for FTV with a linear
camera arrangement, whereas the spherical ray-space is used for FTV
with a circular camera arrangement.
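The orthogonal parameterization described above can be illustrated with a minimal sketch. It assumes the reference plane is z = 0 and that the ray direction is encoded as two angles; the function name `ray_to_orthogonal_ray_space` and the angle convention are hypothetical choices made for illustration, not details of the author's FTV system.

```python
import math


def ray_to_orthogonal_ray_space(origin, direction):
    """Map a 3D ray to one point (x, y, theta, phi) in an orthogonal ray-space.

    Assumptions (illustrative, not from the FTV system):
      - the reference plane is z = 0;
      - theta is the horizontal angle and phi the vertical angle of the ray,
        both measured relative to the +z axis.
    """
    ox, oy, oz = origin
    dx, dy, dz = direction
    if abs(dz) < 1e-12:
        # A ray parallel to the reference plane never intersects it.
        raise ValueError("ray is parallel to the reference plane z = 0")

    # Intersection of the ray with the reference plane z = 0.
    t = -oz / dz
    x = ox + t * dx
    y = oy + t * dy

    # Direction of the ray, expressed as two angles.
    theta = math.atan2(dx, dz)
    phi = math.atan2(dy, math.hypot(dx, dz))
    return x, y, theta, phi


# One ray in real space becomes one point in the ray-space.
print(ray_to_orthogonal_ray_space((0.0, 0.0, -1.0), (1.0, 0.0, 1.0)))
```

Under these assumptions, collecting the rays that correspond to all the pixels of all calibrated cameras fills the ray-space with samples, and a virtual view reads back the samples for the rays passing through its own viewpoint.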
Reproduced and created 3D scenes are displayed on a 360-degree
ray-producing SeeLINDER display, allowing multiple viewers to see
the 3D scene. It consists of a cylindrical parallax barrier and
one-dimensional light source arrays. Semiconductor light sources
such as LEDs are aligned in a vertical line for the
one-dimensional light source arrays. The cylindrical parallax
barrier rotates fast, and the light source arrays rotate slowly in
the opposite direction.
FTV was proposed to MPEG in 2002 and considered the most
challenging scenario in 3DAV (3 Dimensional Audio and Video). MPEG
established the framework of MVC (multi-view video coding) in
2004. Eleven organizations submitted eight proposals in response to the "Call
for Proposals on Multi-View Video Coding" in 2006. They have shown
that specific MVC technology significantly outperforms the AVC reference
solution in terms of PSNR and subjective quality.
Several core experiments are being conducted to evaluate this technology in
detail.
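For readers unfamiliar with the PSNR metric mentioned above, here is a minimal sketch of the standard definition, PSNR = 10 log10(peak^2 / MSE). It assumes 8-bit samples (peak value 255) given as flat lists; the function name `psnr` and the sample values are illustrative and are not taken from the MVC evaluation.

```python
import math


def psnr(reference, decoded, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized sample lists.

    Assumes 8-bit samples by default (peak = 255); illustrative only.
    """
    if not reference or len(reference) != len(decoded):
        raise ValueError("inputs must be non-empty and of equal length")

    # Mean squared error between the reference and the decoded samples.
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak * peak / mse)


# Hypothetical sample values: each decoded sample is off by 10 grey levels.
print(psnr([100, 100], [110, 90]))
```

A higher PSNR means the decoded frame is numerically closer to the original. Because it does not always track perceived quality, it is usually reported alongside subjective quality, as in the text above.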