Abstract It has recently become possible to present a live virtual stage application in which a human participates in real time through motion capture. A major problem in participatory animation is deciding where to place the camera and in which direction to point it so as to provide interesting feedback. In participatory animation such as a live stage performance or a television show, the animation is not specified by commands but by a real human's actions. Unlike filming, pre- and post-editing are not possible, so the decision about which region to visualize must be made in real time for camera placement. There have recently been several works on automatic camera control, but none of them is applicable to improvised virtual stage applications because they lack the ability to assess the situation in real time. This paper addresses these issues and presents a method for automatically determining camera placement to visualize live virtual stage animation. Our approach spots the actor's actions in real time by choosing the necessary level among multiple levels of perception and analysis.
Keywords: Automatic Camera Control, Participatory Animation, Intelligent Action Spotting, Virtual Stage, Real-time Application, Intelligent Agent
1. Background
Before turning our attention to the camera control problem, LIG had been developing participatory animation systems, in which at least one user could participate on a synthetic stage as an actor, not by giving commands to an avatar but by performing him- or herself. An application was developed using real-time motion capture, by which the avatar copies the user's motion almost exactly [1]. This application was demonstrated at Orbit'98, the largest annual technology exhibition in Switzerland. After motion sensors were attached to the body, a volunteer from the audience performed while onlookers watched the synthetic avatar copy the motion (Figure 1). To add some variety, the operator was allowed to add facial expressions or speech according to the situation. Despite the simplicity of the system's functionality, people seemed to enjoy the improvised performance mirrored by the avatar.
There was a problem, however. The participant sometimes performed with the whole body, but also occasionally performed motions with only small portions of the body, such as the fingers, arms, or feet, so as to exhibit local motions more clearly. Even though an operator adjusted the camera placement while monitoring the participant's actions, he often failed to keep up and zoom in on the body part of interest appropriately. From time to time the performer also disappeared off the screen. Eventually, the participant gave up on non-viewable motions and tried to stay within the current screen area.
Similar problems occur in mixed reality applications such as live TV shows, where real and virtual humans act live together on a virtual stage. If scenes are portrayed only from a particular point of view or from a small set of strategically placed viewpoints, it is not easy to bring intentional movements, especially detailed motions, into focus at the right moment.
From this experience, we learned of the practical need for automatic camera control as one of the functions of a real-time participatory animation system.
This paper addresses the problem of camera placement for real-time participatory animation performed on a stage. We call this type of animation a virtual stage application. As mentioned above, improvised animation, live television shows in mixed reality, and virtual theater plays all belong to virtual stage applications because they all expect spectators and unrecorded play. In recent years some work has been done on automatic camera control for digital animation [2, 3, 4, 5, 6]. Unlike ours, much of it aimed to apply cinematic knowledge to camera control for movie-type digital animation in which the expected scenes are known in advance. Our target domain, on the other hand, is improvised stage performance such as a live TV show. Our problem area thus concerns issues that were not covered in earlier work.
2. Related Work
Early work in 3D animation was devoted to presenting convenient metaphors through which objects or scenes were drawn from a particular character's point of view [7]. More recently, several efforts have begun to address intelligent camera control. Blinn [8] suggested specifying the camera by considering `what should be placed in the frame' rather than just describing where the camera should be. Most work after Blinn's has considered adding cinematic knowledge to decide camera placement. The ESPLANADE system by Karp [9, 10] employed film idioms to decide camera placement. In the CINEMA system [2], Drucker implemented a generic low-level camera library, and his later work (the CamDroid system [3]) provided a visual tool allowing the animator to construct encapsulated camera modules from environmental constraints. Quite similar work was done by Christianson et al. [4], but with a declarative approach for describing domain-specific strategies for shot transitions.
Those systems assumed that the animator could see the to-be-animated scenes in advance and then define a camera strategy. Moreover, when to change the shot type depended on explicit indication through event generation.
Bares [5] did not require the animator to see the scenes and construct the strategy in advance, but instead asked for the user's preferences for camera control. Camera planning was then done in real time according to those preferences, but within the same preference, shot characteristics were decided probabilistically. This may work well for the animation of single-body, solid objects, as in vehicle navigation, but it cannot vary shot properties to visualize dynamic changes in the interesting motion area of articulated bodies.
3. Modeling of Active Region and Camera Shot
To visualize real-time performance, we use a categorized model of the target regions to be shown. The regions are basically conceptual, and the actual scope of the current active region is determined dynamically. At every moment, only one of the regions is spotted as the main visualizing target. Each region also leads to a different type of camera shot.
1) Modeling of Active Actors / Active Regions
Determining the current camera shot depends on finding the focus region to be visualized in the current frame. This region usually intersects with the active actors who are currently performing significant motion on the virtual stage. An actor's current activeness is represented by the degree of deviation from his or her resting state. The definition of the resting state depends on the actor's role. In a performance, there can be main actors and other, less dominant actors such as back dancers. Background actors can gain the focus if they are active while the main actors are not.
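As a rough illustration (not the paper's actual implementation), the sketch below computes an actor's activeness as a weighted deviation of the current pose from a role-dependent resting pose and selects the actors in focus; the distance metric, the `threshold` value, and all names are illustrative assumptions.

```python
import numpy as np

def activeness(current_pose, resting_pose, weights=None):
    """Degree of deviation from the resting state, here a weighted
    Euclidean distance over joint angles (hypothetical metric)."""
    diff = np.asarray(current_pose, dtype=float) - np.asarray(resting_pose, dtype=float)
    w = np.ones_like(diff) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sqrt(np.sum(w * diff ** 2)))

def actors_in_focus(main, background, threshold=0.2):
    """`main` and `background` map actor names to activeness values.
    Background actors gain the focus only while no main actor is active."""
    active_main = [name for name, a in main.items() if a > threshold]
    if active_main:
        return active_main
    return [name for name, a in background.items() if a > threshold]
```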
Focusing on the active actors, the active region mainly contains the body portion relevant to generating the current intentional motion. Many of the regions also correspond to the pleasing cutting heights that cinematographers have identified [11], and they match the region-based shot types described in the next subsection.
2) Modeling of Shots
Between the start and stop of filming a particular scene, a shot represents the continuity of all camera parameters over that period of time. A shot is thus described by its shot bodies and the type of camera placement, which also implies the transition type (gradual/jump) between shots. In turn, the camera placement is decided so as to frame the current active region of the shot bodies, constrained by the spectator's field of view (the spectator's constraint). Camera motions such as tracking, zooming in or out, and panning are not explicitly specified; they are generated automatically to meet the goal of camera placement for the given shot.
In addition to the region-based shot descriptions, that is, close_up, close_shot, medium_shot, full_shot, and long_take_shot, there is a master shot. The master shot behaves as the default shot when there is no current active region. It usually includes all the main actors with a default cutting height for the domain. After a transition to a new shot, if no significant motion is generated by any of the actors for some amount of time, the camera gradually returns to the master_shot.
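The shot model can be summarized in code roughly as follows. This is a minimal Python sketch; the enum names, the `Shot` fields, and the `IDLE_TIMEOUT` value are assumptions rather than the system's actual data structures.

```python
from dataclasses import dataclass
from enum import Enum, auto

class ShotType(Enum):
    CLOSE_UP = auto()        # face region
    CLOSE_SHOT = auto()      # upper body
    MEDIUM_SHOT = auto()     # from the hips up
    FULL_SHOT = auto()       # whole body
    LONG_TAKE_SHOT = auto()  # whole stage / walking actor
    MASTER_SHOT = auto()     # default: all main actors

class Transition(Enum):
    GRADUAL = auto()
    JUMP = auto()

@dataclass
class Shot:
    bodies: list             # the actors framed by this shot
    shot_type: ShotType
    transition: Transition   # how the camera moves into this shot

IDLE_TIMEOUT = 5.0           # seconds without significant motion (illustrative)

def fall_back_to_master(current: Shot, idle_time: float, main_actors: list) -> Shot:
    """Gradually return to the master_shot when no actor has produced
    significant motion for a while."""
    if idle_time > IDLE_TIMEOUT and current.shot_type is not ShotType.MASTER_SHOT:
        return Shot(main_actors, ShotType.MASTER_SHOT, Transition.GRADUAL)
    return current
```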
4. Real-time Action Spotting
Because of the `improvising' property of participatory animation, it is not feasible to plan the camera strategy in advance for specific types of shots and actor geometry. Camera placement should thus be decided from limited information, mainly based on observation of the actor's current performance.
Our approach is therefore reactive camera control using real-time action spotting. No declarations are made other than the active regions and the basic shot types. The difficulty of real-time analysis of whole-body action mainly comes from the many degrees of freedom of the target bodies. One way to reduce this complexity is to consider only the necessary part and perform minimal analysis. That is, with multiple levels of perception, action spotting goes through the most detailed analysis or recognition only when necessary. This section describes the overall architecture for action analysis and the spotting algorithm.
1) Architecture of CAIAS
The architecture of CAIAS (Camera Agent based on Intelligent Action Spotting) consists of several modules, as shown in Figure 3. The perceiver is responsible for getting motion data from the sensors. The perception, however, is not always applied uniformly to the whole region. Told by the camera planner `what to sense', the perceiver only elicits the data for a selected region at a specified level of sensing detail. The reactive camera planner decides at which level of detail the motion data should be perceived. It also communicates with the motion analyzer if needed. Finally, it decides the region to visualize and sends the calculated camera control parameters to the actuator module.
Figure 3. Architecture of CAIAS
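A minimal sketch of this perceiver/planner loop is given below, assuming a generic sensor-reading callable; the class and parameter names are illustrative, not CAIAS's actual interfaces.

```python
from typing import Callable, Dict, List

class Perceiver:
    """Delivers motion data only for the region and level of detail the
    planner asks for ('what to sense'), instead of the whole body every frame."""
    def __init__(self, read_sensors: Callable[[str, int], List[float]]):
        self.read_sensors = read_sensors        # e.g. a motion-capture driver

    def sense(self, region: str, level: int) -> Dict[str, List[float]]:
        return {region: self.read_sensors(region, level)}

class ReactiveCameraPlanner:
    """Chooses the perception level, consults the motion analyzer when needed,
    and sends the computed camera parameters to the actuator."""
    def __init__(self, perceiver: Perceiver, analyzer: Callable, actuator: Callable):
        self.perceiver, self.analyzer, self.actuator = perceiver, analyzer, actuator
        self.region, self.level = "whole_body", 0           # start coarse

    def step(self) -> None:
        data = self.perceiver.sense(self.region, self.level)
        result = self.analyzer(data, self.level) if self.level > 0 else data
        self.region, self.level, placement = self.decide(result)
        self.actuator(placement)                             # camera control parameters

    def decide(self, result):
        # Placeholder; the decision cascade is summarized in the next subsection.
        return self.region, self.level, {"frame": self.region}
```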
2) Level of Perception / Analysis
The perception of human action data can be carried out at three levels:
Figure 4. Reactive Camera Planner and Motion Analyzer
The levels do not imply a hierarchical ordering of the action spotting process; they represent the existing complexity of perception and analysis. The reactive camera planner chooses the current perception/analysis level according to shot priority or to the result of action spotting. The following items summarize the camera placement decision procedure; a code sketch of the whole cascade is given after the list.
1. Priority is given to closing up on facial motion, if any. That is, whenever there is an actor's facial motion, the camera shot changes to a close_up of his or her face. The close-up should be done quickly because a facial expression usually lasts only a very short moment, so the transition to this shot is a quick jump. The return to the previous shot type, however, occurs slowly, as a normal-speed transition.
2. If no facial motion is perceived, the next check concerns the actor's position movement, which causes the camera to change to a long-distance shot, with or without tracking the actor. A position change means the actor is stepping around in place or walking to another location. Since this is checked after the facial motion, the camera can still jump to close up on a facial motion even while tracking a walking motion or viewing the scene in a long shot.
3. If none of the above motions took place, the pivotal points of each region are investigated in turn, from the lower body to the upper body. If the knee pivotal points reveal a lower-body movement, the whole-body motion is visualized with a full-shot camera placement.
4. Otherwise, if a motion is detected through the displacement of the hip pivotal points, the camera is placed in a medium shot.
5. If all the above investigations fail, the motion is not occurring through the global body. Possible upper-body action in the close_shot region is then explored. In this case, posture recognition or recognition of complex hand gestures can be attempted.
6. As a default, if no movement takes place for some amount of time, the shot transitions to the master_shot.
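A compact sketch of this cascade, under the assumption that the perceiver and analyzers have already filled in the per-frame queries, could look as follows; the attribute names and threshold values are illustrative, not the paper's API.

```python
# Illustrative thresholds; real values would be tuned for the capture setup.
KNEE_THRESHOLD = 0.05
HIP_THRESHOLD = 0.05
IDLE_TIMEOUT = 5.0

def decide_shot(frame):
    """Priority cascade of the procedure above (simplified sketch)."""
    if frame.facial_motion:                          # 1. face has top priority
        return "close_up", "jump"                    #    jump in, return slowly
    if frame.position_changed:                       # 2. stepping or walking
        return "long_distant_shot", "gradual"        #    long shot, with tracking
    if frame.knee_displacement > KNEE_THRESHOLD:     # 3. lower-body movement
        return "full_shot", "gradual"
    if frame.hip_displacement > HIP_THRESHOLD:       # 4. hip movement
        return "medium_shot", "gradual"
    if frame.upper_body_motion:                      # 5. arm / hand gestures
        return "close_shot", "gradual"               #    may trigger recognition
    if frame.idle_time > IDLE_TIMEOUT:               # 6. default
        return "master_shot", "gradual"
    return None                                      # keep the current shot
```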
3) Real-time Action Spotter
There have been only a few works on full-body action recognition [12, 13]. They attempted simple feature-based recognition, which becomes tedious to program as the vocabulary size and motion complexity increase. Moreover, in those earlier works the whole action set had to be traversed at every frame to find a suitable interpretation.
In contrast, we use a multi-level spotting mechanism keyed to the complexity of the perception input, combined with region reduction. Not all analyzers are active in every frame: the more complex the analyzer, the lower its frequency of application. Also, only the actions selected according to the context are traversed in each frame.
Our action spotter consists of a feature-based simple analyzer, a neural-network-based posture recognizer, and a hidden-Markov-model-based stochastic analyzer. They process levels 0, 1, and 2, respectively.
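One plausible way to schedule the three analyzers so that the more complex ones run less often, as described above, is sketched below; the frame-skip ratios and the analyzer call signature are assumptions.

```python
class MultiLevelSpotter:
    """Runs the level-0 feature analyzer every frame, and the costlier level-1
    (posture) and level-2 (HMM) analyzers less frequently, only on the
    reduced region and only over context-selected candidate actions."""
    def __init__(self, feature_analyzer, posture_recognizer, hmm_spotter):
        self.analyzers = {0: feature_analyzer, 1: posture_recognizer, 2: hmm_spotter}
        self.skip = {0: 1, 1: 4, 2: 8}        # run level k once every skip[k] frames
        self.frame = 0

    def spot(self, level, region_data, candidate_actions):
        self.frame += 1
        if self.frame % self.skip[level] != 0:
            return None                        # this analyzer is not scheduled now
        return self.analyzers[level](region_data, candidate_actions)
```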
5. Experiment: Digital Improvisation
LIG developed the Virtual Human Director (VHD) system, which provides an interactive tool for animation control. Based on this system, a live virtual stage application was developed. Using this application, a real human can participate as an actor in the virtual theater and improvise the play; alternatively, the actors can be animated through interactive control. Our automatic camera control strategy added dramatic effect to the improvisation. Without any knowledge of the scenario, the intelligent camera generates dynamic scenes. Even though no cinematic knowledge is declared, the resulting shots coincide closely with well-known cinematic rules such as the so-called 'over-the-shoulder shot'. Movie clips 1 to 8 show the shots applied to an example theater play.
Movie clip 1 shows one shot of the play in which the camera motion is zooming in on the actors. Initially the camera was in a long-distance shot to visualize the whole stage, but after some duration without actor movement, the camera automatically zooms in on the actors using their bounding sphere and the cutting height of the master_shot. Since the master shot is the default shot, the viewing angle is the default one, looking straight at the stage from the spectators' side.
Movie clip 2 shows what happens when an actor begins to walk. At the beginning of the walking motion, the camera draws back to a long-distance shot because there is a change in the actor's location. The HMM-based continuous action spotter is initiated, but it cannot recognize the stepping motion as walking until the actor has stepped at least once with both legs. After two steps, the action spotter recognizes the walking motion and lets the camera focus on and track the moving actor.
In movie clip 3, several camera shots can be seen: the master shot, close_shot, and close_up. Since there was no movement for some duration, the camera returns to the master shot, which presents both actors to the spectators. Then the actress begins to weep, and the camera slowly zooms in on her upper body to visualize the weeping motion. In this shot, the action spotter attempts to recognize the arm motion but fails to spot any gesture. Since the actor then generates an emotional expression on his face, the camera jumps to visualize his face more closely; otherwise, this kind of facial emotion would hardly be viewable.
Movie clips 4 and 5 show another close_up shot and the return to the master shot, respectively.
In movie clip 6, both actors act at the same time. The bounding sphere therefore surrounds the moving parts of both actors, and the resulting shot visualizes the upper bodies of both. Note also that the camera rotates to an angle from which it can visualize both actors' actions; in this case, the camera's rotation angle does not exceed the spectator's viewable angle.
Readers may wonder what happens when both actors face the back wall of the stage. In that case, the camera does not rotate to visualize both actors' faces. In a `stage' application such as a theater play or a TV show, it is a basic rule of acting that the actors do not show their backs to the spectators unless it is intended. Therefore, the camera does not attempt to rotate to visualize the front of the body when the actors face the back of the stage.
Movie clip 7 shows the medium shot used to visualize a dancing motion. The actress dances lightly, and the camera turns to focus on the active actor. Finally, movie clip 8 shows what happens when the actress shows an emotional expression while walking. The walking motion is visualized with tracking in a long-distance shot; the camera quickly closes up on the face when the facial expression appears and slowly returns to the previous tracking mode. All of these camera controls occurred automatically with only minimal input of rules, namely comfortable cutting heights and the spectator's constraints. The camera control module basically does not need to know whether the current shot is a full shot or a close_up, because this is determined automatically using the bounding sphere of the active actors.
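As an illustration of such bounding-sphere framing (a geometric sketch only, not the system's code), the camera can be backed away from the sphere enclosing the active actors until it fits the field of view; the spectator's-constraint clamping is omitted, and all parameter values are illustrative.

```python
import numpy as np

def place_camera(centers, radii, fov_deg=45.0, margin=1.2):
    """Frame the active region(s) by backing the camera away from the
    bounding sphere that encloses all per-actor spheres."""
    centers = np.asarray(centers, dtype=float)
    center = centers.mean(axis=0)
    # Radius of a sphere enclosing every actor's own bounding sphere.
    radius = max(np.linalg.norm(c - center) + r for c, r in zip(centers, radii))
    distance = margin * radius / np.tan(np.radians(fov_deg) / 2.0)
    eye = center + np.array([0.0, 0.0, distance])   # default: view from the spectators' side
    return eye, center                              # camera position and look-at point

# Example: two actors standing a metre apart
eye, look_at = place_camera([[0.0, 1.0, 0.0], [1.0, 1.0, 0.0]], [0.9, 0.9])
```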
6. Conclusion and Future Work
This paper described a reactive camera control strategy based on real-time action spotting, whose flexibility supports live virtual stage applications. To meet the real-time requirement, we employed a region-granularity reduction mechanism and applied minimal analysis to the motion data. We do not always attempt to analyze the whole-body action in detail; that is, within the reduced region, the complex action spotting module is run only when necessary. Once the shot type has been determined by spotting the active region and the action type, the camera is easily placed by considering the spectator's constraint and the extension line of the bounding sphere surrounding the region. As a demonstration, we developed the CAIAS system running with a live stage performance application using motion capture. Our method produced interesting camera work in improvised situations. Even though cinematic rules based on the geometry between actors are not declared in any form in our system, the typical scenes exemplified in earlier work are visualized in close correspondence with cinematic guidelines.
References
[1] Selim Balcisoy et al. An Interactive Interface for Directing Virtual Humans. In Proceedings of ISCIS '98, IOS Press, 1998.
[2] Steven M. Drucker et al. CINEMA: A System for Procedural Camera Movements. In David Zeltzer, editor, Computer Graphics (1992 Symposium on Interactive 3D Graphics), volume 25, pages 67-70, March 1992.
[3] Steven M. Drucker and David Zeltzer. CamDroid: A System for Implementing Intelligent Camera Control. In Michael Zyda, editor, Computer Graphics (1995 Symposium on Interactive 3D Graphics), volume 28, pages 139-144, April 1995.
[4] David B. Christianson et al. Declarative Camera Control for Automatic Cinematography. In Proceedings of AAAI-96, August 1996.
[5] William H. Bares and James C. Lester. Cinematic User Models for Automated Realtime Camera Control in Dynamic 3D Environments. In Anthony Jameson et al., editors, User Modeling: Proceedings of the Sixth International Conference, UM97. Springer Wien New York, Vienna/New York, 1997.
[6] Li-wei He et al. The Virtual Cinematographer: A Paradigm for Automatic Real-Time Camera Control and Directing. In Computer Graphics (SIGGRAPH '96 Proceedings), pages 217-224, August 1996.
[7] Jock D. Mackinlay et al. Rapid Controlled Movement Through a Virtual 3D Workspace. Computer Graphics (SIGGRAPH '90 Proceedings), volume 24, number 4, pages 171-176, August 1990.
[8] Jim Blinn. Where Am I? What Am I Looking At? IEEE Computer Graphics and Applications, pages 76-81, 1988.
[9] Peter Karp and Steven Feiner. Issues in the Automated Generation of Animated Presentations. In Proceedings of Graphics Interface '90, pages 39-48, May 1990.
[10] Peter Karp and Steven Feiner. Automated Presentation Planning of Animation Using Task Decomposition with Heuristic Reasoning. In Proceedings of Graphics Interface '93, pages 118-127, Toronto, Ontario, Canada, May 1993. Canadian Information Processing Society.
[11] Daniel Arijon. Grammar of the Film Language. Communication Arts Books, Hastings House, New York, 1976.
[12] Pattie Maes et al. The ALIVE System: Full-Body Interaction with Autonomous Agents. In Proceedings of the Computer Animation '95 Conference, Geneva, Switzerland, IEEE Press, April 1995.
[13] Luc Emering et al. Interacting with Virtual Humans through Body Actions. IEEE Computer Graphics and Applications, volume 18, number 1, pages 8-11, 1998.
[14] Yanghee Nam and Kwangyun Wohn. Recognition of Hand Gestures with 3D, Nonlinear Arm Movements. Pattern Recognition Letters, volume 18, number 1, pages 105-113, 1997.