Huawei/3DLife ACM Multimedia Grand Challenge for 2013

3D human reconstruction and action recognition from multiple active and passive sensors

This challenge calls for demonstrations of methods and technologies that support 3D reconstruction of moving humans from multiple calibrated and remotely located RGB cameras and/or consumer depth cameras. Real-time or near-real-time solutions are also sought. The challenge additionally calls for methods for human gesture/movement recognition from multimodal data. It targets applications such as collaborative immersive environments and inter-personal communications over the Internet or other dedicated networking environments.

To this end, we provide two data sets that support the investigation of techniques in the fields of 3D signal processing, computer graphics, and pattern recognition, and that enable demonstrations of relevant technical achievements.

Consider multiple distant users, each captured by their own visual capture equipment, ranging from a single Kinect (simple users) to multiple Kinects and/or high-definition cameras (advanced users), as well as by non-visual sensors such as Wearable Inertial Measurement Units (WIMUs) and multiple microphones. The captured data is either processed at the capture site to produce 3D reconstructions of the users or directly coded and transmitted, enabling the rendering of multiple users in a shared environment, where they can "meet" and "interact" with each other or with the virtual environment via a set of gestures/movements.

Of course, we do not expect participants in this challenge to recreate this scenario completely, but rather to work with the provided data sets to illustrate key technical components that would be required to realize such a scenario. The challenges that may be addressed include, but are not limited to:

  • Realistic (on-the-fly) 3D reconstruction of humans, in the form of polygonal meshes (and/or point clouds), based on noisy source data from calibrated (geometrically and photometrically) cameras.
  • Fast and efficient compression/coding methods for dynamic time-varying meshes or multi-view RGB+depth video that will enable the real-time transmission of data over current and future network infrastructures.
  • Realistic free-viewpoint rendering of humans, either from full-geometry 3D reconstructions via standard computer graphics, or via view interpolation from the original multiple RGB(+depth) views.
  • Fast and accurate motion tracking of humans (e.g. in the form of skeleton tracking) from the multiple provided data streams.
  • Efficient recognition of human gestures/movements from multimodal data, including RGB and/or depth video, WIMU data and audio.

Dataset

20th March 2013: The data is now ready and available for download to REGISTERED users.


  • Dataset 1 was captured at CERTH/ITI in Greece and consists of synchronized RGB-plus-depth video streams of multiple humans performing multiple actions, captured by five Kinects, together with the corresponding Kinect audio streams and WIMU streams.

  • Dataset 2 was captured at Fraunhofer HHI in Berlin and consists of synchronized multi-view HD video streams of multiple humans performing multiple actions.

Dataset 1

Two capture sessions, with different spatial arrangements of the sensors, were carried out at CERTH. 17 human subjects were captured, all performing at least 5 instances of 22 different types of gestures/movements. This means that the dataset consists of approximately 2 x 17 x 22 x 5 = 3,740 captured gestures.

The performed actions can be classified into i) simple actions that mainly involve the upper body, ii) training exercises, iii) sports-related activities, and iv) static gestures. They are summarized in the tables below.

Before each gesture/movement (action) is performed, the name of the action is announced loudly, so that it is captured in the audio streams. Then, for each action instance, the user counts the instance number aloud.


Table 1. Simple actions

  • Hand waving
  • Knocking on the door
  • Clapping
  • Throwing
  • Punching
  • Push away with both hands


Table 2. Training exercises

  • Jumping Jacks
  • Lunges
  • Squats
  • Punching and then kicking
  • Weight lifting




Table 3. Sports activities

  • Golf drive
  • Golf chip
  • Golf putt
  • Tennis forehand
  • Tennis backhand
  • Walking on the treadmill


Table 4. Static gestures

  • Arms folded
  • T-Pose
  • Hand on the hips
  • T-Pose with bent arms
  • Forward arms raise


Session #1

In the first capture session, the data includes the following synchronized streams:

  • Multiple RGB-plus-Depth video streams from 5 Kinect sensors, at different viewpoints covering the whole body.

  • Audio streams captured by the 5 Kinect sensors.

  • Inertial sensor data captured from multiple sensors on the human subject’s body.

  • Kinect cameras’ intrinsic and extrinsic calibration data.

  • Annotation data for the segmentation of each stream into the performed gesture instances.

The capturing setup/spatial arrangement is shown in the next Figure. Care was taken to minimize interference between the multiple Kinects.


Session #2

In the second capture session, the data includes the following synchronized streams:

  • RGB-plus-Depth video streams from 2 Kinect sensors, at different viewpoints covering the front of the body.

  • Audio streams captured by the 2 Kinect sensors.

  • Inertial sensor data captured from multiple sensors on the human subject’s body.

  • Kinect cameras’ intrinsic and extrinsic calibration data.

  • Annotation data for the segmentation of each stream into the performed gesture instances.


In session #1, multiple Kinects were used to enable full-body reconstruction of humans. However, interference between the multiple Kinects degrades the quality of the captured depth maps, which makes standard depth-based skeleton tracking algorithms (e.g. the standard OpenNI skeleton tracking module) less accurate. Additionally, in session #1 only one Kinect was placed horizontally, and standard skeleton tracking modules (such as the OpenNI one) work only with depth videos captured by horizontally placed Kinects. Finally, the horizontal front-facing Kinect could capture the human body only above the knees.

For these reasons, in session #2 two horizontal Kinects were used and the user was placed at a large distance (~3 m) from the sensors. The dataset of session #2 is therefore more appropriate for motion-capture-based gesture/movement recognition, but less appropriate for 3D reconstruction tasks, due to the large distance between the Kinects and the captured user.

The capturing setup/spatial arrangement in session #2 is shown in the next Figure.





Exactly the same set of actions was captured in session #2 as in session #1, except for the static gestures (see Table 4), since these are less meaningful for an action recognition task.

In total, 14 human subjects were captured during session #2: 5 of these subjects were captured with Kinect and WIMU sensors, whilst the remaining 9 subjects were captured with Kinect data only.

Data formats

The formats of the different data streams are given in the following Table; a short loading sketch follows it.


Sensor data             Codec                                   Parameters

Kinect RGB-D video      OpenNI (.ONI);                          RGB: 640x480; Depth: 640x480;
                        RGB: MJPEG; Depth: uncompressed         Frame rate: ~25 Hz (timestamps included in the ONI file)

Kinect audio signals    PCM WAV; mono; 16-bit; little-endian    Sampling rate: 22.05 kHz

WIMU signals            ASCII                                   Sampling rate: 256 Hz
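
As an illustration of these formats, the sketch below loads one Kinect audio track with Python's standard wave module and one WIMU log with NumPy. The file names and the column layout of the WIMU ASCII files are assumptions made for illustration only and should be checked against the documentation shipped with the dataset.

    import wave
    import numpy as np

    # Read one Kinect audio track: PCM WAV, mono, 16-bit, little-endian, 22.05 kHz.
    # The file name is hypothetical.
    with wave.open("kinect0_audio.wav", "rb") as wav:
        assert wav.getnchannels() == 1 and wav.getsampwidth() == 2
        audio_rate = wav.getframerate()  # expected: 22050 Hz
        audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype="<i2")

    # Read one WIMU log: plain ASCII sampled at 256 Hz. The column layout
    # (3x acceleration, 3x angular rate, 3x magnetic flux) is assumed here.
    wimu = np.loadtxt("wimu_left_wrist.txt")
    acc = wimu[:, 0:3]    # 3D acceleration (assumed columns)
    gyro = wimu[:, 3:6]   # 3D angular rate (assumed columns)
    mag = wimu[:, 6:9]    # 3D magnetic flux (assumed columns)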


Synchronization

The multiple RGB-D video and audio signals were captured by a common application process, and care was taken to synchronize them. As a result, they are synchronized to within a few milliseconds (<15 ms between the multiple Kinect RGB-D streams; <30 ms between audio streams, or between audio and RGB-D). Additionally, for each stream, the system time corresponding to the stream's beginning is provided. This enables automatic compensation of these small phase shifts between streams.
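
A minimal sketch of this compensation is given below; the per-stream start times are hypothetical placeholder values standing in for the system times provided with the dataset.

    # Compensate the small phase shifts between the Kinect RGB-D and audio streams
    # using each stream's provided system start time (values below are placeholders).
    start_times_ms = {
        "kinect0_rgbd": 105_032.0,
        "kinect1_rgbd": 105_041.5,
        "kinect0_audio": 105_020.0,
    }

    # Cut every stream so that they all begin at the latest common start instant.
    t0 = max(start_times_ms.values())
    skip_ms = {name: t0 - t for name, t in start_times_ms.items()}
    # e.g. drop skip_ms["kinect0_audio"] * 22.05 samples from the audio stream, and
    # skip RGB-D frames whose ONI timestamps fall before t0.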

However, WIMU data could not be synchronized a priori with the other modalities (audio and RGB-D). To minimize this inconvenience and to ensure post-synchronization, all users were instructed to execute a "clap procedure" before starting their performance. Hence, the start time of each data stream can be synchronized (either manually or automatically) by aligning the clap signatures that are clearly visible within a short time window from the beginning of every data stream (see, for instance, a snapshot of the clap signature in the audio streams below).





Wearable Inertial Measurement Units (WIMUs):

Each human subject was fitted with 8 WIMUs, attached to the following locations on the body: left wrist, right wrist, chest, hips, left ankle, right ankle, left foot and right foot. These devices capture 3D acceleration (using accelerometers), 3D magnetic flux (using magnetometers) and 3D angular rate (using gyroscopes). Although these devices are capable of transmitting data wirelessly, data was instead logged to an internal SD card on each device to ensure a high frame rate of 256 Hz. As mentioned in the previous section, synchronization between these devices and the other modalities is achieved via the "clap procedure". When both hands are clapped together, a large acceleration spike is clearly visible in the accelerometer streams of the sensors attached to the wrists, and this spike can be aligned either manually or automatically with the audio signal.
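
A minimal sketch of automatic clap-based alignment is shown below. It simply locates the first prominent spike in the wrist accelerometer magnitude and in the audio signal (both assumed to be loaded as in the earlier snippet) and uses the difference of the two spike times as the WIMU-to-audio offset; the threshold is an illustrative assumption.

    import numpy as np

    def first_spike_time(signal, rate_hz, threshold_ratio=0.6):
        """Return the time (s) of the first sample whose magnitude exceeds
        threshold_ratio * max(|signal|) -- a simple stand-in for clap detection."""
        mag = np.abs(signal.astype(float))
        idx = int(np.argmax(mag > threshold_ratio * mag.max()))
        return idx / rate_hz

    # `audio` (22.05 kHz) and `acc` (left-wrist accelerometer, 256 Hz) as loaded above.
    acc_mag = np.linalg.norm(acc, axis=1)
    t_clap_audio = first_spike_time(audio, 22050.0)
    t_clap_wimu = first_spike_time(acc_mag, 256.0)

    # Offset to add to WIMU timestamps so that the two clap instants coincide.
    wimu_to_audio_offset_s = t_clap_audio - t_clap_wimu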

Data segmentation – Annotation

During the capture sessions, each subject was recorded performing all the specified actions in one long sequence. To assist with segmentation for gesture/activity classification tasks, each sequence has been manually annotated: a file containing the action number, action name, start time and end time is provided for each human subject in the dataset.
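
For example, segmentation of a WIMU stream using such an annotation file could look like the sketch below. It assumes the annotation file is a comma-separated text file with one line per action instance and times in seconds; the actual delimiter and time units should be checked against the provided files, and the file name is hypothetical.

    import csv

    def load_annotations(path):
        """Parse an annotation file into (action_number, action_name, start_s, end_s) tuples."""
        segments = []
        with open(path, newline="") as f:
            for row in csv.reader(f):
                number, name, start, end = row[:4]
                segments.append((int(number), name.strip(), float(start), float(end)))
        return segments

    def cut_wimu_segment(wimu, start_s, end_s, rate_hz=256.0):
        """Return the WIMU samples belonging to one annotated gesture instance."""
        return wimu[int(start_s * rate_hz):int(end_s * rate_hz)]

    # Hypothetical usage:
    # for number, name, start_s, end_s in load_annotations("subject01_annotations.csv"):
    #     samples = cut_wimu_segment(wimu, start_s, end_s)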


Dataset 2

The dataset recorded at Fraunhofer HHI consists of 7 individuals performing a set of 26 different body movements in front of a green-screen background; see Table 1 below for the full list of sequences. Each sequence has been recorded with 6 synchronized video cameras at 2032x2032 px resolution and 25 fps. The complete dataset, consisting of 1092 clips of varying durations between 3 and 15 seconds, is provided as high-quality H.264/AVC coded MP4 video files [1], along with extrinsic and intrinsic calibration parameters for all cameras (see Table 2 below for calibration details). Note that the radial lens distortion has already been removed from the video frames, in order to simplify further processing.


Figure 1: Stage setup and camera naming conventions.


Table 1: The 26 sequences recorded for 7 individuals

  1. Arms stretched out forward
  2. Arms crossed on chest
  3. Hands on waist
  4. Arms open
  5. Arms up
  6. Clapping
  7. Crossing the arms
  8. Crossing the legs
  9. Eye rubbing
  10. Eye movement and facial expressions
  11. Fingers motion
  12. Fists clenched
  13. Head scratching
  14. Head shaking
  15. Head tilting
  16. Kicking and punching
  17. Knocking the door
  18. Lifting an object
  19. Nodding
  20. Punching / Boxing
  21. Push away with both hands
  22. Bending the knees and standing up
  23. Throwing
  24. Touching one’s face
  25. Walking on a treadmill
  26. Waving one hand


Camera Calibration Data


Calibration data for the extrinsic and intrinsic camera parameters is provided in Table 2: the camera locations and rotations relative to the common world coordinate system are given as a 3D translation vector (in meters) and a 3D rotation matrix, along with the scaled focal length (in pixel units) for each camera. Note that radial lens distortion has already been compensated in the images.


Table 2: Camera calibration data, with translation and rotation values relative to the world coordinate system.

                                   Cam0       Cam1       Cam2       Cam3       Cam4       Cam5

Scaled focal length f [pixel]   2946.92    2930.92    2928.57    2917.81    2787.50    2765.11

Camera position [m]
  x                             -1.6283    -0.9991    -0.4533     1.1989     1.6005     2.1674
  y                             -0.0876    -0.0927    -0.1224    -0.0262     0.1268     0.1335
  z                              3.4741     3.7171     3.7542     3.4931     3.1638     2.8693

Camera orientation (rotation matrix elements in column-major order)
                                 0.9023     0.9577     0.9898     0.9540     0.9231     0.8383
                                -0.0183    -0.0412     0.0121     0.0054     0.0346     0.0585
                                 0.4306     0.2848     0.1419    -0.2996    -0.3830    -0.5420
                                 0.0788     0.0831     0.0085    -0.0571    -0.1107    -0.1573
                                 0.9893     0.9871     0.9896     0.9848     0.9777     0.9779
                                -0.1230    -0.1368    -0.1435    -0.1639    -0.1785    -0.1377
                                -0.4238    -0.2755    -0.1422     0.2942     0.3682     0.5220
                                 0.1449     0.1547     0.1432     0.1735     0.2072     0.2007
                                 0.8941     0.9488     0.9794     0.9399     0.9063     0.8290


Using the calibration data above, the following formulas illustrate how to project a 3D point from the camera-local coordinate system onto the camera's 2D image plane:

    x = W/2 + f * X / (-Z)
    y = H/2 - f * Y / (-Z)

with:

    W, H       the image width and height (in pixel units),
    (x, y)     the 2D pixel indices (0…W-1, 0…H-1), with the image origin in the upper left corner,
    (X, Y, Z)  the 3D object point in the right-handed camera coordinate system, with the X-axis pointing right, the Y-axis pointing up, and the Z-axis pointing into the negative viewing direction.
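
As an illustration, the sketch below applies these formulas to project a world-space point into Cam0's image. It assumes, as the text above suggests, that the position and rotation in Table 2 describe each camera's pose in the world frame (camera-to-world), so a world point is first mapped into camera-local coordinates before projection; the example point is hypothetical.

    import numpy as np

    W = H = 2032  # image width and height in pixels

    # Cam0 calibration, copied from Table 2 (rotation elements listed in column-major order).
    f_cam0 = 2946.92
    C_cam0 = np.array([-1.6283, -0.0876, 3.4741])  # camera position [m]
    R_cam0 = np.array([ 0.9023, -0.0183,  0.4306,
                        0.0788,  0.9893, -0.1230,
                       -0.4238,  0.1449,  0.8941]).reshape(3, 3, order="F")

    def project(point_world, R, C, f):
        """Project a 3D world point [m] to 2D pixel coordinates (x, y) of one camera."""
        # World -> camera-local coordinates (assumes R, C are the camera-to-world pose).
        X, Y, Z = R.T @ (np.asarray(point_world) - C)
        # Pinhole projection from the formulas above; the camera looks along -Z.
        x = W / 2 + f * X / (-Z)
        y = H / 2 - f * Y / (-Z)
        return x, y

    # Hypothetical example: project the world origin into Cam0's image.
    print(project([0.0, 0.0, 0.0], R_cam0, C_cam0, f_cam0))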


[1] An uncompressed version of the dataset is available as PNG image sequences upon request.