Random Forests for Real Time Head Pose Estimation

Fast and reliable algorithms for estimating the head pose are essential for many applications and higher-level face analysis tasks. We address the problem of head pose estimation from depth data, which can be captured using the ever more affordable 3D sensing technologies available today.

To achieve robustness, we formulate pose estimation as a regression problem. Because detecting specific face parts such as the nose is sensitive to occlusions, we instead learn the regression from rather generic face surface patches. We propose to use random regression forests for the task at hand, given their capability to handle large training datasets.
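As a rough sketch of the patch-based idea (not the released implementation), each patch that traverses the forest casts a vote for the 3D head-center position, and the votes are then aggregated. The snippet below illustrates one simple, assumed aggregation scheme that iteratively averages the votes around their densest region; the actual system uses its own clustering of the leaf votes:

```python
import numpy as np

def aggregate_votes(votes, radius=0.1):
    """Aggregate 3D head-center votes: keep the votes close to the
    dominant cluster and return their mean. A crude stand-in for the
    mean-shift-style clustering typical of Hough-voting pipelines."""
    votes = np.asarray(votes, dtype=float)
    # the per-axis median gives a robust initial guess for the center
    center = np.median(votes, axis=0)
    for _ in range(10):  # a few fixed-point refinement steps
        close = votes[np.linalg.norm(votes - center, axis=1) < radius]
        if len(close) == 0:
            break
        center = close.mean(axis=0)
    return center
```

Because the mean is computed only over votes within `radius` of the current estimate, isolated outlier votes (e.g., from background patches) are discarded rather than dragging the estimate away.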

This page presents our research on head pose estimation, makes the source code available, and offers an annotated database for download so that other methods tackling the same problem can be evaluated.

Real time head pose estimation from high-quality depth data

In our CVPR paper Real Time Head Pose Estimation with Random Regression Forests, we trained a random regression forest on a very large, synthetically generated face database. Our experiments show that the approach handles real data exhibiting large pose changes, partial occlusions, and facial expressions, even though it is trained only on synthetic, neutral face data. We thoroughly evaluated the system on a publicly available database, achieving state-of-the-art performance without resorting to the graphics card. The video shows the algorithm running in real time on a frame-by-frame basis (no temporal smoothing), using as input high-resolution depth images acquired with the range scanner of Weise et al.



Real time head pose estimation from low-quality depth data

In our DAGM paper Real Time Head Pose Estimation from Consumer Depth Cameras, we present a system for estimating the location and orientation of a person's head from depth data acquired by a low-quality device. Our approach is based on discriminative random regression forests: ensembles of random trees trained by splitting each node so as to simultaneously reduce the entropy of the class label distribution and the variance of the head position and orientation. In other words, the forest first discriminates which parts of the image belong to a head, and only those patches cast votes for the final estimate. We evaluate three different approaches to jointly taking classification and regression performance into account during training. For evaluation, we acquired a new dataset using a Kinect sensor and automatically annotated it using the technology provided by faceshift.
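The joint split criterion can be sketched as follows. This is a simplified illustration only: the `alpha` trade-off weight and the exact purity and variance measures below are assumptions for the sketch, not the measures evaluated in the paper. A candidate binary test partitions the patches into two children, and its quality combines class purity (head vs. non-head) with the pose variance of the head patches:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a binary head/non-head label set."""
    labels = np.asarray(labels)
    p = labels.mean() if len(labels) else 0.0
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def split_score(left, right, alpha=0.5):
    """Score a candidate split (lower is better). Each child is a dict
    with 'label' (1 = head patch) and 'pose' (pose vector per patch).
    Combines the size-weighted class entropy of both children with the
    pose variance of the positive patches, traded off by `alpha`."""
    n = len(left['label']) + len(right['label'])
    class_term = sum(len(s['label']) / n * entropy(s['label'])
                     for s in (left, right))
    pose_term = 0.0
    for s in (left, right):
        pos = np.asarray(s['pose'])[np.asarray(s['label']) == 1]
        if len(pos) > 1:
            pose_term += len(pos) / n * pos.var(axis=0).sum()
    return alpha * class_term + (1 - alpha) * pose_term
```

During training, many random binary tests are sampled at each node and the one minimizing such a score is kept, so the tree becomes purer in its head/non-head decision while the pose votes stored at the leaves become more concentrated.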



CODE


The discriminative random regression forest code used for the DAGM'11 paper is made available for research purposes. Together with the basic head pose estimation code, a demo is provided to run the estimation directly on the stream of depth images coming from a Kinect, using OpenNI. A sample forest is provided which was trained on the Biwi Kinect Head Pose Database.

Because the example forest is trained on Kinect data, it should not be used for comparison with methods using high-resolution scans. Also, the code performs estimation on a frame-by-frame basis rather than tracking; keep this in mind when making comparisons.

Because the software is an adaptation of the Hough forest code, the same license applies:

By installing, copying, or otherwise using this Software, you agree to be bound by the terms of the Microsoft Research Shared Source License Agreement (non-commercial use only). If you do not agree, do not install, copy, or use the Software. The Software is protected by copyright and other intellectual property laws and is licensed, not sold.

THE SOFTWARE COMES "AS IS", WITH NO WARRANTIES. THIS MEANS NO EXPRESS, IMPLIED OR STATUTORY WARRANTY, INCLUDING WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, ANY WARRANTY AGAINST INTERFERENCE WITH YOUR ENJOYMENT OF THE SOFTWARE OR ANY WARRANTY OF TITLE OR NON-INFRINGEMENT. THERE IS NO WARRANTY THAT THIS SOFTWARE WILL FULFILL ANY OF YOUR PARTICULAR PURPOSES OR NEEDS. ALSO, YOU MUST PASS THIS DISCLAIMER ON WHENEVER YOU DISTRIBUTE THE SOFTWARE OR DERIVATIVE WORKS.

NEITHER MICROSOFT NOR ANY CONTRIBUTOR TO THE SOFTWARE WILL BE LIABLE FOR ANY DAMAGES RELATED TO THE SOFTWARE OR THIS MSR-SSLA, INCLUDING DIRECT, INDIRECT, SPECIAL, CONSEQUENTIAL OR INCIDENTAL DAMAGES, TO THE MAXIMUM EXTENT THE LAW PERMITS, NO MATTER WHAT LEGAL THEORY IT IS BASED ON. ALSO, YOU MUST PASS THIS LIMITATION OF LIABILITY ON WHENEVER YOU DISTRIBUTE THE SOFTWARE OR DERIVATIVE WORKS.

DOWNLOAD

If you do use the code, please acknowledge our papers:


Random Forests for Real Time 3D Face Analysis

@article{fanelli_IJCV,
  author = {Fanelli, Gabriele and Dantone, Matthias and Gall, Juergen and Fossati, Andrea and Van Gool, Luc},
  title = {Random Forests for Real Time 3D Face Analysis},
  journal = {Int. J. Comput. Vision},
  year = {2013},
  month = {February},
  volume = {101},
  number = {3},
  pages = {437--458}
}

Real Time Head Pose Estimation with Random Regression Forests

@InProceedings{fanelli_CVPR11,
  author = {Fanelli, Gabriele and Gall, Juergen and Van Gool, Luc},
  title = {Real Time Head Pose Estimation with Random Regression Forests},
  booktitle = {Computer Vision and Pattern Recognition (CVPR)},
  year = {2011},
  month = {June},
  pages = {617--624}
}

Real Time Head Pose Estimation from Consumer Depth Cameras

@InProceedings{fanelli_DAGM11,
  author = {Fanelli, Gabriele and Weise, Thibaut and Gall, Juergen and Van Gool, Luc},
  title = {Real Time Head Pose Estimation from Consumer Depth Cameras},
  booktitle = {33rd Annual Symposium of the German Association for Pattern Recognition (DAGM'11)},
  year = {2011},
  month = {September}
}

If you have questions concerning the source code, please contact Gabriele Fanelli.



Biwi Kinect Head Pose Database

Because cheap consumer devices (e.g., the Kinect) acquire low-resolution, noisy depth data, we could not train our algorithm on clean, synthetic images as was done in our previous CVPR work. Instead, we recorded several people sitting in front of a Kinect (at about one meter distance). The subjects were asked to freely turn their head around, trying to span all yaw/pitch angles they could perform.

To be able to evaluate our real-time head pose estimation system, the sequences were annotated using the automatic system of www.faceshift.com, i.e., each frame is annotated with the center of the head in 3D and the head rotation angles.

The dataset contains over 15K images of 20 people (6 females and 14 males; 4 people were recorded twice). For each frame, a depth image, the corresponding RGB image (both 640x480 pixels), and the annotation are provided. The head pose range covers about ±75 degrees of yaw and ±60 degrees of pitch. Ground truth is provided in the form of the 3D location of the head and its rotation.

Even though our algorithms work on depth images alone, we provide the RGB images as well. Please note that this is a database acquired with frame-by-frame estimation in mind, not tracking. For this reason, some frames are missing.
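For readers who want a feel for the annotations before downloading the sample code, the sketch below parses a per-frame pose annotation, assuming a plain-text layout of three rotation-matrix rows followed by the 3D head center (the readme file documents the actual format), and derives yaw/pitch/roll angles from the rotation matrix under one common Euler convention; the dataset may define its angles differently:

```python
import numpy as np

def parse_pose(text):
    """Parse a pose annotation: the first three non-empty lines are
    assumed to be the 3x3 rotation matrix rows, the next one the 3D
    head center. Check the dataset readme for the exact format."""
    rows = [list(map(float, ln.split()))
            for ln in text.splitlines() if ln.strip()]
    R = np.array(rows[:3])
    center = np.array(rows[3])
    return R, center

def yaw_pitch_roll(R):
    """Euler angles (degrees) from a rotation matrix using the common
    ZYX convention; treat this as an illustration only."""
    yaw = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    pitch = np.degrees(np.arcsin(-R[2, 0]))
    roll = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    return yaw, pitch, roll
```

For example, an identity rotation matrix corresponds to a frontal head, and all three angles come out as zero under this convention.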

The database is made available for research purposes only. You are required to cite our work whenever publishing anything directly or indirectly using the data:

@article{fanelli_IJCV,
  author = {Fanelli, Gabriele and Dantone, Matthias and Gall, Juergen and Fossati, Andrea and Van Gool, Luc},
  title = {Random Forests for Real Time 3D Face Analysis},
  journal = {Int. J. Comput. Vision},
  year = {2013},
  month = {February},
  volume = {101},
  number = {3},
  pages = {437--458}
}

Files:

Data (5.6 GB, .tgz compressed)

Binary ground truth files

Masks used to select positive patches

Readme file

Sample code for reading depth images and ground truth

If you have questions concerning the data, please contact Gabriele Fanelli.