/*********************************************************************/
/*                                                                   */
/* FILE         README.txt                                           */
/* AUTHORS      Michael D. Breitenstein, Daniel Kuettel              */
/* EMAIL        breitenstein@vision.ee.ethz.ch                       */
/* COPYRIGHT    ETH Zurich, Michael D. Breitenstein                  */
/* CONTENT      Description of data format and ground truth.         */
/*              Code Snippets for reading data and annotation.       */
/*              Reference implementations can be found in the        */
/*              classes GroundTruth and InputCloud.                  */
/*                                                                   */
/*              When using the data and / or code, please cite our   */
/*              CVPR'08 paper:                                       */
/*                                                                   */
/*              M. D. Breitenstein and D. Kuettel and T. Weise and   */
/*              L. Van Gool and H. Pfister                           */
/*              Real-Time Face Pose Estimation                       */
/*              from Single Range Images                             */
/*              IEEE Conference on Computer Vision and               */
/*              Pattern Recognition (CVPR'08)                        */
/*              Anchorage, June 2008                                 */
/*                                                                   */
/* LAST CHANGE  April 16 2007                                        */
/*                                                                   */
/*********************************************************************/


1) Specifications of the data format:

Resolution gives the resolution in x and y. The input range images were
captured at a resolution of 640 columns and 480 rows, so the data set
always contains this resolution. The point cloud has a line-wise layout,
starting at the top left. Each pixel contains x, y, and z coordinates.
Pixels with z < 0 are background pixels and should be ignored (they are
usually too far away for the range camera).

The frame_XXXXX.groundTruth files contain the ground truth for the
corresponding frame. The frame_XXXXX.input files contain the raw input
data (the range image).


2) Code Snippet for reading in the ground truth:

(Nose position and forward direction, both already in the same
coordinate system as the corresponding .input data.)

  float3 nose;
  float3 forward;

  FILE* f = fopen(filename, "rb");
  assert(f != NULL);
  fread(&nose, sizeof(float3), 1, f);
  fread(&forward, sizeof(float3), 1, f);
  fclose(f);

where float3 is simply the vector type defined in CUDA (include
"vector_types.h"):

  struct float3 { float x, y, z; };

You could replace it by an array float[3]. Writing works analogously:

  FILE* f = fopen(filename, "wb");
  assert(f != NULL);
  fwrite(&nose, sizeof(float3), 1, f);
  fwrite(&forward, sizeof(float3), 1, f);
  fclose(f);

The coordinate system is the same as in the input data: looking at your
screen, x points to the right, y points up, and z points into the screen
(left-handed). Nose gives you the position of the nose tip. Forward
gives you the direction of the face (imagine a vector attached to the
nose tip, pointing in the direction the face is looking).


3) Code Snippet for reading in a range image:

  float orientation[3][3];
  float translation[3];
  float3* cloud;
  int2 resolution;

  FILE* f = fopen(filename, "rb");
  assert(f != NULL);
  fread(orientation, sizeof(float), 9, f);
  fread(translation, sizeof(float), 3, f);
  fread(&resolution, sizeof(int), 2, f);
  cloud = (float3*)malloc(sizeof(float3) * resolution.x * resolution.y);
  fread(cloud, sizeof(float3), resolution.x * resolution.y, f);
  fclose(f);

(int2 is the corresponding two-component integer vector type, also
defined in "vector_types.h"; an int[2] array works as well.)

Translation and orientation form the transformation that maps the range
image coordinates into a global coordinate system shared by all input
range images of the current test person. E.g., you could use this
transformation to draw all frames on top of each other to visualize how
well they are aligned. Transform a point (x, y, z) to the global float3
g like this (treating the nine floats of the orientation matrix as one
flat array m, in the order they were read from the file):

  const float* m = &orientation[0][0];
  g.x = m[0]*x + m[3]*y + m[6]*z;
  g.y = m[1]*x + m[4]*y + m[7]*z;
  g.z = m[2]*x + m[5]*y + m[8]*z;

Then add the translation, if needed. (Not needed for directions.)
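
For convenience, the snippets above can be combined into a complete
little program. The following is only a sketch and not the reference
implementation from GroundTruth / InputCloud: the file name is a
placeholder, the CUDA vector types are replaced by local stand-ins, and
it assumes the flat layout of the orientation matrix shown above.

  #include <assert.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Self-contained stand-ins for the CUDA vector types. */
  struct float3 { float x, y, z; };
  struct int2   { int x, y; };

  int main(void)
  {
      const char* filename = "frame_00000.input";   /* placeholder name */

      float orientation[3][3];
      float translation[3];
      struct int2 resolution;

      FILE* f = fopen(filename, "rb");
      assert(f != NULL);
      fread(orientation, sizeof(float), 9, f);
      fread(translation, sizeof(float), 3, f);
      fread(&resolution, sizeof(int), 2, f);

      struct float3* cloud =
          malloc(sizeof(struct float3) * resolution.x * resolution.y);
      fread(cloud, sizeof(struct float3), resolution.x * resolution.y, f);
      fclose(f);

      printf("resolution: %d x %d\n", resolution.x, resolution.y);

      /* Map every foreground pixel (z >= 0) into the global system. */
      const float* m = &orientation[0][0];
      for (int i = 0; i < resolution.x * resolution.y; ++i) {
          struct float3 p = cloud[i], g;
          if (p.z < 0.0f)
              continue;                     /* background pixel, ignore */
          g.x = m[0]*p.x + m[3]*p.y + m[6]*p.z + translation[0];
          g.y = m[1]*p.x + m[4]*p.y + m[7]*p.z + translation[1];
          g.z = m[2]*p.x + m[5]*p.y + m[8]*p.z + translation[2];
          cloud[i] = g;                     /* now in the shared system */
      }

      free(cloud);
      return 0;
  }
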
However, you most likely won't need this transformation, because the
ground truth is already in the same coordinate system as the input range
image. If your estimation algorithm outputs its estimate in the same
coordinate system as the input range image from our data set, you can
compare it directly to the provided ground truth.
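
For illustration only (this is not the evaluation code of the paper),
such a direct comparison could look like the sketch below. The estimated
nose position and forward direction are placeholders for the output of
your own algorithm, and the file name is just an example.

  #include <assert.h>
  #include <math.h>
  #include <stdio.h>

  /* Self-contained stand-in for the CUDA float3 type. */
  struct float3 { float x, y, z; };

  int main(void)
  {
      /* Ground truth, read as in section 2. */
      struct float3 nose, forward;
      FILE* f = fopen("frame_00000.groundTruth", "rb");  /* placeholder */
      assert(f != NULL);
      fread(&nose, sizeof(struct float3), 1, f);
      fread(&forward, sizeof(struct float3), 1, f);
      fclose(f);

      /* Placeholder estimate, in the same coordinate system as the
         corresponding .input frame. */
      struct float3 est_nose    = { 0.0f, 0.0f, 500.0f };
      struct float3 est_forward = { 0.0f, 0.0f, -1.0f };

      /* Nose error: Euclidean distance between estimated and true tip. */
      float dx = est_nose.x - nose.x;
      float dy = est_nose.y - nose.y;
      float dz = est_nose.z - nose.z;
      float nose_error = sqrtf(dx*dx + dy*dy + dz*dz);

      /* Direction error: angle between the forward vectors, in degrees. */
      float dot = est_forward.x*forward.x + est_forward.y*forward.y
                + est_forward.z*forward.z;
      float na  = sqrtf(est_forward.x*est_forward.x
                      + est_forward.y*est_forward.y
                      + est_forward.z*est_forward.z);
      float nb  = sqrtf(forward.x*forward.x + forward.y*forward.y
                      + forward.z*forward.z);
      float angle = acosf(dot / (na*nb)) * 180.0f / 3.14159265f;

      printf("nose error: %.2f, forward direction error: %.2f degrees\n",
             nose_error, angle);
      return 0;
  }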