The goal of this challenge is to advance the area of learning knowledge and representations from web data. Web data not only contains huge numbers of images but also rich meta information about them, which can be exploited to learn good representations and models.
The WebVision dataset is composed of training, validation, and test sets. The training set is downloaded from the Web without any human annotation. The validation and test sets are human-annotated; the labels of the validation data are provided, while the labels of the test data are withheld. To imitate the setting of learning from web data, participants are required to learn their models solely on the training set and submit classification results on the test set. The validation set may only be used to evaluate algorithms during development (see details in the Honor Code).

Each submission must produce a list of 5 labels, in descending order of confidence, for each image. Recognition accuracy is evaluated on the predicted label that best matches the ground-truth label of the image. Specifically, for each image an algorithm produces a label list \(c_i\), \(i = 1, \dots, 5\), and the image has ground-truth labels \(y_j\), \(j = 1, \dots, n\), with \(n\) class labels. The error of this prediction is defined as $$E = \frac{1}{n} \sum_{j=1}^{n} \min_{i} d(c_i, y_j),$$ where \(d(c_i, y_j) = 0\) if \(c_i = y_j\) and 1 otherwise. Since different concepts have different numbers of test images in the WebVision 2.0 dataset, we first compute the mean error for each concept individually; the final error is the average of these per-concept mean errors across all classes. For this version of the challenge, each image has exactly one ground-truth label (i.e., \(n = 1\)), so the per-image error is 0 exactly when the true label appears among the 5 predictions, i.e., the standard top-5 error.
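For concreteness, the following is a minimal sketch (not the official evaluation script) of how this per-class top-5 error could be computed for the \(n = 1\) case. The dictionary-based input format (`predictions` mapping image id to a ranked list of 5 labels, `ground_truth` mapping image id to its single true label) is an assumption for illustration:

```python
import numpy as np
from collections import defaultdict


def top5_error(predictions, ground_truth):
    """Per-class mean top-5 error, averaged across all classes.

    predictions: dict mapping image id -> list of 5 predicted labels,
                 ordered by descending confidence (assumed format).
    ground_truth: dict mapping image id -> single true label (n = 1).
    """
    errors_per_class = defaultdict(list)
    for img_id, true_label in ground_truth.items():
        top5 = predictions[img_id]
        # min_i d(c_i, y) is 0 if any of the 5 predictions matches
        # the ground-truth label, and 1 otherwise.
        err = 0.0 if true_label in top5 else 1.0
        errors_per_class[true_label].append(err)
    # Mean error for each concept, then average over all classes,
    # so classes with few test images are not underweighted.
    per_class_means = [np.mean(errs) for errs in errors_per_class.values()]
    return float(np.mean(per_class_means))
```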