LLD - Large Logo Dataset

LLD - Large Logo Dataset v1

The following is the final version of the Large Logo Dataset (LLD), a dataset of 600k+ logos crawled from the internet. This dataset was introduced with our paper Logo Synthesis and Manipulation with Clustered Generative Adversarial Networks.

The dataset consists of two parts, crawled from the Alexa 1M websites list:

LLD-icon: 548,210 standardized 32 x 32-pixel favicons, crawled on April 7th 2017
LLD-logo: 122,920 high-resolution (mostly 400 x 400 pixels) logos gathered from Twitter on August 17th 2017

LLD-icon

The main features of LLD-icon are:

Standardized resolution: 32 x 32 pixels
Training-friendly formats:
- Python Pickle: A sequence of binary Python pickle files, each containing 100,000 logos (except for the last one) as NumPy arrays in a random permutation of the data. The arrays have the standardized shape (32, 32, 3), ready for use in TensorFlow or similar machine learning frameworks.
- HDF5: Hierarchical Data Format containing images and labels in a single file designed for flexible and efficient I/O as well as portability. This data format uses CHW (Channels, Height, Width) dimension ordering (3, 32, 32).
- Single-file format: Single PNG files for maximum flexibility of use
Metadata:

Indices for subset of (non-upscaled) sharp icons
Names/Ranks

The LLD-icon dataset is provided in two versions:

Full version containing all favicons we were able to crawl, with duplicates removed. Size: 548,210 images
Clean version, optimized for use with GANs: Here we attempted to remove all non-logo-like images, especially natural images like faces, as well as very complex and empty logos. Size: 486,377 images

Data Acquisition

The logos in this dataset were crawled by parsing the HTML pages with scrapy to look for the favicon declaration as described on Wikipedia: Favicon. If this failed, we attempted to download the default URL http://url/favicon.ico.
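The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the crawler used for the dataset: it swaps scrapy for the standard-library `html.parser`, and the names `FaviconLinkParser` and `favicon_candidates` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FaviconLinkParser(HTMLParser):
    """Collects href values of <link> tags whose rel attribute mentions 'icon'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        # Matches rel="icon" as well as the legacy rel="shortcut icon".
        if "icon" in rel and attr.get("href"):
            self.hrefs.append(attr["href"])


def favicon_candidates(page_url, html):
    """Return declared favicon URLs, followed by the default /favicon.ico location."""
    parser = FaviconLinkParser()
    parser.feed(html)
    declared = [urljoin(page_url, href) for href in parser.hrefs]
    fallback = urljoin(page_url, "/favicon.ico")
    # The default location is only tried after every declared icon.
    return declared + ([fallback] if fallback not in declared else [])
```

A caller would try the returned URLs in order and keep the first one that downloads and decodes as an image.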

After downloading, all logos were directly converted to RGB at a size of 32 x 32 pixels, rejecting all non-square logos. Some statistics:

Unreadable image files: 71,596
Non-square images: 36,401
Total logos saved: 662,273
of which duplicates removed: 114,063

After duplicate removal, this resulted in 548,210 logos. The 662,273 saved logos (before duplicate removal) had the following resolutions:

Native 32p: 158,881 (24%)
Downscaled: 148,132 (22%)
Upscaled: 355,260 (54%)
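The three resolution categories and the non-square rejection can be illustrated with the sketch below. It assumes Pillow for image handling and bicubic resampling; the page does not state which library or filter was used, and the function names are hypothetical.

```python
ICON_SIZE = 32


def classify_resolution(width, height, target=ICON_SIZE):
    """Label an image relative to the target size; None means it is rejected."""
    if width != height:
        return None  # non-square images are rejected
    if width == target:
        return "native"
    return "downscaled" if width > target else "upscaled"


def to_icon(path, target=ICON_SIZE):
    """Open an image and return a target x target RGB copy, or None if non-square."""
    # Pillow is imported lazily so classify_resolution works without it.
    from PIL import Image

    with Image.open(path) as img:
        if classify_resolution(*img.size, target) is None:
            return None
        # Assumption: bicubic resampling; the original filter is not documented.
        return img.convert("RGB").resize((target, target), Image.BICUBIC)
```

Icons in the "upscaled" category started below 32 pixels and tend to look blurry, which is presumably why a separate index of non-upscaled, sharp icons is provided.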

Data Formats

HDF5: Hierarchical Data Format containing images and labels in a single file designed for flexible and efficient I/O as well as portability. This data format uses CHW (Channels, Height, Width) dimension ordering.

PKL: The .pkl file is readable using the Python package pickle or cPickle (see also the provided sample code). It contains a NumPy array of shape (number_of_icons, 32, 32, 3) and type uint8.

FILES: Single image files in original format (mostly jpeg or png)
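As a rough sketch of working with the PKL format, the snippet below loads a list of pickle files, concatenates them, and converts the HWC arrays to the CHW ordering used by the HDF5 files. The function names are hypothetical, the file paths are left to the caller, and the `encoding="latin1"` argument is an assumption that the files may have been written under Python 2 (suggested by the mention of cPickle).

```python
import pickle

import numpy as np


def load_icon_pickles(paths):
    """Load one or more pickle files and concatenate them along the first axis."""
    parts = []
    for path in paths:
        with open(path, "rb") as f:
            # encoding="latin1" lets Python 3 read NumPy arrays pickled under
            # Python 2; it is harmless for pickles written by Python 3.
            parts.append(pickle.load(f, encoding="latin1"))
    return np.concatenate(parts, axis=0)


def hwc_to_chw(icons):
    """Reorder (N, 32, 32, 3) arrays to (N, 3, 32, 32)."""
    return np.transpose(icons, (0, 3, 1, 2))
```

Only unpickle files obtained from a source you trust, since `pickle.load` can execute arbitrary code.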

HDF5 File Contents

LLD-icon:

Path	Description
data	Image Data - array of shape (486377, 3, 32, 32)
meta_data/names	Domain name corresponding to the logo in data with same index
labels/AE_grayscale	Clustering labels acquired with AE-method (trained on grayscale images) as described in our paper. 100 cluster centers.
labels/resnet/rc_32	Clustering labels acquired with RC-method described in our paper. 32 cluster centers.
labels/resnet/rc_64	Labels for 64 RC-clusters.
labels/resnet/rc_128	Labels for 128 RC-clusters.

LLD-icon-sharp

Path	Description
data	Image Data - array of shape (221369, 3, 32, 32)
meta_data/names	Domain name corresponding to the logo in data with same index
labels/resnet/rc_16	Clustering labels acquired with RC-method described in our paper. 16 cluster centers.
labels/resnet/rc_32	Labels for 32 RC-clusters.
labels/resnet/rc_64	Labels for 64 RC-clusters.
labels/resnet/rc_128	Labels for 128 RC-clusters.
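The paths in the tables above could be read along these lines. This is a sketch assuming h5py and that each label dataset holds one integer per image; the function names and the `limit` parameter are hypothetical, and the provided Script_HDF5.py remains the reference.

```python
import numpy as np


def chw_to_hwc(images):
    """Reorder (N, 3, H, W) arrays to (N, H, W, 3), e.g. for plotting."""
    return np.transpose(images, (0, 2, 3, 1))


def load_icons_hdf5(path, label_path="labels/resnet/rc_32", limit=None):
    """Read images and one set of cluster labels from an LLD-icon HDF5 file."""
    import h5py  # imported lazily; only needed when reading from disk

    with h5py.File(path, "r") as f:
        images = f["data"][:limit]  # limit=None reads the whole dataset
        labels = np.asarray(f[label_path][:limit])
    return chw_to_hwc(images), labels


def images_in_cluster(images, labels, cluster):
    """Select the images whose label equals the given cluster id."""
    # Assumption: one label per image, possibly stored with extra singleton axes.
    return images[np.asarray(labels).reshape(-1) == cluster]
```

Passing a small `limit` is a cheap way to inspect the file without loading all images into memory.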

Download

Data

LLD-icon HDF5 (486,377 logos) [762MB]
LLD-icon-sharp HDF5 (221,369 logos) [300MB]

LLD-icon PKL (486,377 logos) [663MB]
LLD-icon PKL full data (548,210 logos) [809MB]

LLD-icon FILES (486,377 logos) [775MB]
LLD-icon FILES full data (548,210 logos) [915MB]

Meta-Data

Indices for LLD-icon-sharp (pickle format) [1MB]
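A sharp subset could be selected from a loaded icon array roughly as follows. The sketch assumes that the index file unpickles to a flat sequence of integer positions into the icon array; the page does not document the exact layout or which version of LLD-icon the indices refer to, so both should be checked against the actual files. The function names are hypothetical.

```python
import pickle

import numpy as np


def load_indices(path):
    """Load a pickled sequence of integer indices as an int64 array."""
    with open(path, "rb") as f:
        # Assumption: the file holds a flat list or array of integers.
        return np.asarray(pickle.load(f, encoding="latin1"), dtype=np.int64)


def select_subset(icons, indices):
    """Return the icons at the given positions, in index order."""
    return icons[indices]
```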

Domain names for LLD-icon
Domain names for LLD-icon full data

Sample scripts for loading the data in Python:

Script_HDF5.py Requirements: NumPy, h5py
Script_PKL.py Requirements: NumPy

A sample of 5000 logos from LLD-icon (PNG):

LLD-icon sample (5000 logos) [8MB]

LLD-logo

The main features of LLD-logo are:

Kept at original resolution, between 64 and 400 pixels.
Training-friendly formats:
- HDF5: Hierarchical Data Format containing images and metadata in a single file designed for flexible and efficient I/O as well as portability. This data format uses CHW (Channels, Height, Width) dimension ordering. Because of the varying resolutions, the images are zero-padded (more on this below).
- Single-file format: Single image files for maximum flexibility of use (various image formats)
Rich metadata as described in the table below

The LLD-logo dataset is provided in two versions:

Cleaned version as used in our paper (recommended) in HDF5 and single-file format
Raw image files as originally scraped from the web. Note: This dataset contains a large portion of images that are not logos and might be of an undesired nature (e.g. pornographic content).

Data Formats

HDF5: Hierarchical Data Format containing images, labels, original domain names and additional Twitter metadata in a single file, designed for flexible and efficient I/O as well as portability. This data format uses CHW (Channels, Height, Width) dimension ordering and zero-padding to a uniform size of 400x400 pixels. The original image resolution is stored in the "shapes" dataset and can be used to easily remove the zero-padding.

FILES-PNG: Single image files in PNG format

FILES: Single image files in original format (mostly jpeg or png)

HDF5 File Contents

Path	Description
data	Image Data - array of shape (122920, 3, 400, 400). Smaller images were zero-padded to 400x400 pixels.
shapes	Image shapes - array of shape (122920, 1, 1, 1) containing the original image array shapes before zero padding.
meta_data/names	Domain name corresponding to the logo in data with same index
meta_data/ids	Twitter user-id corresponding to the logo in data with same index
meta_data/user_object	Serialized tweepy user object (JSON format) corresponding to the logo in data with same index
labels/ae_grayscale_64px/clusters_64	Clustering labels acquired with AE-method (trained on grayscale images downsampled to 64 pixels) as described in our paper. 64 cluster centers.
labels/resnet/rc_32	Clustering labels acquired with RC-method described in our paper. 32 cluster centers.
labels/resnet/rc_64	Labels for 64 RC-clusters.
labels/resnet/rc_128	Labels for 128 RC-clusters.
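Removing the zero-padding might look like the sketch below. It assumes the original image sits in the top-left corner of the 400x400 array with padding to the right and bottom, which the page does not state. How the height and width are read from the "shapes" dataset depends on its actual layout, so they are passed in explicitly here; `remove_padding` is a hypothetical name.

```python
import numpy as np


def remove_padding(padded, height, width):
    """Crop a zero-padded CHW image back to its original height and width."""
    # Assumption: the image occupies the top-left corner of the padded array.
    return padded[:, :height, :width]


def to_hwc(image):
    """Reorder a single (3, H, W) image to (H, W, 3), e.g. for display."""
    return np.transpose(image, (1, 2, 0))
```

Cropping by the stored shape is safer than trimming all-zero rows and columns, since a logo with a black border would otherwise be cut.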

Download

LLD-logo HDF5 (122,920 logos) [13GB]

LLD-logo FILES-PNG (122,920 logos) [6.2GB]

Full unprocessed image data (Warning: contains a high amount of non-logo images, see notes above):
LLD-logo-full_data FILES (182,998 logos) [5.7GB]

Metadata for single-file versions of LLD-logo (dict indexed with file names):
LLD-logo_metadata (pkl) [166MB]

A sample script for loading the data in Python:
Script_LLD-logo.py Requirements: NumPy, h5py

A sample of 500 logos from LLD-logo (PNG):
LLD-logo sample (500 logos) [27MB]

License

Please note that this dataset is made available for academic research purposes only. All images were collected from the Internet, and the copyright belongs to the original owners. If any of the images belongs to you and you would like it removed, please let us know and we will remove it from our dataset immediately.

Citation

Please add a reference if you are using the dataset:

@misc{sage2017logodataset,
author={Sage, Alexander and Agustsson, Eirikur and Timofte, Radu and Van Gool, Luc},
title = {LLD - Large Logo Dataset - version 0.1},
year = {2017}, 
howpublished = "\url{https://data.vision.ee.ethz.ch/cvl/lld}"}

Last updated: 26. September 2018
