LLD - Large Logo Dataset

Alexander Sage, Eirikur Agustsson, Radu Timofte, Luc Van Gool


Logo Synthesis and Manipulation with Clustered Generative Adversarial Networks

arXiv:1712.04407

Designing a logo for a new brand is a lengthy and tedious back-and-forth process between a designer and a client. In this paper we explore to what extent machine learning can solve the creative task of the designer. For this, we build a dataset -- LLD -- of 600k+ logos crawled from the world wide web. Training Generative Adversarial Networks (GANs) for logo synthesis on such multi-modal data is not straightforward and results in mode collapse for some state-of-the-art methods. We propose the use of synthetic labels obtained through clustering to disentangle and stabilize GAN training. We are able to generate a high diversity of plausible logos and we demonstrate latent space exploration techniques to ease the logo design task in an interactive manner. Moreover, we validate the proposed clustered GAN training on CIFAR 10, achieving state-of-the-art Inception scores when using synthetic labels obtained via clustering the features of an ImageNet classifier. GANs can cope with multi-modal data by means of synthetic labels achieved through clustering, and our results show the creative potential of such techniques for logo synthesis and manipulation.

PDF

Logo Generator Codebase

Available on GitHub!

On GitHub we provide the Python code for the logo generator as well as pre-trained models for generating logos based on our LLD dataset.


LLD - Large Logo Dataset v1

The following is the final version of the Large Logo Dataset (LLD), a dataset of 600k+ logos crawled from the internet. This dataset was introduced with our paper Logo Synthesis and Manipulation with Clustered Generative Adversarial Networks.

The dataset consists of two parts, crawled from the Alexa 1M websites list:

  • LLD-icon: 548,210 standardized 32 x 32-pixel favicons, crawled on April 7th 2017
  • LLD-logo: 122,920 high-resolution (mostly 400 x 400 pixels) logos gathered from Twitter on August 17th 2017


LLD-icon

The main features of LLD-icon are:

  • Standardized resolution: 32 x 32 pixels
  • Training-friendly formats:
    • Python Pickle: A sequence of binary Python pickle files, each containing 100,000 logos (except for the last one) as NumPy arrays in a random permutation of the data. The arrays are of standardized shape (32, 32, 3), ready for use in TensorFlow or similar machine learning frameworks.
    • HDF5: Hierarchical Data Format containing images and labels in a single file, designed for flexible and efficient I/O as well as portability. This data format uses CHW (Channels, Height, Width) dimension ordering (3, 32, 32).
    • Single-file format: Single PNG files for maximum flexibility of use
  • Metadata:
    • Indices for subset of (non-upscaled) sharp icons
    • Names/Ranks

The LLD-icon dataset is provided in two versions:

  • Full version containing all favicons we were able to crawl, with duplicates removed. Size: 548,210 images
  • Clean version, optimized for use with GANs: Here we attempted to remove all non-logo-like images, especially natural images like faces, as well as very complex and empty logos. Size: 486,377 images

Data Acquisition

The logos in this dataset were crawled by parsing the HTML pages with scrapy to look for the favicon declaration as described on Wikipedia: Favicon. If this failed, we attempted to download the default URL http://url/favicon.ico.
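
The lookup described above can be sketched as follows. This is not the crawler used for the dataset (which was built on scrapy); it is a small standard-library illustration of the same idea, and the names FaviconLinkParser and favicon_url are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FaviconLinkParser(HTMLParser):
    """Collects href values of <link> tags whose rel attribute contains 'icon'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel may hold several tokens, e.g. "shortcut icon".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "icon" in rel_tokens and attr.get("href"):
            self.hrefs.append(attr["href"])


def favicon_url(page_url, html):
    """Return the declared favicon URL, or the default /favicon.ico location."""
    parser = FaviconLinkParser()
    parser.feed(html)
    if parser.hrefs:
        # Resolve relative hrefs against the page URL.
        return urljoin(page_url, parser.hrefs[0])
    # No declaration found: fall back to the conventional default location.
    return urljoin(page_url, "/favicon.ico")


page = '<html><head><link rel="shortcut icon" href="/static/fav.png"></head></html>'
print(favicon_url("http://example.com/a/b.html", page))
```

Only the first matching declaration is used here; a page may declare several icons of different sizes, and how the original crawler chose among them is not stated.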

After downloading, all logos were directly converted to RGB at a size of 32 x 32 pixels, rejecting all non-square logos. Some statistics:

  • Unreadable image files: 71,596
  • Non-square images: 36,401
  • Total logos saved: 662,273
  • of which duplicates removed: 114,063

After duplicate removal, this resulted in 548,210 logos. The resolution breakdown below refers to the 662,273 saved logos (the three counts sum to that figure), i.e. before duplicates were removed:

  • Native 32p: 158,881 (24%)
  • Downscaled: 148,132 (22%)
  • Upscaled: 355,260 (54%)
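
The filtering rule and the arithmetic behind these statistics can be checked with a few lines of Python. The function classify_icon is a hypothetical helper written for this illustration, not part of the original pipeline; the counts are the ones quoted in the lists above.

```python
TARGET = 32  # side length of the standardized icons, in pixels


def classify_icon(width, height, target=TARGET):
    """Return None for rejected (non-square) images, else how the icon reaches the target size."""
    if width != height:
        return None
    if width == target:
        return "native"
    return "downscaled" if width > target else "upscaled"


# Counts quoted in the statistics above.
saved = 662_273
duplicates = 114_063
by_resolution = {"native": 158_881, "downscaled": 148_132, "upscaled": 355_260}

remaining = saved - duplicates
shares = {name: round(100 * count / saved) for name, count in by_resolution.items()}

print(remaining)                     # logos left after duplicate removal
print(sum(by_resolution.values()))   # equals the number of saved logos
print(shares)                        # percentages relative to the saved logos
```

The percentages only reproduce the quoted values when computed against the saved logos, which is why the breakdown is read here as describing the data before duplicate removal.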

Data Formats

HDF5: Hierarchical Data Format containing images and labels in a single file designed for flexible and efficient I/O as well as portability. This data format uses CHW (Channels, Height, Width) dimension ordering.

PKL: The .pkl file is readable using the Python package pickle or cPickle (see also provided sample code). It contains a NumPy array of shape (number_of_icons, 32, 32, 3) and type uint8.
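
As a rough sketch of reading these files under Python 3 (the provided Script_PKL.py remains the reference): the function names below are illustrative, and passing encoding="latin1" is an assumption that only matters if the files were written with Python 2, where it allows pickled NumPy arrays to be decoded.

```python
import pickle

import numpy as np


def load_icon_pickle(path):
    """Load one .pkl file and return a uint8 array of shape (N, 32, 32, 3)."""
    with open(path, "rb") as f:
        # Assumption: encoding="latin1" is harmless for Python 3 pickles and
        # needed for NumPy arrays pickled under Python 2.
        icons = pickle.load(f, encoding="latin1")
    icons = np.asarray(icons)
    if icons.ndim != 4 or icons.shape[1:] != (32, 32, 3):
        raise ValueError(f"unexpected array shape {icons.shape}")
    return icons


def load_icon_pickles(paths):
    """Concatenate several .pkl parts into one array along the first axis."""
    return np.concatenate([load_icon_pickle(p) for p in paths], axis=0)
```

Pickle files can execute arbitrary code when loaded, so this should only be used on files obtained from a trusted source.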

FILES: Single image files in original format (mostly jpeg or png)

HDF5 File Contents

LLD-icon:

Path                   Description
data                   Image data: array of shape (486377, 3, 32, 32)
meta_data/names        Domain name corresponding to the logo in data with the same index
labels/AE_grayscale    Clustering labels acquired with the AE method (trained on grayscale images) as described in our paper. 100 cluster centers.
labels/resnet/rc_32    Clustering labels acquired with the RC method described in our paper. 32 cluster centers.
labels/resnet/rc_64    Labels for 64 RC clusters.
labels/resnet/rc_128   Labels for 128 RC clusters.

LLD-icon-sharp:

Path                   Description
data                   Image data: array of shape (221369, 3, 32, 32)
meta_data/names        Domain name corresponding to the logo in data with the same index
labels/resnet/rc_16    Clustering labels acquired with the RC method described in our paper. 16 cluster centers.
labels/resnet/rc_32    Labels for 32 RC clusters.
labels/resnet/rc_64    Labels for 64 RC clusters.
labels/resnet/rc_128   Labels for 128 RC clusters.
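
To illustrate how the paths listed above might be used together, the sketch below reads the image array and one set of clustering labels with h5py and selects all icons of one cluster. The dataset paths come from the tables; the file name passed to load_labelled_icons, the function names, and the assumption that each label dataset holds one integer per image are illustrative and should be checked against the provided Script_HDF5.py.

```python
import numpy as np


def load_labelled_icons(path, label_path="labels/resnet/rc_32"):
    """Read images and one label set from an LLD-icon HDF5 file into memory."""
    import h5py  # imported here so the helpers below also work without h5py

    with h5py.File(path, "r") as f:
        data = f["data"][:]         # expected shape (N, 3, 32, 32)
        labels = f[label_path][:]   # assumed: one cluster id per image
    return data, np.asarray(labels).reshape(-1)


def icons_in_cluster(data, labels, cluster_id):
    """Return the images whose clustering label equals cluster_id."""
    labels = np.asarray(labels).reshape(-1)
    return data[np.flatnonzero(labels == cluster_id)]


def cluster_sizes(labels):
    """Map each cluster id to the number of images assigned to it."""
    ids, counts = np.unique(np.asarray(labels).reshape(-1), return_counts=True)
    return {int(i): int(c) for i, c in zip(ids, counts)}
```

A typical call would be data, labels = load_labelled_icons("LLD-icon.hdf5") followed by icons_in_cluster(data, labels, 0), where the file name stands for wherever the download was saved.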

Download

Data

LLD-icon HDF5 (486,377 logos) [762MB]
LLD-icon-sharp HDF5 (221,369 logos) [300MB]

LLD-icon PKL (486,377 logos) [663MB]
LLD-icon PKL full data (548,210 logos) [809MB]

LLD-icon FILES (486,377 logos) [775MB]
LLD-icon FILES full data (548,210 logos) [915MB]

Meta-Data

Indices for LLD-icon-sharp (pickle format) [1MB]

Domain names for LLD-icon
Domain names for LLD-icon full data

Sample scripts for loading the data in Python:

Script_HDF5.py    Requirements: NumPy, H5py
Script_PKL.py     Requirements: NumPy

A sample of 5000 logos from LLD-icon (PNG):

LLD-icon sample (5000 logos) [8MB]

LLD-logo

The main features of LLD-logo are:

  • Kept at original resolution, between 64 and 400 pixels.
  • Training-friendly formats:
    • HDF5: Hierarchical Data Format containing images and metadata in a single file, designed for flexible and efficient I/O as well as portability. This data format uses CHW (Channels, Height, Width) dimension ordering. Because of the varying resolutions, the images are zero-padded (more on this below).
    • Single-file format: Single image files for maximum flexibility of use (various image formats)
  • Rich metadata as described in the table below

The LLD-logo dataset is provided in two versions:

  • Cleaned version as used in our paper (recommended) in HDF5 and single-file format
  • Raw image files as originally scraped from the web. Note: This dataset contains a large portion of images that are not logos and might be of an undesired nature (e.g. pornographic content).

Data Formats

HDF5: Hierarchical Data Format containing images, labels, original domain names and additional Twitter metadata in a single file, designed for flexible and efficient I/O as well as portability. This data format uses CHW (Channels, Height, Width) dimension ordering and zero-padding to a uniform size of 400x400 pixels. The original image resolution is stored in the "shapes" dataset and can be used to easily remove the zero-padding.

FILES-PNG: Single image files in PNG format

FILES: Single image files in original format (mostly jpeg or png)

HDF5 File Contents

Path                                   Description
data                                   Image data: array of shape (122920, 3, 400, 400). Smaller images were zero-padded to 400x400 pixels.
shapes                                 Image shapes: array of shape (122920, 1, 1, 1) containing the original image array shapes before zero-padding.
meta_data/names                        Domain name corresponding to the logo in data with the same index
meta_data/ids                          Twitter user ID corresponding to the logo in data with the same index
meta_data/user_object                  Serialized tweepy user object (JSON format) corresponding to the logo in data with the same index
labels/ae_grayscale_64px/clusters_64   Clustering labels acquired with the AE method (trained on grayscale images downsampled to 64 pixels) as described in our paper. 64 cluster centers.
labels/resnet/rc_32                    Clustering labels acquired with the RC method described in our paper. 32 cluster centers.
labels/resnet/rc_64                    Labels for 64 RC clusters.
labels/resnet/rc_128                   Labels for 128 RC clusters.
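
Removing the zero-padding amounts to cropping each image back to its stored height and width. The sketch below shows only the cropping step and makes two assumptions that are not confirmed by this page: that the original image sits in the top-left corner of the 400x400 canvas, and that the height and width have already been extracted from the "shapes" dataset (its exact layout should be checked against the provided Script_LLD-logo.py). The function name remove_padding is hypothetical.

```python
import numpy as np


def remove_padding(image, height, width):
    """Crop a zero-padded (C, H_pad, W_pad) image to its original (C, height, width).

    Assumes the original content occupies the top-left corner of the padded array.
    """
    if image.ndim != 3:
        raise ValueError("expected a CHW image")
    if not (0 < height <= image.shape[1] and 0 < width <= image.shape[2]):
        raise ValueError("original size does not fit inside the padded image")
    return image[:, :height, :width]


# Stand-in for one entry of the "data" dataset: a 64 x 100 image padded to 400 x 400.
padded = np.zeros((3, 400, 400), dtype=np.uint8)
padded[:, :64, :100] = 7

logo = remove_padding(padded, 64, 100)
logo_hwc = np.transpose(logo, (1, 2, 0))  # HWC ordering, e.g. for plotting
print(logo.shape, logo_hwc.shape)
```

If the pixels are stored as 8-bit values, the full data array corresponds to roughly 59 GB uncompressed (122,920 x 3 x 400 x 400 bytes), so reading and cropping one image at a time from the HDF5 file is likely more practical than loading the whole array.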

Download

LLD-logo HDF5 (122,920 logos) [13GB]

LLD-logo FILES-PNG (122,920 logos) [6.2GB]

Full unprocessed image data (Warning: contains a large number of non-logo images, see notes above):

LLD-logo-full_data FILES (182,998 logos) [5.7GB]

Metadata for single-file versions of LLD-logo (dict indexed with file names):

LLD-logo_metadata (pkl) [166MB]

A sample script for loading the data in Python:
Script_LLD-logo.py Requirements: NumPy, H5py

A sample of 500 logos from LLD-logo (PNG):
LLD-logo sample (500 logos) [27MB]

License

Please note that this dataset is made available for academic research purposes only. All images were collected from the Internet, and the copyright belongs to the original owners. If any of the images belongs to you and you would like it removed, please inform us and we will remove it from our dataset immediately.

Citation

Please add a reference if you are using the dataset:

@misc{sage2017logodataset,
  author = {Sage, Alexander and Agustsson, Eirikur and Timofte, Radu and Van Gool, Luc},
  title = {LLD - Large Logo Dataset - version 0.1},
  year = {2017},
  howpublished = "\url{https://data.vision.ee.ethz.ch/cvl/lld}"
}

Last updated: 26 September 2018