Gutenberg Dataset

posted on 2023-11-30, 08:49 authored by David Towers, Rob Geada, Amir Atapour-Abarghouei, Andrew Stephen McGough

Dataset containing the images and labels for the Gutenberg data used in the CVPR NAS workshop Unseen-data challenge under the codename "Gutenberg", which we decided to keep as the official name.

The Gutenberg dataset is a constructed dataset containing phrases from famous literary works made available by Project Gutenberg (www.gutenberg.org), which provides free ebooks of literary works that are no longer under US copyright protection*. Because the name describes the content, we kept the code name Gutenberg that we used in the competition.

This dataset was created by accessing several works by six popular authors (see label mapping below). The works downloaded were English translations chosen to represent a variety of cultures and time periods. We performed basic text preprocessing over each text, removing punctuation, converting letters with diacritics to the base letter, and removing "structure" words (e.g., 'Chapter', 'Scene', 'Prologue').
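The preprocessing steps described above could be sketched roughly as follows. This is not the authors' code: the function names are hypothetical, the structure-word list contains only the three examples named in the text, and deleting punctuation outright (rather than replacing it with spaces) is an assumption.

```python
import unicodedata

# Hypothetical list: only the three examples given in the description.
STRUCTURE_WORDS = {"chapter", "scene", "prologue"}


def strip_diacritics(text: str) -> str:
    """Map letters with diacritics to their base letter (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_punctuation(text: str) -> str:
    """Delete every character in a Unicode punctuation category (P*)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def preprocess(text: str) -> list[str]:
    """Return the cleaned words of a text in their original order."""
    cleaned = remove_punctuation(strip_diacritics(text))
    return [w for w in cleaned.split() if w.lower() not in STRUCTURE_WORDS]
```

For example, `preprocess("Chapter I. The café's owner")` yields the words `I`, `The`, `cafes`, and `owner`.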

We then extracted sequences of three consecutive words, each between 3 and 6 letters long. In each sequence, every word was padded with spaces to 6 characters, and the three padded words were concatenated into an 18-character string. These strings were used as the base for image creation. Training, validation, and test sequences were chosen so that no sequence appears in more than one data split.
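A minimal sketch of the sequence extraction, under stated assumptions: the window slides one word at a time over the cleaned word list, and words are right-padded (spaces appended after the word). The description does not specify either detail, and `make_strings` is a hypothetical name.

```python
def make_strings(words, min_len=3, max_len=6, n_words=3):
    """Build 18-character strings from runs of three consecutive words.

    A window is kept only if every word in it has between min_len and
    max_len letters. Each word is right-padded with spaces to max_len
    characters (assumed padding side) before concatenation.
    """
    strings = []
    for i in range(len(words) - n_words + 1):
        window = words[i:i + n_words]
        if all(min_len <= len(w) <= max_len for w in window):
            strings.append("".join(w.ljust(max_len) for w in window))
    return strings


example = make_strings(["the", "quick", "brown", "fox", "a", "jumped"])
```

Here `example` holds two strings, `"the   quick brown "` and `"quick brown fox   "`; the windows containing the one-letter word `a` are skipped.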

These strings are encoded into images as a graph: the x-axis represents each character's index in the 18-character string, and the y-axis represents the corresponding letter (the axis is arranged alphabetically A-Z, with a space represented underneath the Z).
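One way to read this encoding is as a 27 x 18 one-hot grid with one marked pixel per column. The sketch below assumes row 0 is A, row 25 is Z, and row 26 is the space, and that a marked pixel has value 1; the actual row orientation and pixel values in the released files may differ. The names `ALPHABET`, `encode`, and `decode` are hypothetical.

```python
import numpy as np

# Assumed row order: A-Z in rows 0-25, space in row 26.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "


def encode(s: str) -> np.ndarray:
    """Encode an 18-character string as a (27, 18) one-hot image."""
    if len(s) != 18:
        raise ValueError("expected an 18-character string")
    image = np.zeros((27, 18), dtype=np.uint8)
    for col, ch in enumerate(s.upper()):
        # str.index raises ValueError for characters outside A-Z and space.
        image[ALPHABET.index(ch), col] = 1
    return image


def decode(image: np.ndarray) -> str:
    """Recover the (upper-case) string from a (27, 18) one-hot image."""
    return "".join(ALPHABET[row] for row in image.argmax(axis=0))


img = encode("the   quick brown ")
```

Because each column contains exactly one marked pixel, `decode(img)` returns the original string in upper case.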

The data is in a channels-first format with a shape of (n, 1, 27, 18) where n is the number of samples in the corresponding set (45,000 for training, 15,000 for validation, and 6,000 for testing).
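Some frameworks expect channels-last input. A small NumPy sketch of the conversion is shown below; `to_channels_last` is a hypothetical helper, and the zero-filled array merely stands in for data loaded from the dataset files, whose names are not given here.

```python
import numpy as np


def to_channels_last(images: np.ndarray) -> np.ndarray:
    """Convert (n, 1, 27, 18) channels-first images to (n, 27, 18, 1)."""
    if images.ndim != 4 or images.shape[1:] != (1, 27, 18):
        raise ValueError(f"unexpected shape {images.shape}")
    return np.transpose(images, (0, 2, 3, 1))


# Stand-in batch with the documented per-sample shape.
batch = np.zeros((4, 1, 27, 18), dtype=np.uint8)
converted = to_channels_last(batch)
```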

There are six classes in the dataset, with 11,000 examples of each. Every class is equally represented within each of the three subsets (7,500 training, 2,500 validation, and 1,000 test examples per class).
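The split sizes and class counts can be cross-checked with a few lines of arithmetic. The sketch assumes that "distributed evenly" means each class is equally represented within each subset; `is_balanced` is a hypothetical helper, and the label array used to exercise it is synthetic.

```python
import numpy as np

SPLIT_SIZES = {"train": 45_000, "validation": 15_000, "test": 6_000}
N_CLASSES = 6

# Examples per class in each subset, assuming balanced classes.
PER_CLASS = {name: size // N_CLASSES for name, size in SPLIT_SIZES.items()}
TOTAL_PER_CLASS = sum(PER_CLASS.values())


def is_balanced(labels: np.ndarray, split: str) -> bool:
    """Check that every class occurs the expected number of times in a split."""
    counts = np.bincount(labels, minlength=N_CLASSES)
    return len(counts) == N_CLASSES and bool((counts == PER_CLASS[split]).all())


# Synthetic labels shaped like the test subset, for illustration only.
fake_test_labels = np.repeat(np.arange(N_CLASSES), 1_000)
```

Under this reading, the per-class counts are 7,500 (training), 2,500 (validation), and 1,000 (test), which sum to the stated 11,000 examples per class.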

The six classes and corresponding numerical label are as follows:

aquinas: 0,
confucius: 1,
hawthorne: 2,
plato: 3,
shakespeare: 4,
tolstoy: 5
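For convenience, the mapping above can be written as a Python dictionary together with its inverse. The names `LABELS`, `AUTHORS`, and `author_of` are hypothetical and not part of the dataset.

```python
LABELS = {
    "aquinas": 0,
    "confucius": 1,
    "hawthorne": 2,
    "plato": 3,
    "shakespeare": 4,
    "tolstoy": 5,
}

# Inverse lookup: numerical label -> author name.
AUTHORS = {index: name for name, index in LABELS.items()}


def author_of(label) -> str:
    """Return the author name for a numerical label (int or NumPy integer)."""
    return AUTHORS[int(label)]
```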

*These works fall into the public domain under US copyright law; please check whether they are also in the public domain under your own country's copyright laws.