Language Dataset

Posted on 2023-11-30, 08:48. Authored by David Towers, Rob Geada, Amir Atapour-Abarghouei, Andrew Stephen McGough

This dataset contains the images and labels for the Language data used in the CVPR NAS workshop Unseen-data challenge, under the codename "LaMelo".

The Language dataset is a constructed dataset built from words in aspell dictionaries. The intention of this dataset is to require machine learning models to perform not only image classification but also linguistic analysis, working out which letter frequencies are associated with each language. For each Language image we selected four six-letter words using the standard Latin alphabet, and removed any words with letters that used diacritics (such as é or ü) or that included 'y' or 'z'.
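The word-selection rule described above can be sketched as a simple filter. This is a minimal illustration, not the authors' actual preprocessing: the function name `is_valid_word` and the sample word list are hypothetical, and the sketch assumes words are compared case-insensitively.

```python
import string

# Letters 'a' through 'x': the standard Latin alphabet without 'y' and 'z'.
# Any character outside this set (including é, ü, etc.) is rejected.
ALLOWED = set(string.ascii_lowercase[:24])


def is_valid_word(word: str) -> bool:
    """Return True for six-letter words made only of the letters a-x."""
    w = word.lower()
    return len(w) == 6 and all(ch in ALLOWED for ch in w)


# Hypothetical sample words, used only for illustration.
candidates = ["planet", "café", "zebras", "yellow", "garden", "über", "market"]
valid = [w for w in candidates if is_valid_word(w)]
```

Here `valid` keeps only the six-letter words with no diacritics and no 'y' or 'z'.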

We encode these words on a graph with one axis representing the index into the 24-character string (the four words joined together) and the other representing the letter (A to X).
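As a rough sketch of this encoding, the snippet below builds a 24 x 24 grid from four words. Several details are assumptions, since the description does not state them: rows are taken to be letters and columns to be string positions, marked cells are set to 1 on a zero background, and the function name `encode_words` and the sample words are hypothetical.

```python
import numpy as np


def encode_words(words):
    """Encode four six-letter words (letters a-x) as a 24 x 24 grid.

    Assumed layout: row = letter index (a=0 ... x=23),
    column = position in the joined 24-character string.
    """
    text = "".join(words).lower()
    if len(text) != 24:
        raise ValueError("expected four six-letter words (24 characters)")

    grid = np.zeros((24, 24), dtype=np.uint8)
    for position, ch in enumerate(text):
        letter_index = ord(ch) - ord("a")
        if not 0 <= letter_index < 24:
            raise ValueError(f"letter {ch!r} is outside the range a-x")
        grid[letter_index, position] = 1  # assumed marker value
    return grid


# Hypothetical example words, used only for illustration.
grid = encode_words(["planet", "garden", "market", "silver"])
```

Under these assumptions each column contains exactly one marked cell, so the grid always has 24 non-zero entries.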

The data is in a channels-first format with a shape of (n, 1, 24, 24) where n is the number of samples in the corresponding set (50,000 for training, 10,000 for validation, and 10,000 for testing).
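The channels-first layout matters when passing the arrays to a framework that expects a different layout. The sketch below uses a placeholder array of zeros rather than the real files (the file names and storage format are not given here), and shows two common conversions with NumPy.

```python
import numpy as np

# Placeholder batch standing in for real data: (n, channels, height, width).
n = 8
batch = np.zeros((n, 1, 24, 24), dtype=np.float32)

# Channels-last layout, (n, 24, 24, 1), as expected by some frameworks.
channels_last = np.transpose(batch, (0, 2, 3, 1))

# Flattened layout, (n, 576), for models that take a feature vector.
flat = batch.reshape(n, -1)
```

Frameworks that already expect channels-first input can consume the `(n, 1, 24, 24)` array without any transposition.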

There are ten classes in the dataset, with 7,000 examples of each, distributed evenly between the three subsets.

The ten classes and corresponding numerical label are as follows:

English: 0,
Dutch: 1,
German: 2,
Spanish: 3,
French: 4,
Portuguese: 5,
Swahili: 6,
Zulu: 7,
Finnish: 8,
Swedish: 9
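
For convenience, the label table above can be written as a lookup in code. The dictionary contents come directly from the list; the names `LABEL_TO_LANGUAGE`, `LANGUAGE_TO_LABEL`, and `decode_labels` are hypothetical helpers and not part of the dataset.

```python
# Numerical label -> language name, as listed in the dataset description.
LABEL_TO_LANGUAGE = {
    0: "English",
    1: "Dutch",
    2: "German",
    3: "Spanish",
    4: "French",
    5: "Portuguese",
    6: "Swahili",
    7: "Zulu",
    8: "Finnish",
    9: "Swedish",
}

# Reverse lookup: language name -> numerical label.
LANGUAGE_TO_LABEL = {name: label for label, name in LABEL_TO_LANGUAGE.items()}


def decode_labels(labels):
    """Convert an iterable of integer labels into language names."""
    return [LABEL_TO_LANGUAGE[int(label)] for label in labels]


names = decode_labels([0, 6, 9])
```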
