NEC Laboratories America, Inc.

The ABCDETC. dataset


The ABCDETC. dataset is a collection of handwritten digits (0-9), uppercase (A-Z) and lowercase (a-z) letters, and a selection of symbols (, . ! ? ; : = - + / ( ) $ % " and @). There are 78 classes in all.

Here is part of a typical scanned sheet used to compile the ABCDETC dataset (the fourth row tells the subject what to write in the five rows below it). Click here to see an example of a full scanned sheet, and here to see an empty sheet.



The data was collected from the handwriting of researchers and employees at NEC Laboratories America, Princeton, the Max-Planck Institute for Biological Cybernetics in Tuebingen, and Pennsylvania State University. Subjects wrote five versions of each symbol in pen on a single gridded sheet. The sheets were scanned at 300 dpi, and the symbols are stored as 100 x 100 patches, which were automatically extracted and then centered using the center of mass of the pixels. The first release of the dataset has 51 subjects, giving 19,646 examples after outlier removal. We plan to expand the dataset to include more subjects.
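The page does not say how the centering was implemented. As a rough illustration only, here is one way a patch could be centered on its center of mass. The function name center_by_mass is hypothetical, and the sketch assumes ink pixels have positive values on a zero background and that shifts are rounded to whole pixels.

```python
def center_by_mass(patch):
    """Shift a 2-D grid of ink intensities so its center of mass lands on the grid center.

    Assumptions (not from the dataset page): background is 0, ink is positive,
    and ink shifted outside the grid is simply dropped.
    """
    height, width = len(patch), len(patch[0])
    total = sum(sum(row) for row in patch)
    if total == 0:
        return [row[:] for row in patch]  # blank patch: nothing to center

    # Intensity-weighted mean row and column.
    row_mass = sum(r * sum(row) for r, row in enumerate(patch)) / total
    col_mass = sum(c * v for row in patch for c, v in enumerate(row)) / total

    # Whole-pixel offsets that move the center of mass to the middle of the grid.
    d_row = round((height - 1) / 2 - row_mass)
    d_col = round((width - 1) / 2 - col_mass)

    centered = [[0] * width for _ in range(height)]
    for r, row in enumerate(patch):
        for c, v in enumerate(row):
            nr, nc = r + d_row, c + d_col
            if v and 0 <= nr < height and 0 <= nc < width:
                centered[nr][nc] = v
    return centered


# Tiny 5 x 5 stand-in for a 100 x 100 patch, with one ink pixel near the top-left.
patch = [[0] * 5 for _ in range(5)]
patch[0][1] = 1
centered = center_by_mass(patch)
```

The same function works unchanged on a 100 x 100 grid; the small grid is used only to keep the example easy to check by hand.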

Download

Click here to download the dataset in the SVMLight/LIBSVM sparse ASCII format.

The format is one example per line. The first number on each line is the label (0-77), assigned in the same order as on the sheet itself, starting from the first row: A-Z, a-z, 0-9 and then the symbols. The rest of the line holds the pixel features, stored in a sparse format as <feature number>:<value> pairs. Feature 10000 is the ID of the subject (0-51).
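A minimal sketch of reading one line of this format in plain Python follows. The names parse_line, CLASSES and SUBJECT_FEATURE are hypothetical, the example line is made up rather than taken from the dataset, and the order of the symbols within the last 16 labels is an assumption based on the order in which they are listed in the description above.

```python
import string

# Assumption: symbols are labeled in the order they are listed in the description.
SYMBOLS = [",", ".", "!", "?", ";", ":", "=", "-", "+", "/", "(", ")", "$", "%", '"', "@"]

# Label order as described: A-Z, a-z, 0-9, then the symbols.
CLASSES = (
    list(string.ascii_uppercase)
    + list(string.ascii_lowercase)
    + list(string.digits)
    + SYMBOLS
)

SUBJECT_FEATURE = 10000  # feature number holding the subject ID, per the description


def parse_line(line):
    """Parse one SVMLight/LIBSVM line into (label, subject_id, pixels).

    pixels maps feature number -> value for every feature except the subject ID.
    """
    tokens = line.split()
    label = int(tokens[0])
    subject_id = None
    pixels = {}
    for token in tokens[1:]:
        index_text, value_text = token.split(":", 1)
        index, value = int(index_text), float(value_text)
        if index == SUBJECT_FEATURE:
            subject_id = int(value)
        else:
            pixels[index] = value
    return label, subject_id, pixels


# Made-up line for illustration only.
example = "26 17:0.5 4242:1 10000:3"
label, subject_id, pixels = parse_line(example)
character = CLASSES[label]
```

Keeping the pixels in a dictionary preserves the sparse representation. Whether pixel feature numbers start at 0 or 1 is not stated on the page, so the sketch leaves them exactly as they appear in the file.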


Previous usage

This dataset was used in the following article:

J. Weston, R. Collobert, F. Sinz, L. Bottou and V. Vapnik. Inference with the Universum. ICML, 2006.


Credits

This dataset was collected by: Fabian Sinz, Ronan Collobert, Jason Weston and Seyda Ertekin.
Thanks also to everyone who filled out the sheets for us!


©2006 NEC Laboratories America, Inc. All rights reserved.