List of datasets for machine-learning research


These datasets are used for machine-learning research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms, computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.

Image data

Datasets consisting primarily of images or videos for tasks such as object detection, facial recognition, and multi-label classification.

Facial recognition

In computer vision, face images have been used extensively to develop facial recognition systems, face detection methods, and many other applications that rely on images of faces.
| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created | Creator |
|---|---|---|---|---|---|---|---|
| FERET | 11338 images of 1199 individuals in different positions and at different times. | None. | 11,338 | Images | Classification, face recognition | 2003 | United States Department of Defense |
| Ryerson Audio-Visual Database of Emotional Speech and Song | 7,356 video and audio recordings of 24 professional actors. 8 emotions each at two intensities. | Files labelled with expression. Perceptual validation ratings provided by 319 raters. | 7,356 | Video, sound files | Classification, face recognition, voice recognition | 2018 | S.R. Livingstone and F.A. Russo |
| SCFace | Color images of faces at various angles. | Location of facial features extracted. Coordinates of features given. | 4,160 | Images, text | Classification, face recognition | 2011 | M. Grgic et al. |
| Yale Face Database | Faces of 15 individuals in 11 different expressions. | Labels of expressions. | 165 | Images | Face recognition | 1997 | J. Yang et al. |
| Cohn-Kanade AU-Coded Expression Database | Large database of images with labels for expressions. | Tracking of certain facial features. | 500+ sequences | Images, text | Facial expression analysis | 2000 | T. Kanade et al. |
| JAFFE Facial Expression Database | 213 images of 7 facial expressions posed by 10 Japanese female models. | Images are cropped to the facial region. Includes semantic ratings data on emotion labels. | 213 | Images, text | Facial expression cognition | 1998 | Lyons, Kamachi, Gyoba |
| FaceScrub | Images of public figures scrubbed from image searching. | Name and m/f annotation. | 107,818 | Images, text | Face recognition | 2014 | H. Ng et al. |
| BioID Face Database | Images of faces with eye positions marked. | Manually set eye positions. | 1521 | Images, text | Face recognition | 2001 | BioID |
| Skin Segmentation Dataset | Randomly sampled color values from face images. | B, G, R, values extracted. | 245,057 | Text | Segmentation, classification | 2012 | R. Bhatt. |
| Bosphorus | 3D Face image database. | 34 action units and 6 expressions labeled; 24 facial landmarks labeled. | 4652 | Images, text | Face recognition, classification | 2008 | A Savran et al. |
| UOY 3D-Face | neutral face, 5 expressions: anger, happiness, sadness, eyes closed, eyebrows raised. | labeling. | 5250 | Images, text | Face recognition, classification | 2004 | University of York |
| CASIA 3D Face Database | Expressions: Anger, smile, laugh, surprise, closed eyes. | None. | 4624 | Images, text | Face recognition, classification | 2007 | Institute of Automation, Chinese Academy of Sciences |
| CASIA NIR | Expressions: Anger Disgust Fear Happiness Sadness Surprise | None. | 480 | Annotated Visible Spectrum and Near Infrared Video captures at 25 frames per second | Face recognition, classification | 2011 | Zhao, G. et al. |
| BU-3DFE | neutral face, and 6 expressions: anger, happiness, sadness, surprise, disgust, fear. 3D images extracted. | None. | 2500 | Images, text | Facial expression recognition, classification | 2006 | Binghamton University |
| Face Recognition Grand Challenge Dataset | Up to 22 samples for each subject. Expressions: anger, happiness, sadness, surprise, disgust, puffy. 3D Data. | None. | 4007 | Images, text | Face recognition, classification | 2004 | National Institute of Standards and Technology |
| Gavabdb | Up to 61 samples for each subject. Expressions neutral face, smile, frontal accentuated laugh, frontal random gesture. 3D images. | None. | 549 | Images, text | Face recognition, classification | 2008 | King Juan Carlos University |
| 3D-RMA | Up to 100 subjects, expressions mostly neutral. Several poses as well. | None. | 9971 | Images, text | Face recognition, classification | 2004 | Royal Military Academy |
| SoF | 112 persons wear glasses under different illumination conditions. | A set of synthetic filters with different level of difficulty. | 42,592 | Images, Mat file | Gender classification, face detection, face recognition, age estimation, and glasses detection | 2017 | Afifi, M. et al. |
| IMDB-WIKI | IMDB and Wikipedia face images with gender and age labels. | - | 523,051 | Images | Gender classification, face detection, face recognition, age estimation | 2015 | R. Rothe, R. Timofte, L. V. Gool |

Action recognition

Object detection and recognition

Handwriting and character recognition

Aerial images

Other images

Text data

Datasets consisting primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.

Reviews

News articles

Messages

Twitter and tweets

Dialogues

Other text

Sound data

Datasets of sounds and sound features.

Speech

Music

Other sounds

Signal data

Datasets containing electric signal information that requires signal processing for further analysis.

Electrical

Motion-tracking

Other signals

Physical data

Datasets from physical systems.

High-energy physics

Systems

Astronomy

Earth science

Other physical

Biological data

Datasets from biological systems.

Human

Animal

Plant

Microbe

Drug Discovery

Anomaly data

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created | Creator |
|---|---|---|---|---|---|---|---|
| Numenta Anomaly Benchmark | Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted. | - | 50+ files | Comma separated values | Anomaly detection | 2016 | Numenta |
| On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study | Most data files are adapted from UCI Machine Learning Repository data, some are collected from the literature. | treated for missing values, numerical attributes only, different percentages of anomalies, labels | 1000+ files | ARFF | Anomaly detection | 2016 | Campos et al. |
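
As a rough illustration of what anomaly detection on ordered, timestamped, single-valued data can look like, the following sketch flags points that deviate strongly from a trailing window of earlier values. It is only a minimal example under stated assumptions: the column names `timestamp` and `value`, the sample data, the window size, and the threshold are all chosen for illustration, and the rolling z-score rule is a generic baseline, not the scoring method of any benchmark listed here.

```python
import csv
import io
import statistics

# Hypothetical sample data; the column names are an assumption for this sketch.
SAMPLE = """timestamp,value
2014-04-01 00:00:00,10.0
2014-04-01 00:05:00,10.5
2014-04-01 00:10:00,9.5
2014-04-01 00:15:00,10.2
2014-04-01 00:20:00,9.8
2014-04-01 00:25:00,10.1
2014-04-01 00:30:00,50.0
2014-04-01 00:35:00,10.0
"""


def read_series(text):
    """Parse CSV text into a list of (timestamp, value) pairs, in file order."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["timestamp"], float(row["value"])) for row in reader]


def flag_anomalies(rows, window=5, threshold=3.0):
    """Return timestamps whose value lies more than `threshold` standard
    deviations from the mean of the preceding `window` values."""
    flagged = []
    values = []
    for timestamp, value in rows:
        history = values[-window:]
        # Only score a point once a full window of earlier values exists.
        if len(history) == window:
            mean = statistics.fmean(history)
            std = statistics.pstdev(history)
            if std > 0 and abs(value - mean) > threshold * std:
                flagged.append(timestamp)
        values.append(value)
    return flagged


if __name__ == "__main__":
    print(flag_anomalies(read_series(SAMPLE)))
```

A trailing window is used so that each point is judged only against earlier observations, which matches the ordered nature of such data; real benchmark files would likely need a larger window and a tuned threshold.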

Question Answering data

This section includes datasets that deal with structured data.

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created | Creator |
|---|---|---|---|---|---|---|---|
| DBpedia Neural Question Answering Dataset | A large collection of question-to-SPARQL pairs specially designed for open-domain neural question answering over the DBpedia knowledge base. | This dataset contains a large collection of Open Neural SPARQL Templates and instances for training Neural SPARQL Machines; it was pre-processed by semi-automatic annotation tools as well as by three SPARQL experts. | 894,499 | Question-query pairs | Question Answering | 2018 | Hartmann, Soru, and Marx et al. |

Multivariate data

Datasets consisting of rows of observations and columns of attributes characterizing those observations. They are typically used for regression analysis or classification, but other types of algorithms can also be applied. This section includes datasets that do not fit in the above categories.
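
The row-and-column layout described above can be sketched in a few lines. The example below is only an illustration: the attribute names, values, and labels are hypothetical and do not come from any dataset in this list, and the one-nearest-neighbour rule is just one simple classification algorithm chosen to keep the sketch short.

```python
import math

# Hypothetical toy table: each row is one observation,
# each column is one attribute describing that observation.
COLUMNS = ["height_cm", "weight_kg"]
ROWS = [
    [150.0, 50.0],
    [152.0, 53.0],
    [180.0, 80.0],
    [183.0, 85.0],
]
# One class label per row (also hypothetical).
LABELS = ["small", "small", "large", "large"]


def nearest_neighbor(rows, labels, query):
    """Classify `query` with the label of the closest row (Euclidean distance)."""
    best_index = min(range(len(rows)), key=lambda i: math.dist(rows[i], query))
    return labels[best_index]


if __name__ == "__main__":
    print(nearest_neighbor(ROWS, LABELS, [178.0, 79.0]))
```

In practice the attributes would usually be rescaled first, since an unscaled Euclidean distance lets the attribute with the largest numeric range dominate.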

Financial

Weather

Census

Transit

Internet

Games

Other multivariate

Curated repositories of datasets

As datasets come in myriad formats and can sometimes be difficult to use, there has been considerable work put into curating and standardizing the format of datasets to make them easier to use for machine learning research.