List of datasets for machine-learning research

These datasets are used for machine-learning research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms, computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.

Image data

Datasets consisting primarily of images or videos for tasks such as object detection, facial recognition, and multi-label classification.

Facial recognition

In computer vision, face images have been used extensively to develop facial recognition systems, face detection methods, and many other applications that rely on images of faces.

Dataset name	Brief description	Preprocessing	Instances	Format	Default task	Created	Reference	Creator
FERET	11,338 images of 1,199 individuals in different positions and at different times.	None.	11,338	Images	Classification, face recognition	2003		United States Department of Defense
Ryerson Audio-Visual Database of Emotional Speech and Song	7,356 video and audio recordings of 24 professional actors. 8 emotions each at two intensities.	Files labelled with expression. Perceptual validation ratings provided by 319 raters.	7,356	Video, sound files	Classification, face recognition, voice recognition	2018		S.R. Livingstone and F.A. Russo
SCFace	Color images of faces at various angles.	Location of facial features extracted. Coordinates of features given.	4,160	Images, text	Classification, face recognition	2011		M. Grgic et al.
Yale Face Database	Faces of 15 individuals in 11 different expressions.	Labels of expressions.	165	Images	Face recognition	1997		J. Yang et al.
Cohn-Kanade AU-Coded Expression Database	Large database of images with labels for expressions.	Tracking of certain facial features.	500+ sequences	Images, text	Facial expression analysis	2000		T. Kanade et al.
JAFFE Facial Expression Database	213 images of 7 facial expressions posed by 10 Japanese female models.	Images are cropped to the facial region. Includes semantic ratings data on emotion labels.	213	Images, text	Facial expression cognition	1998		Lyons, Kamachi, Gyoba
FaceScrub	Images of public figures scraped from image search results.	Name and gender (m/f) annotation.	107,818	Images, text	Face recognition	2014		H. Ng et al.
BioID Face Database	Images of faces with eye positions marked.	Manually set eye positions.	1,521	Images, text	Face recognition	2001		BioID
Skin Segmentation Dataset	Randomly sampled color values from face images.	B, G, R values extracted.	245,057	Text	Segmentation, classification	2012		R. Bhatt
Bosphorus	3D face image database.	34 action units and 6 expressions labeled; 24 facial landmarks labeled.	4,652	Images, text	Face recognition, classification	2008		A. Savran et al.
UOY 3D-Face	Neutral face and 5 expressions: anger, happiness, sadness, eyes closed, eyebrows raised.	Labeling.	5,250	Images, text	Face recognition, classification	2004		University of York
CASIA 3D Face Database	Expressions: anger, smile, laugh, surprise, closed eyes.	None.	4,624	Images, text	Face recognition, classification	2007		Institute of Automation, Chinese Academy of Sciences
CASIA NIR	Expressions: anger, disgust, fear, happiness, sadness, surprise.	None.	480	Annotated visible-spectrum and near-infrared video captures at 25 frames per second	Face recognition, classification	2011		Zhao, G. et al.
BU-3DFE	Neutral face and 6 expressions: anger, happiness, sadness, surprise, disgust, fear. 3D images extracted.	None.	2,500	Images, text	Facial expression recognition, classification	2006		Binghamton University
Face Recognition Grand Challenge Dataset	Up to 22 samples for each subject. Expressions: anger, happiness, sadness, surprise, disgust, puffy. 3D data.	None.	4,007	Images, text	Face recognition, classification	2004		National Institute of Standards and Technology
Gavabdb	Up to 61 samples for each subject. Expressions: neutral face, smile, frontal accentuated laugh, frontal random gesture. 3D images.	None.	549	Images, text	Face recognition, classification	2008		King Juan Carlos University
3D-RMA	Up to 100 subjects, expressions mostly neutral. Several poses as well.	None.	9,971	Images, text	Face recognition, classification	2004		Royal Military Academy
SoF	Images of 112 people wearing glasses under different illumination conditions.	A set of synthetic filters with different levels of difficulty.	42,592	Images, Mat file	Gender classification, face detection, face recognition, age estimation, and glasses detection	2017		Afifi, M. et al.
IMDB-WIKI	IMDB and Wikipedia face images with gender and age labels.	-	523,051	Images	Gender classification, face detection, face recognition, age estimation	2015		R. Rothe, R. Timofte, L. V. Gool

Action recognition

Object detection and recognition

Handwriting and character recognition

Aerial images

Other images

Text data

Datasets consisting primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.

Reviews

News articles

Messages

Twitter and tweets

Dialogues

Other text

Sound data

Datasets of sounds and sound features.

Speech

Music

Other sounds

Signal data

Datasets containing electric signal information requiring some sort of signal processing for further analysis.

Electrical

Motion-tracking

Other signals

Physical data

Datasets from physical systems.

High-energy physics

Systems

Astronomy

Earth science

Other physical

Biological data

Datasets from biological systems.

Human

Animal

Plant

Microbe

Drug discovery

Anomaly data

Dataset name	Brief description	Preprocessing	Instances	Format	Default task	Created	Reference	Creator
Numenta Anomaly Benchmark	Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted.	-	50+ files	Comma-separated values	Anomaly detection	2016		Numenta
On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study	Most data files are adapted from UCI Machine Learning Repository data; some are collected from the literature.	Treated for missing values; numerical attributes only; different percentages of anomalies; labels.	1000+ files	ARFF	Anomaly detection	2016		Campos et al.
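
The Numenta Anomaly Benchmark entry above describes files of ordered, timestamped, single-valued metrics stored as comma-separated values. The following is a minimal sketch of reading such a series and flagging unusual points. The column names `timestamp` and `value`, the timestamp format, the inline sample data, and the z-score rule with its threshold are all assumptions made for illustration; they are not the benchmark's own scoring method.

```python
import csv
import io
import statistics
from datetime import datetime

# Inline sample standing in for a real file. The header names and the
# timestamp format are assumptions, and the numbers are made up.
SAMPLE = """timestamp,value
2014-04-01 00:00:00,20.0
2014-04-01 00:05:00,21.0
2014-04-01 00:10:00,19.0
2014-04-01 00:15:00,20.0
2014-04-01 00:20:00,95.0
2014-04-01 00:25:00,20.0
2014-04-01 00:30:00,21.0
"""


def read_series(handle):
    """Parse (timestamp, value) pairs from a CSV file-like object."""
    reader = csv.DictReader(handle)
    return [
        (datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S"),
         float(row["value"]))
        for row in reader
    ]


def flag_outliers(series, threshold=2.0):
    """Return points whose z-score magnitude exceeds the threshold.

    A plain global z-score is used here only as a simple stand-in for
    a real anomaly detector.
    """
    values = [value for _, value in series]
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # a constant series has no outliers under this rule
    return [(ts, value) for ts, value in series
            if abs(value - mean) / spread > threshold]


series = read_series(io.StringIO(SAMPLE))
outliers = flag_outliers(series)
for ts, value in outliers:
    print(ts.isoformat(), value)
```

To read a file on disk instead of the inline sample, pass an open file handle (for example `open(path, newline="")`) to `read_series`.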

Question answering data

This section includes datasets that deal with structured data.

Dataset name	Brief description	Preprocessing	Instances	Format	Default task	Created	Reference	Creator
DBpedia Neural Question Answering Dataset	A large collection of question-to-SPARQL pairs designed for open-domain neural question answering over the DBpedia knowledge base.	Contains a large collection of open neural SPARQL templates and instances for training Neural SPARQL Machines; pre-processed with semi-automatic annotation tools and by three SPARQL experts.	894,499	Question-query pairs	Question answering	2018		Hartmann, Soru, and Marx et al.

Multivariate data

Datasets consisting of rows of observations and columns of attributes characterizing those observations. They are typically used for regression analysis or classification, but other types of algorithms can also be used. This section includes datasets that do not fit in the above categories.

Financial

Weather

Census

Transit

Internet

Games

Other multivariate

Curated repositories of datasets

Because datasets come in myriad formats and can be difficult to use, considerable work has gone into curating them and standardizing their formats to make them easier to use in machine-learning research.

OpenML: Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms.
PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API.
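
As an illustration of the Python access mentioned for PMLB, the sketch below wraps the `fetch_data` function of the third-party `pmlb` package and adds a small, hypothetical helper (`class_balance`) for inspecting a classification target. It assumes `pmlb` is installed and that network access is available the first time a dataset is requested; the dataset name in the usage comment is only an example.

```python
from collections import Counter


def load_pmlb(name, cache_dir=None):
    """Fetch one PMLB dataset as a (features, target) pair of arrays.

    Requires the third-party `pmlb` package. The import is done lazily
    so that the helper below can be used without it.
    """
    from pmlb import fetch_data
    return fetch_data(name, return_X_y=True, local_cache_dir=cache_dir)


def class_balance(target):
    """Return (label, share of rows) pairs, most frequent label first."""
    counts = Counter(target)
    total = sum(counts.values())
    return [(label, count / total) for label, count in counts.most_common()]


# Example usage (needs pmlb installed and a network connection):
#   X, y = load_pmlb("mushroom")
#   print(X.shape, class_balance(y.tolist()))
```

Checking the class balance before benchmarking is a common first step, since accuracy on a heavily imbalanced dataset can look good even for a trivial classifier.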
