List of datasets for machine-learning research
These datasets are used for machine-learning research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms, computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.
Image data
Datasets consisting primarily of images or videos for tasks such as object detection, facial recognition, and multi-label classification.
Facial recognition
In computer vision, face images have been used extensively to develop facial recognition systems, face detection systems, and many other applications that rely on images of faces.
Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created | Reference | Creator |
FERET | 11,338 images of 1,199 individuals in different positions and at different times. | None. | 11,338 | Images | Classification, face recognition | 2003 | United States Department of Defense | |
Ryerson Audio-Visual Database of Emotional Speech and Song | 7,356 video and audio recordings of 24 professional actors. 8 emotions each at two intensities. | Files labelled with expression. Perceptual validation ratings provided by 319 raters. | 7,356 | Video, sound files | Classification, face recognition, voice recognition | 2018 | S.R. Livingstone and F.A. Russo | |
SCFace | Color images of faces at various angles. | Location of facial features extracted. Coordinates of features given. | 4,160 | Images, text | Classification, face recognition | 2011 | M. Grgic et al. | |
Yale Face Database | Faces of 15 individuals in 11 different expressions. | Labels of expressions. | 165 | Images | Face recognition | 1997 | J. Yang et al. | |
Cohn-Kanade AU-Coded Expression Database | Large database of images with labels for expressions. | Tracking of certain facial features. | 500+ sequences | Images, text | Facial expression analysis | 2000 | T. Kanade et al. | |
JAFFE Facial Expression Database | 213 images of 7 facial expressions posed by 10 Japanese female models. | Images are cropped to the facial region. Includes semantic ratings data on emotion labels. | 213 | Images, text | Facial expression cognition | 1998 | Lyons, Kamachi, Gyoba | |
FaceScrub | Images of public figures scrubbed from image searching. | Name and m/f annotation. | 107,818 | Images, text | Face recognition | 2014 | H. Ng et al. | |
BioID Face Database | Images of faces with eye positions marked. | Manually set eye positions. | 1521 | Images, text | Face recognition | 2001 | BioID | |
Skin Segmentation Dataset | Randomly sampled color values from face images. | B, G, R values extracted. | 245,057 | Text | Segmentation, classification | 2012 | R. Bhatt | |
Bosphorus | 3D Face image database. | 34 action units and 6 expressions labeled; 24 facial landmarks labeled. | 4652 | Images, text | Face recognition, classification | 2008 | A Savran et al. | |
UOY 3D-Face | Neutral face and 5 expressions: anger, happiness, sadness, eyes closed, eyebrows raised. | Labeling. | 5250 | Images, text | Face recognition, classification | 2004 | University of York | |
CASIA 3D Face Database | Expressions: Anger, smile, laugh, surprise, closed eyes. | None. | 4624 | Images, text | Face recognition, classification | 2007 | Institute of Automation, Chinese Academy of Sciences | |
CASIA NIR | Expressions: anger, disgust, fear, happiness, sadness, surprise. | None. | 480 | Annotated visible-spectrum and near-infrared video captures at 25 frames per second | Face recognition, classification | 2011 | Zhao, G. et al. | |
BU-3DFE | Neutral face and 6 expressions: anger, happiness, sadness, surprise, disgust, fear. 3D images extracted. | None. | 2500 | Images, text | Facial expression recognition, classification | 2006 | Binghamton University | |
Face Recognition Grand Challenge Dataset | Up to 22 samples for each subject. Expressions: anger, happiness, sadness, surprise, disgust, puffy. 3D Data. | None. | 4007 | Images, text | Face recognition, classification | 2004 | National Institute of Standards and Technology | |
Gavabdb | Up to 61 samples for each subject. Expressions: neutral face, smile, frontal accentuated laugh, frontal random gesture. 3D images. | None. | 549 | Images, text | Face recognition, classification | 2008 | King Juan Carlos University | |
3D-RMA | Up to 100 subjects, expressions mostly neutral. Several poses as well. | None. | 9971 | Images, text | Face recognition, classification | 2004 | Royal Military Academy | |
SoF | Images of 112 persons wearing glasses under different illumination conditions. | A set of synthetic filters with different levels of difficulty. | 42,592 | Images, Mat file | Gender classification, face detection, face recognition, age estimation, and glasses detection | 2017 | Afifi, M. et al. | |
IMDB-WIKI | IMDB and Wikipedia face images with gender and age labels. | - | 523,051 | Images | Gender classification, face detection, face recognition, age estimation | 2015 | R. Rothe, R. Timofte, L. Van Gool |
Action recognition
Object detection and recognition
Handwriting and character recognition
Aerial images
Other images
Text data
Datasets consisting primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.
Reviews
News articles
Messages
Twitter and tweets
Dialogues
Other text
Sound data
Datasets of sounds and sound features.
Speech
Music
Other sounds
Signal data
Datasets containing electrical signal information that requires some form of signal processing for further analysis.
Electrical
Motion-tracking
Other signals
Physical data
Datasets from physical systems.
High-energy physics
Systems
Astronomy
Earth science
Other physical
Biological data
Datasets from biological systems.
Human
Animal
Plant
Microbe
Drug Discovery
Anomaly data
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created | Reference | Creator |
Numenta Anomaly Benchmark | Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted. | - | 50+ files | Comma separated values | Anomaly detection | 2016 | Numenta | |
On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study | Most data files are adapted from UCI Machine Learning Repository data; some are collected from the literature. | Treated for missing values; numerical attributes only; different percentages of anomalies; labels provided. | 1000+ files | ARFF | Anomaly detection | 2016 | Campos et al. |
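As a rough illustration of working with ordered, timestamped, single-valued metrics such as those described for the Numenta Anomaly Benchmark, the sketch below parses a small CSV and flags values far from the mean. The column names `timestamp` and `value`, the timestamp format, the sample rows, and the helper names `load_series` and `flag_outliers` are all assumptions made for this example. The global z-score rule is a deliberately naive stand-in and is not the scoring method used by any benchmark listed here.

```python
import csv
import io
import statistics
from datetime import datetime

# Hypothetical sample in an assumed layout: a header row, then one
# timestamped, single-valued observation per line, in time order.
SAMPLE_CSV = """timestamp,value
2014-04-01 00:00:00,10.0
2014-04-01 00:05:00,11.0
2014-04-01 00:10:00,10.5
2014-04-01 00:15:00,9.5
2014-04-01 00:20:00,10.0
2014-04-01 00:25:00,50.0
2014-04-01 00:30:00,10.2
2014-04-01 00:35:00,9.8
"""


def load_series(text):
    """Parse CSV text into a list of (datetime, float) pairs, in file order."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S"),
         float(row["value"]))
        for row in reader
    ]


def flag_outliers(series, threshold=2.0):
    """Return indices of values more than `threshold` population standard
    deviations away from the mean of the whole series."""
    values = [value for _, value in series]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # a constant series has no outliers under this rule
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


series = load_series(SAMPLE_CSV)
flagged = flag_outliers(series)
print([series[i][0].isoformat() for i in flagged])
```

A global z-score ignores trend and seasonality, so for real time series a rolling window or a dedicated detector would usually be a better starting point.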
Question Answering data
This section includes datasets that deal with structured data.
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created | Reference | Creator |
DBpedia Neural Question Answering Dataset | A large collection of question-to-SPARQL pairs specially designed for open-domain neural question answering over the DBpedia knowledge base. | This dataset contains a large collection of Open Neural SPARQL Templates and instances for training Neural SPARQL Machines; it was pre-processed by semi-automatic annotation tools as well as by three SPARQL experts. | 894,499 | Question-query pairs | Question Answering | 2018 | Hartmann, Soru, and Marx et al. |
Multivariate data
Datasets consisting of rows of observations and columns of attributes characterizing those observations. Typically used for regression analysis or classification, but other types of algorithms can also be used. This section includes datasets that do not fit in the above categories.
Financial
Weather
Census
Transit
Internet
Games
Other multivariate
Curated repositories of datasets
As datasets come in myriad formats and can sometimes be difficult to use, considerable work has been put into curating and standardizing the format of datasets to make them easier to use for machine learning research.
- OpenML: Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms.
- PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API.