A dataset (or data set) is a collection of data used to train, validate, and test a machine learning model.
Machine learning typically works with three datasets:
- **Training dataset**: The actual dataset that we use to train the model. The model learns weights and parameters from this data.
- **Validation dataset**: The validation dataset is used to evaluate a given model during training. It helps machine learning engineers fine-tune the hyperparameters during model development. The model doesn't learn from the validation dataset, and the validation dataset is optional.
- **Test dataset**: The test dataset provides the gold standard for evaluating the model. It is used only once the model is completely trained. The test dataset should more accurately estimate how the model will perform on new data.
See Jason Brownlee’s article for more detail.
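As a minimal sketch of how such splits can be produced in DJL, the example below calls `RandomAccessDataset.randomSplit` on a synthetic `ArrayDataset`; the 80/10/10 ratio, batch size, and random stand-in data are illustrative assumptions, not part of this module.

```java
import java.io.IOException;

import ai.djl.ndarray.NDManager;
import ai.djl.ndarray.types.Shape;
import ai.djl.training.dataset.ArrayDataset;
import ai.djl.training.dataset.RandomAccessDataset;
import ai.djl.translate.TranslateException;

public class DatasetSplitExample {

    public static void main(String[] args) throws IOException, TranslateException {
        try (NDManager manager = NDManager.newBaseManager()) {
            // 1000 synthetic samples with 10 features each; stand-in data for the sketch.
            ArrayDataset dataset = new ArrayDataset.Builder()
                    .setData(manager.randomUniform(0f, 1f, new Shape(1000, 10)))
                    .optLabels(manager.randomUniform(0f, 1f, new Shape(1000, 1)))
                    .setSampling(32, true) // batch size 32, shuffled
                    .build();

            // Split 80/10/10 into training, validation, and test sets.
            RandomAccessDataset[] splits = dataset.randomSplit(8, 1, 1);
            System.out.printf("train=%d validation=%d test=%d%n",
                    splits[0].size(), splits[1].size(), splits[2].size());
        }
    }
}
```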
DJL provides a number of built-in basic and standard datasets for training deep learning models. This module contains the following datasets:
- MNIST - A small and fast handwritten digits dataset
- Fashion MNIST - A small and fast clothing type detection dataset
- CIFAR10 - A dataset consisting of 60,000 32x32 color images in 10 classes
- ImageNet - An image database organized according to the WordNet hierarchy
  - Note: You have to manually download the ImageNet dataset due to licensing requirements.
- Pikachu - 1000 Pikachu images of different angles and sizes created using an open source 3D Pikachu model
- Banana Detection - A synthetic dataset for testing single object detection
- Captcha - A dataset for a grayscale 6-digit CAPTCHA task
- Coco - A large-scale object detection, segmentation, and captioning dataset that contains 1.5 million object instances
  - Note: You have to manually add the `com.twelvemonkeys.imageio:imageio-jpeg:3.11.0` dependency to your project.
- AmazonReview - A sentiment analysis dataset of Amazon Reviews with their ratings
- Stanford Movie Review - A sentiment analysis dataset of movie reviews and sentiments sourced from IMDB
- GoEmotions - A dataset classifying 58k curated Reddit comments into either 27 emotion categories or neutral
- Penn Treebank Text - The text (not POS tags) from the Penn Treebank, a collection of Wall Street Journal stories
- WikiText2 - A collection of over 100 million tokens extracted from good and featured articles on Wikipedia
- Stanford Question Answering Dataset (SQuAD) - A reading comprehension dataset with text from Wikipedia articles
- Tatoeba English French Dataset - An English-French translation dataset from the Tatoeba Project
- Airfoil Self-Noise - A 6-feature dataset from NASA tests of airfoils
- Ames House Prices - An 80-feature dataset to predict house prices
- Movielens 100k - A 6-feature dataset of movie ratings on 1682 movies from 943 users
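As a brief sketch of how these built-in datasets are consumed, the example below loads MNIST through its builder and iterates over the batches; the batch size, shuffle flag, and TRAIN usage are illustrative choices, and the package paths assume a recent DJL version.

```java
import java.io.IOException;

import ai.djl.basicdataset.cv.classification.Mnist;
import ai.djl.ndarray.NDManager;
import ai.djl.training.dataset.Batch;
import ai.djl.training.dataset.Dataset;
import ai.djl.training.util.ProgressBar;
import ai.djl.translate.TranslateException;

public class MnistExample {

    public static void main(String[] args) throws IOException, TranslateException {
        // Build the training split; batch size 32 with shuffling is an illustrative choice.
        Mnist mnist = Mnist.builder()
                .optUsage(Dataset.Usage.TRAIN)
                .setSampling(32, true)
                .build();
        mnist.prepare(new ProgressBar()); // downloads and caches the data on first use

        try (NDManager manager = NDManager.newBaseManager()) {
            for (Batch batch : mnist.getData(manager)) {
                // batch.getData() holds the images, batch.getLabels() the digit labels
                batch.close();
            }
        }
    }
}
```

Most of the datasets above follow the same builder, prepare, and iterate pattern.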