Dataset

A dataset (or data set) is a collection of data that is used to train and evaluate a machine learning model.

Machine learning typically works with three datasets:

  • Training dataset

    The dataset used to train the model. The model learns its parameters (such as weights) from this data.

  • Validation dataset

    The validation dataset is used to evaluate a model during the training process. It helps machine learning engineers tune hyperparameters during model development. The model doesn't learn from the validation dataset, and the validation dataset is optional.

  • Test dataset

    The test dataset provides the gold standard used to evaluate the model. It is only used once the model is completely trained. Because the model has not seen this data during training or tuning, the test dataset gives a more accurate estimate of how the model will perform on new data.

See Jason Brownlee’s article for more detail.
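The three-way split described above can be sketched in plain Java. This is a minimal illustration and not DJL's API: the class `DatasetSplit`, its `split` method, and the 80/10/10 fractions are all hypothetical choices made for this example. It shuffles the sample indices once with a fixed seed and cuts them into disjoint training, validation, and test parts.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper (not part of DJL): splits sample indices into three disjoint subsets. */
public class DatasetSplit {

    /**
     * Shuffles the indices 0..n-1 and returns three lists:
     * [0] training indices, [1] validation indices, [2] test indices.
     * Whatever is left after the training and validation parts goes to the test part.
     */
    public static List<List<Integer>> split(
            int n, double trainFraction, double validationFraction, long seed) {
        if (n < 0 || trainFraction < 0 || validationFraction < 0
                || trainFraction + validationFraction > 1.0) {
            throw new IllegalArgumentException("invalid size or fractions");
        }

        List<Integer> indices = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        // A fixed seed makes the split reproducible between runs.
        Collections.shuffle(indices, new Random(seed));

        int trainEnd = Math.min(n, (int) Math.floor(n * trainFraction));
        int validationEnd = Math.min(n, trainEnd + (int) Math.floor(n * validationFraction));

        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(indices.subList(0, trainEnd)));             // training
        parts.add(new ArrayList<>(indices.subList(trainEnd, validationEnd))); // validation
        parts.add(new ArrayList<>(indices.subList(validationEnd, n)));        // test
        return parts;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = split(100, 0.8, 0.1, 42L);
        System.out.println("train=" + parts.get(0).size()
                + " validation=" + parts.get(1).size()
                + " test=" + parts.get(2).size());
    }
}
```

Keeping the test indices separate from the start means nothing used for training or hyperparameter tuning can leak into the final evaluation.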

DJL provides a number of built-in basic and standard datasets. These datasets are used to train deep learning models. This module contains the following datasets:

CV

Image Classification

  • MNIST - A small and fast handwritten digits dataset
  • Fashion MNIST - A small and fast clothing type detection dataset
  • CIFAR10 - A dataset consisting of 60,000 32x32 color images in 10 classes
  • ImageNet - An image database organized according to the WordNet hierarchy

    Note: You have to manually download the ImageNet dataset due to licensing requirements.

Object Detection

  • Pikachu - 1000 Pikachu images of different angles and sizes created using an open source 3D Pikachu model
  • Banana Detection - A small single-object detection dataset intended for testing

Other CV

  • Captcha - A dataset for a grayscale 6-digit CAPTCHA task
  • Coco - A large-scale object detection, segmentation, and captioning dataset that contains 1.5 million object instances
    • You have to manually add the com.twelvemonkeys.imageio:imageio-jpeg:3.11.0 dependency to your project

NLP

Text Classification and Sentiment Analysis

  • AmazonReview - A sentiment analysis dataset of Amazon Reviews with their ratings
  • Stanford Movie Review - A sentiment analysis dataset of movie reviews and sentiments sourced from IMDB
  • GoEmotions - A dataset of 58k curated Reddit comments, each labeled with one or more of 27 emotion categories or neutral

Unlabeled Text

  • Penn Treebank Text - The text (not POS tags) from the Penn Treebank, a collection of Wall Street Journal stories
  • WikiText2 - The smaller (roughly 2 million token) subset of WikiText, a collection of over 100 million tokens extracted from Good and Featured articles on Wikipedia

Other NLP

Tabular

Time Series