Roadmap

This document lists the general directions that the core team would like to see developed in PyTorch-Ignite.

We use GitHub Projects to define our different goals: releases, particular milestones, etc.

Principal goals

continue maintaining high-quality, well-tested and documented modules.
provide distributed framework support via ignite.distributed: XLA (e.g. TPU), Horovod
provide a new higher-level API, based on Engine and shipped as a contrib module, that simplifies usage while keeping flexibility
provide helpers for data management via ignite.data: sampling, multi-dataloaders
provide more integrations with other tools to simplify end-to-end Machine/Deep Learning applications.
visibility and communications

Codebase maintenance

add typing to the whole package
adapt the code and add mypy check
merge the contrib module into the main library?

Pre-built Docker images

Provide helper Docker images to quick-start a task
https://hub.docker.com/orgs/pytorchignite

Distributed framework support

XLA devices support via pytorch/xla
Horovod
Explore DDP + RPC
Better support for different types of parallelism: data, model, pipeline.

Metrics

Make all metrics work in distributed configurations
- configurable distributed metrics reduce/gather methods
Minor improvements:
- better support of sklearn metrics
- Classification metrics with micro/macro options
Metrics for NLP: ROUGE, BLEU, METEOR, PPL
Metrics for GANs: FID, PPL (#998)

See also related GSoC 2021 project idea description

Higher-level API

push-button contrib trainers with AMP, distributed etc
automatic batch size via toma

See also related GSoC 2021 project idea description

Refactor Engine

The Engine in 0.4.x contains several major bugs related to how event triggering and counting are implemented. In particular, event filtering requires state and the corresponding attributes to be available, which is not a clean design. Solving the open issues labelled "module: engine" (https://github.com/pytorch/ignite/issues?q=is%3Aissue+is%3Aopen+label%3A%22module%3A+engine%22) requires a major Engine redesign that keeps as much backward compatibility as possible.

`Engine` derived from `EventsDriven`

The idea is to change Engine(Serializable) into Engine(Serializable, EventsDriven), where EventsDriven is a class responsible for event registration, triggering, etc. Engine then keeps only the logic for registering the necessary events and for running its two loops (over epochs and over iterations).
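The split described above can be illustrated with a minimal sketch. This is not the actual Ignite implementation: the method names (register_events, add_event_handler, fire_event, event_count), the string event names and the SketchEngine class are all assumptions made for illustration, and Serializable is left out to keep the example short.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Iterable, List


class EventsDriven:
    """Hypothetical base class: owns event registration, firing and counting."""

    def __init__(self) -> None:
        self._allowed: List[str] = []
        self._handlers: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)
        self._counts: DefaultDict[str, int] = defaultdict(int)

    def register_events(self, *names: str) -> None:
        self._allowed.extend(names)

    def add_event_handler(self, name: str, handler: Callable[[Any], None]) -> None:
        if name not in self._allowed:
            raise ValueError(f"Unknown event: {name}")
        self._handlers[name].append(handler)

    def fire_event(self, name: str) -> None:
        # Counting lives here, so it does not depend on the engine's state.
        self._counts[name] += 1
        for handler in self._handlers[name]:
            handler(self)

    def event_count(self, name: str) -> int:
        return self._counts[name]


class SketchEngine(EventsDriven):
    """Hypothetical engine: only registers its events and runs the two loops."""

    def __init__(self, process_function: Callable[[Any, Any], Any]) -> None:
        super().__init__()
        self.process_function = process_function
        self.output: Any = None
        self.register_events("EPOCH_STARTED", "ITERATION_COMPLETED", "EPOCH_COMPLETED")

    def run(self, data: Iterable[Any], max_epochs: int = 1) -> None:
        for _ in range(max_epochs):          # outer loop over epochs
            self.fire_event("EPOCH_STARTED")
            for batch in data:               # inner loop over iterations
                self.output = self.process_function(self, batch)
                self.fire_event("ITERATION_COMPLETED")
            self.fire_event("EPOCH_COMPLETED")


seen = []
engine = SketchEngine(lambda e, batch: batch + 1)
engine.add_event_handler("ITERATION_COMPLETED", lambda e: seen.append(e.output))
engine.run([1, 2, 3], max_epochs=2)
```

In this sketch the event bookkeeping is fully owned by the base class, so the engine body contains nothing but the loops and the calls to fire_event.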

`Engine.run_one_epoch` as a public method

Exposing run_one_epoch publicly would help users combine their own outer loops with the Engine's loop. Required here:

The tricky part is resuming from the stopped iteration when the epoch length differs from the data size or when the data is an iterable.
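As a rough illustration of how a public run_one_epoch could be combined with a user-defined outer loop, here is a toy sketch. LoopEngine and its attributes are hypothetical and do not reflect the real Engine; the sketch assumes a restartable iterable (such as a list or a DataLoader) rather than a one-shot iterator, and it handles an epoch length that differs from the data size by keeping the data iterator alive between epochs.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional


class LoopEngine:
    """Hypothetical engine exposing a public run_one_epoch method."""

    def __init__(
        self,
        process_function: Callable[[Any], Any],
        data: Iterable[Any],
        epoch_length: int,
    ) -> None:
        self.process_function = process_function
        self.data = data
        self.epoch_length = epoch_length
        self.epoch = 0
        self.iteration = 0
        self._data_iter: Optional[Iterator[Any]] = None

    def _next_batch(self) -> Any:
        # The iterator persists across epochs, so an epoch can start
        # where the previous one stopped in the data.
        if self._data_iter is None:
            self._data_iter = iter(self.data)
        try:
            return next(self._data_iter)
        except StopIteration:
            # Data exhausted mid-epoch: restart it (assumes a restartable iterable).
            self._data_iter = iter(self.data)
            return next(self._data_iter)

    def run_one_epoch(self) -> List[Any]:
        outputs = []
        for _ in range(self.epoch_length):
            outputs.append(self.process_function(self._next_batch()))
            self.iteration += 1
        self.epoch += 1
        return outputs


# Custom outer loop driven by the user instead of the engine.
engine = LoopEngine(lambda batch: batch * 2, data=[1, 2, 3, 4, 5], epoch_length=2)
history = []
for _ in range(3):
    history.append(engine.run_one_epoch())
```

With five batches and an epoch length of two, the third epoch consumes the last batch and then wraps around to the first one, which is the kind of boundary case the real implementation would have to define precisely.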

Run/Resume logic improvements

Details

Currently, the engine's behavior is somewhat unclear about when it restarts from the beginning and when it continues.

Currently

# (re)start from 0 to 5
engine.run(data, max_epochs=5) -> Engine run starting with max_epochs=5 => state.epoch=5

# continue from 5 to 7
engine.run(data, max_epochs=7) -> Engine run resuming from iteration 50, epoch 5 until 7 epochs => state.epoch=7

# error
engine.run(data, max_epochs=4) -> ValueError: Argument max_epochs should be larger than the start epoch

# restart from 0 to 7 (state.epoch == max_epochs == 7; a restart is expected here, as it matches the usual evaluator.run(data) call without any other instructions)
engine.run(data, max_epochs=7) -> Engine run starting with max_epochs=7 => state.epoch=7

# forced restart from 0 to 5
engine.state.max_epochs = None
engine.run(data, max_epochs=5) -> Engine run starting with max_epochs=5 => state.epoch=5

# forced restart from 0 to 9, instead of continue from state.epoch=7
engine.state.max_epochs = None
engine.run(data, max_epochs=9) -> Engine run starting with max_epochs=9 => state.epoch=9

The proposal is to slightly change two things: the "error" case and the ugly engine.state.max_epochs = None workaround.

Proposed API

# SAME. (re)start from 0 to 5
engine.run(data, max_epochs=5) -> Engine run starting with max_epochs=5 => state.epoch=5

# SAME. continue from 5 to 7
engine.run(data, max_epochs=7) -> Engine run resuming from iteration 50, epoch 5 until 7 epochs => state.epoch=7

# As max_epochs=4 <= state.epoch=7 => restart
engine.run(data, max_epochs=4) -> Engine run starting with max_epochs=4 => state.epoch=4

# restart from 0 to 4
engine.run(data, max_epochs=4) -> Engine run starting with max_epochs=4 => state.epoch=4

# Now (not forced) restart from 0 to 3 (as max_epochs=3 <= state.epoch=4 => restart)
engine.run(data, max_epochs=3) -> Engine run starting with max_epochs=3 => state.epoch=3

# SOMETHING TO CHANGE HERE. Forced restart from 0 to 9, instead of continue from state.epoch=3
engine.state.max_epochs = None  # maybe, engine.reset() -> state.epoch=state.iteration=0,state.max_epochs=state.max_iters=None
engine.run(data, max_epochs=9) -> Engine run starting with max_epochs=9 => state.epoch=9

# In case of max_iters, we'll have to do:
engine.run(data, max_iters=100) -> Engine run starting with max_iters=100 => state.iteration=100
engine.state.max_iters = None
engine.run(data, max_iters=100) -> Engine run starting with max_iters=100 => state.iteration=100
# So there is no uniform API to restart engine...
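The proposed rules for max_epochs can be modelled with a small toy class. This is only a sketch of the decision logic listed above, not Ignite code: ResumeEngine, ToyState and the returned "starting"/"resuming" strings are hypothetical, reset() follows the engine.reset() idea floated in the comment above, and max_iters is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyState:
    epoch: int = 0
    iteration: int = 0
    max_epochs: Optional[int] = None


class ResumeEngine:
    """Hypothetical model of the proposed run/resume decision logic."""

    def __init__(self, epoch_length: int) -> None:
        self.epoch_length = epoch_length
        self.state = ToyState()

    def reset(self) -> None:
        # Assumed behavior of the suggested engine.reset():
        # epoch = iteration = 0 and max_epochs = None.
        self.state = ToyState()

    def run(self, max_epochs: int) -> str:
        state = self.state
        if state.max_epochs is None or max_epochs <= state.epoch:
            # Fresh engine, or requested max_epochs already reached: restart from 0.
            state.epoch = 0
            state.iteration = 0
            mode = "starting"
        else:
            # max_epochs > state.epoch: continue from the current epoch.
            mode = "resuming"
        state.max_epochs = max_epochs
        while state.epoch < state.max_epochs:
            state.iteration += self.epoch_length  # stands in for one epoch of work
            state.epoch += 1
        return mode


engine = ResumeEngine(epoch_length=10)
modes = [engine.run(5), engine.run(7), engine.run(4), engine.run(4), engine.run(3)]
engine.reset()
modes.append(engine.run(9))
```

Without the reset() call, the last run would resume from epoch 3 because 9 > 3, which is why an explicit and uniform way to force a restart is still needed.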

Fix issue #1521

Pipeline Parallelism support

Helper on data management

better and simpler coverage of multi-dataloader use cases, e.g. GAN, SSL, etc.

Integrations

Verify compatibility (i.e. that ignite is not blocking) when writing applications for Federated Learning
Verify compatibility (i.e. that ignite is not blocking) when writing applications with the Distributed RPC framework

Communications

More applications and success stories with PyTorch-Ignite
Showcase via ClearML Ignite server:
- more experiments with Ignite from our users

PyTorch-Ignite presented to you with love by PyTorch community
