
Mixture of Nested Experts (MoNE) - PyTorch Implementation

This repository contains a PyTorch implementation of the Mixture of Nested Experts (MoNE) framework, as described in the paper "Mixture of Nested Experts: Adaptive Processing of Visual Tokens". MoNE processes visual tokens efficiently by dynamically allocating computational resources according to token importance, reducing inference cost without sacrificing model accuracy.

Features

  • ExpertPreferredRouter: dynamic routing based on token importance, directing each token to an appropriate expert. Implemented in mone_pytorch/routing.py
  • Nested linear projections: NestedLinearExpand and NestedLinearContract implement the nested linear projections used for flexible token processing. Implemented in mone_pytorch/layers.py

Usage

1. ExpertPreferredRouter

The ExpertPreferredRouter assigns tokens to nested experts based on importance. Located in mone_pytorch/routing.py, this router is the core of MoNE’s dynamic token routing.
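Below is a minimal sketch of how the router might be invoked. The constructor arguments and the shape of the returned assignments are assumptions for illustration, not the exact API; consult mone_pytorch/routing.py for the actual signature.

import torch

from mone_pytorch.routing import ExpertPreferredRouter

# Hypothetical usage: constructor arguments and return value are assumed.
router = ExpertPreferredRouter(
    dim=128,                                # token embedding dimension
    capacity_distribution=[0.2, 0.3, 0.5],  # fraction of tokens per expert
)

tokens = torch.randn(1, 196, 128)    # (batch, num_tokens, dim)
expert_indices = router(tokens)      # assumed per-token expert assignment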

2. NestedLinearExpand and NestedLinearContract

These classes manage nested linear projections to process tokens at varying computational levels. You can find these implementations in mone_pytorch/layers.py.
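The sketch below illustrates the intended pattern: each expert uses a nested slice of a shared weight matrix, so cheaper experts touch fewer parameters. The argument names and the expert-index input are assumptions; see mone_pytorch/layers.py for the actual interface.

import torch

from mone_pytorch.layers import NestedLinearExpand, NestedLinearContract

# Hypothetical usage: argument names and the expert-index input are assumed.
expand = NestedLinearExpand(in_features=128, out_features=512, num_experts=3)
contract = NestedLinearContract(in_features=512, out_features=128, num_experts=3)

tokens = torch.randn(1, 196, 128)                # (batch, num_tokens, dim)
expert_indices = torch.randint(0, 3, (1, 196))   # per-token expert assignment

hidden = expand(tokens, expert_indices)     # expert k reads a nested slice of the weights
output = contract(hidden, expert_indices)   # projects back down through a nested slice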

Example

Below is a minimal example demonstrating how to initialize and use the MoNE framework:

from mone_pytorch.routing import compute_capacity_distribution
from mone_pytorch.block import NestedBlock

# Define capacity distribution parameters
e_c = 0.6  # Effective capacity (between 0 and 1)
E = 3  # Number of experts
delta = 2  # Incentive parameter (>1)
beta = 10  # Entropy regularization parameter (>0)

# Compute capacity distribution
capacity_distribution = compute_capacity_distribution(e_c, E, delta, beta)

# Define router and layers as per model architecture
block = NestedBlock(
    dim=128,
    num_heads=8,
    num_experts=E,
    capacity_distribution=capacity_distribution,
)

...
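Continuing the example, a sketch of a forward pass through the block; the exact call signature of NestedBlock is an assumption here.

import torch

# Hypothetical forward pass; NestedBlock's call signature may differ.
images_as_tokens = torch.randn(8, 196, 128)   # (batch, num_tokens, dim)
output = block(images_as_tokens)              # tokens routed through nested experts
print(output.shape)                           # expected: torch.Size([8, 196, 128])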

References

For a detailed overview of MoNE, see the paper "Mixture of Nested Experts: Adaptive Processing of Visual Tokens" by Jain et al.


To Do

  • Build MoNE nested linear layers
  • Build efficient Triton kernels for nested linear layers
  • Create transformer block using MoNE components
  • Create training code to reproduce MoNE paper results (ImageNet-21k classification)
  • Add example notebooks

Acknowledgements

  • xformers for the memory-efficient attention implementation
  • dinov2 for the implementation of the DINOv2 model
