This repository contains a PyTorch implementation of the Mixture of Nested Experts (MoNE) framework, as described in the paper Mixture of Nested Experts: Adaptive Processing of Visual Tokens. MoNE is designed for efficient visual token processing by dynamically allocating computational resources, reducing inference costs without sacrificing model accuracy.
- **ExpertPreferredRouter**: Dynamic routing based on token importance, directing tokens to the appropriate experts. Found in `mone_pytorch/routing.py`.
- **Nested Linear Projections**: Includes `NestedLinearExpand` and `NestedLinearContract`, which implement nested linear projections for flexible token processing. Located in `mone_pytorch/layers.py`.
The `ExpertPreferredRouter` assigns tokens to nested experts based on their importance. Located in `mone_pytorch/routing.py`, this router is the core of MoNE's dynamic token routing.
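The idea behind expert-preferred routing can be illustrated with a small standalone sketch (this is not the repository's implementation; the function name, score format, and per-expert capacity list are assumptions): tokens are ranked by an importance score, and each expert, from the largest nested model down to the smallest, claims its capacity share of the highest-scoring unassigned tokens.

```python
def expert_preferred_routing(importance, capacities):
    """Toy expert-preferred assignment of tokens to nested experts.

    importance: per-token importance scores.
    capacities: number of token slots per expert, ordered from the
        largest (full-width) expert to the smallest nested expert.
    Returns a list mapping each token index to an expert index.
    """
    # Rank token indices by descending importance.
    order = sorted(range(len(importance)), key=lambda i: -importance[i])
    assignment = [None] * len(importance)
    cursor = 0
    # Each expert, in preference order, takes the next block of tokens.
    for expert, cap in enumerate(capacities):
        for i in order[cursor:cursor + cap]:
            assignment[i] = expert
        cursor += cap
    return assignment

scores = [0.9, 0.1, 0.4, 0.8, 0.3, 0.6]
print(expert_preferred_routing(scores, [2, 2, 2]))  # → [0, 2, 1, 0, 2, 1]
```

The most important tokens (indices 0 and 3) land in the largest expert, while the least important ones fall through to the smallest nested expert.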
The `NestedLinearExpand` and `NestedLinearContract` classes manage nested linear projections, processing tokens at varying computational levels. You can find these implementations in `mone_pytorch/layers.py`.
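The nested-projection idea can be sketched in plain Python (this is not the repository's code; the function names, the slicing convention, and the zero-padding behavior are assumptions based on the nested-experts idea): all experts share one full-width weight matrix, and a token routed to a smaller expert simply uses a leading slice of it.

```python
def nested_linear_contract(x, weight, d_out):
    """Project a full-width token down to a nested width d_out.

    Only the first d_out output columns of the shared weight matrix
    are used, so smaller experts reuse a slice of the same weights.
    """
    return [sum(x[i] * weight[i][j] for i in range(len(x)))
            for j in range(d_out)]

def nested_linear_expand(x, weight, d_in):
    """Project a nested token of width d_in back to the full width.

    Only the first d_in rows of the shared weight matrix are used;
    the token is treated as zero beyond its nested width.
    """
    d_full = len(weight[0])
    return [sum(x[i] * weight[i][j] for i in range(d_in))
            for j in range(d_full)]

# Identity weights, purely for illustration.
W = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
print(nested_linear_contract([1, 2, 3, 4], W, 2))  # → [1, 2]
print(nested_linear_expand([1, 2], W, 2))          # → [1, 2, 0, 0]
```

Because every expert reads from the same weight matrix, no extra parameters are introduced by adding more nesting levels; experts differ only in how much of the matrix they touch.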
Below is a minimal example to demonstrate initializing and using the MoNE framework:
```python
from mone_pytorch.routing import compute_capacity_distribution
from mone_pytorch.block import NestedBlock

# Define capacity distribution parameters
e_c = 0.6   # Effective capacity (between 0 and 1)
E = 3       # Number of experts
delta = 2   # Incentive parameter (>1)
beta = 10   # Entropy regularization parameter (>0)

# Compute capacity distribution
capacity_distribution = compute_capacity_distribution(e_c, E, delta, beta)

# Define router and layers as per model architecture
block = NestedBlock(
    dim=128,
    num_heads=8,
    num_experts=E,
    capacity_distribution=capacity_distribution,
)
...
```
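To make the shape of `capacity_distribution` concrete, here is an illustrative stand-in with the same interface (this toy is NOT the paper's entropy-regularized optimization and not the repository's `compute_capacity_distribution`; it only shows the kind of output to expect, namely `E` nonnegative fractions that sum to 1):

```python
import math

def toy_capacity_distribution(e_c, E, delta, beta):
    """Toy stand-in for compute_capacity_distribution.

    Builds a softmax over expert indices in which delta discounts
    smaller experts, e_c scales how much mass favors the full model,
    and beta controls how peaked the distribution is. Returns E
    fractions that sum to 1, one capacity share per nested expert.
    """
    logits = [beta * e_c * (delta ** -k) for k in range(E)]
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]

dist = toy_capacity_distribution(0.6, 3, 2, 10)
print(dist, sum(dist))  # three positive fractions summing to 1.0
```

Whatever produces it, the distribution tells the router what fraction of tokens each nested expert may receive.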
For a detailed overview of MoNE, please refer to the paper: Mixture of Nested Experts: Adaptive Processing of Visual Tokens by Jain et al.
- Build MoNE nested linear layers
- Build efficient Triton kernels for nested linear layers
- Create transformer block using MoNE components
- Create training code to reproduce MoNE paper results (ImageNet-21k classification)
- Add example notebooks