This repository contains an end-to-end prototype for Transformer² (“Transformer-Squared”)—a self-adaptive Large Language Model (LLM) framework. It demonstrates:
- Singular Value Fine-tuning (SVF) for parameter-efficient specialization.
- PPO-based RL for directly optimizing the latent (\mathbf{z})-vectors.
- Multi-domain adaptivity through different strategies:
  - Prompt-based classification,
  - Classifier-based (a placeholder here), and
  - Few-shot mixture of experts via CEM.
## Table of Contents

- Overview
- Features & Highlights
- Repository Structure
- Dependencies
- Usage
- Implementation Details
- Extending & Customizing
- FAQ
## Overview

Transformer² is a research framework enabling LLMs to dynamically adapt to multiple domains or tasks by modifying only a minimal number of parameters: specifically, the singular values of the model’s weight matrices. This approach is especially powerful when:
- You want to avoid full fine-tuning of billions of parameters.
- You need to add or specialize for new tasks after pre-training.
- You need the model to quickly adapt to changing requirements at inference time.
In this production-oriented example, we demonstrate how to:
- Compute SVD of select weight matrices from a base model.
- Attach small (\mathbf{z})-vectors that scale those singular values to produce new weight matrices.
- Use PPO to optimize those (\mathbf{z})-vectors for domain-specific tasks.
- Perform adaptation in real time for new tasks or domains.
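To make the core mechanism concrete, here is a minimal, self-contained sketch (not code from this repository) of decomposing one weight matrix and rescaling its singular values with a learned (\mathbf{z})-vector; shapes and values are illustrative:

```python
import torch

# Toy stand-in for one frozen weight matrix of the base model.
W = torch.randn(256, 128)

# Decompose once, offline. U: (256, r), S: (r,), Vh: (r, 128), with r = 128 here.
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

# The only trainable parameters: one scaling coefficient per singular value.
z = torch.ones_like(S, requires_grad=True)

# Adapted weight: W' = U · diag(S ⊙ z) · Vᵀ
W_adapted = U @ torch.diag(S * z) @ Vh

print(W.numel(), z.numel())  # e.g., 32768 frozen weights vs. 128 trainable coefficients
```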
## Features & Highlights

- Parameter Efficiency: We learn only small (\mathbf{z})-vectors, leaving the vast majority of the base model frozen.
- Stable RL Fine-tuning: Uses `trl`'s PPO implementation, integrated with Accelerate.
- Scalability: The pipeline works for smaller or larger models. For very large models (e.g., 70B), you can rely on advanced parallelization.
- Modularity: Adapt the code to new tasks or sub-domains by plugging in your own data, reward functions, and classification prompts.
## Repository Structure

```
.
├── transformer2_production.py   # Main code illustrating Transformer² for production
├── README.md                    # This README file
└── requirements.txt             # Example dependencies (optional)
```

`transformer2_production.py` contains:

- `SvdComponent` & SVD logic
- `SVFWrapper` for storing and managing the (\mathbf{z})-parameters
- `SVF_PPOTrainer` leveraging RL (PPO)
- Adaptation methods (prompt-based, few-shot w/ CEM, etc.)
- CLI setup with `argparse`
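The flags below mirror the example invocation shown in the Usage section; the actual script may define additional options and different defaults, so treat this as an illustrative sketch of the CLI setup:

```python
import argparse

def parse_args() -> argparse.Namespace:
    # Illustrative argument set, matching the example command in the Usage section.
    parser = argparse.ArgumentParser(description="Transformer² SVF + PPO fine-tuning")
    parser.add_argument("--model_name", type=str, required=True, help="Base model to load")
    parser.add_argument("--output_dir", type=str, default="outputs")
    parser.add_argument("--logging_dir", type=str, default="logs")
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--max_epochs", type=int, default=1)
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    return parser.parse_args()

if __name__ == "__main__":
    print(parse_args())
```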
## Dependencies

Below is a minimal set of packages to run this script:
- Python 3.10+
- PyTorch 2.0+ (with CUDA if using GPU)
- Transformers >= 4.30.0
- Accelerate >= 0.18.0
- TRL >= 0.5.0 (for PPO)
- numpy, scipy (for sampling, truncated normal, stats)
You can place them in a `requirements.txt`:

```
torch>=2.0.0
transformers>=4.30.0
accelerate>=0.18.0
trl>=0.5.0
numpy
scipy
```

Install with:

```bash
pip install -r requirements.txt
```
## Usage

1. Clone or copy this repository.

2. Install the dependencies (above).

3. Run `transformer2_production.py` from your terminal, for example:

   ```bash
   python transformer2_production.py \
       --model_name DeepSeekInstruct/large \
       --output_dir outputs \
       --logging_dir logs \
       --batch_size 4 \
       --max_epochs 1 \
       --learning_rate 1e-4
   ```

   Adjust flags as needed.

4. The script will:
   - Load the base model + tokenizer from `--model_name`.
   - Decompose the selected parameters with SVD.
   - Wrap them in an `SVFWrapper`.
   - Launch a PPO training loop on a small mocked “math domain” dataset.
   - Save a checkpoint of the (\mathbf{z})-vectors.
   - Show how to do prompt-based adaptation on a sample question.

5. Inspect results in the logs and the `outputs/` directory.
## Implementation Details

### SVD & SVF

- SVD is computed for each selected parameter matrix ( W ) (e.g., an MLP or attention weight).
- We store (U, S, V^T) in a `SvdComponent`.
- `SVFWrapper` holds:
  - The base model in frozen mode.
  - An `nn.ParameterDict` of learned (\mathbf{z})-vectors, each of length equal to the rank of ( W ).
  - A method `patch_weights()` that reconstructs ( W' = U \cdot \operatorname{diag}(S \cdot z) \cdot V^T ) to modify the base model’s parameters in-place.
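A stripped-down sketch of this idea follows; the class and function names here are simplified stand-ins rather than the repository's actual implementation:

```python
import torch
import torch.nn as nn

class SvdComponent:
    """Frozen SVD factors of one weight matrix (simplified stand-in)."""
    def __init__(self, weight: torch.Tensor):
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.U, self.S, self.Vh = U, S, Vh  # kept frozen; no gradients needed

def patch_weight(comp: SvdComponent, z: torch.Tensor) -> torch.Tensor:
    # Reconstruct W' = U · diag(S ⊙ z) · Vᵀ; only z carries gradients.
    return comp.U @ torch.diag(comp.S * z) @ comp.Vh

# One z-vector per decomposed matrix, registered in an nn.ParameterDict.
# Note: parameter names cannot contain ".", so dotted module paths are mangled.
weight = torch.randn(64, 64)
comp = SvdComponent(weight)
z_params = nn.ParameterDict({"layer0_mlp": nn.Parameter(torch.ones_like(comp.S))})
new_weight = patch_weight(comp, z_params["layer0_mlp"])
print(new_weight.shape)  # torch.Size([64, 64])
```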
### PPO Training

- We rely on PPO from `trl` to train only the (\mathbf{z})-vectors.
- The script:
  - Creates a deep copy of the base model as a frozen “reference model” for the KL-divergence penalty.
  - Calls `PPOTrainer.step(...)` with `(query_tensors, response_tensors, rewards)` to update the policy.
  - Uses a naive substring-based reward in the example. In real usage, you can incorporate advanced checks, code execution, or specialized reward models.
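A heavily simplified sketch of that loop is shown below. It assumes the classic `trl` `PPOTrainer.step(...)` API referenced above (roughly trl 0.5–0.11), uses `gpt2` as a tiny stand-in model, and skips the SVF-specific freezing; in the real script only the (\mathbf{z})-vectors would be trainable:

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # tiny stand-in; the real script loads --model_name
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # frozen KL reference

ppo_trainer = PPOTrainer(PPOConfig(batch_size=2, mini_batch_size=1), model, ref_model, tokenizer)

def naive_reward(response_text: str, expected: str) -> torch.Tensor:
    # Substring check as a placeholder reward, mirroring the example script.
    return torch.tensor(1.0 if expected in response_text else -1.0)

prompts = ["Q: 2 + 2 = ? A:", "Q: 3 * 3 = ? A:"]
answers = ["4", "9"]
query_tensors = [tokenizer(p, return_tensors="pt").input_ids[0] for p in prompts]

response_tensors = []
for q in query_tensors:
    gen = model.generate(q.unsqueeze(0), max_new_tokens=8, do_sample=True,
                         pad_token_id=tokenizer.eos_token_id)
    response_tensors.append(gen[0, q.shape[0]:])  # keep only the newly generated tokens

rewards = [naive_reward(tokenizer.decode(r), a) for r, a in zip(response_tensors, answers)]
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```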
### Adaptation Strategies

The code includes the building blocks for typical Transformer² adaptation:

- Prompt-based:
  - We generate a classification label for an incoming question using the same (or a specialized) LLM.
  - We load or use the (\mathbf{z})-vector that corresponds to the predicted domain.
- Classifier-based:
  - Similar to prompt-based, but we might train a small classification head or an LLM-based classifier.
- Few-shot interpolation:
  - Uses the Cross-Entropy Method (CEM) or another optimizer over possible alpha-coefficients that linearly combine multiple experts’ (\mathbf{z})-vectors.
  - The script includes a function `compute_cem_interpolation(...)` that demonstrates a global alpha search; a generic CEM sketch follows this list.
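The repository's `compute_cem_interpolation(...)` may differ in detail; the following is a generic CEM sketch over alpha-coefficients with a made-up `score_fn` as the few-shot evaluation, using a plain clipped Gaussian rather than the truncated normal mentioned in the dependencies:

```python
import numpy as np

def cem_search_alphas(z_experts, score_fn, num_iters=10, pop_size=32, elite_frac=0.25):
    """Generic CEM over alpha-coefficients combining K expert z-vectors.

    z_adapted = sum_k alpha_k * z_experts[k]; score_fn(z_adapted) is assumed
    to return a scalar few-shot score (higher is better).
    """
    k = len(z_experts)
    mean, std = np.full(k, 1.0 / k), np.full(k, 0.5)
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(num_iters):
        alphas = np.random.normal(mean, std, size=(pop_size, k)).clip(0.0, 1.0)
        scores = np.array([score_fn(sum(a_i * z_i for a_i, z_i in zip(a, z_experts)))
                           for a in alphas])
        elites = alphas[np.argsort(scores)[-n_elite:]]              # keep the best samples
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6  # refit the sampling distribution
    return mean

# Toy check: recover a known mixture of three random "experts".
experts = [np.random.randn(16) for _ in range(3)]
target = 0.7 * experts[0] + 0.3 * experts[2]
alphas = cem_search_alphas(experts, lambda z: -np.linalg.norm(z - target))
print(np.round(alphas, 2))
```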
## Extending & Customizing

- Larger Models
  - For 7B+ or 70B+ models, consider FSDP or DeepSpeed ZeRO to handle memory scaling.
- Custom Reward
  - For code tasks, you can run unit tests on the generated code and assign a reward for correct solutions (a sketch follows this list).
  - For knowledge tasks, you might use a specialized reward model.
- Multiple Domain Experts
  - You can create multiple sets of (\mathbf{z})-vectors (one per domain: math, code, reasoning, vision, etc.) and store them in a dictionary or in separate `.pt` files.
  - Combine them with the few-shot approach if a new domain arises that partially intersects with existing ones.
- Logging & Monitoring
  - Switch PPO’s config to `log_with="wandb"` or `"tensorboard"` for real-time metric tracking.
  - For advanced setups, log to a custom DB or S3.
- Deployment
  - After training, the (\mathbf{z})-vectors are extremely small (megabytes or even kilobytes).
  - Deploy them as sidecar “experts.” At inference time, do a quick two-pass adaptation to pick or combine the correct domain expert.
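As an illustration of the unit-test-based reward mentioned above, here is a hypothetical helper (not part of the repository); a real setup should sandbox execution of untrusted model output:

```python
import subprocess
import tempfile

def unit_test_reward(generated_code: str, test_code: str, timeout: float = 10.0) -> float:
    """Return 1.0 if the generated code passes the given tests, else -1.0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else -1.0
    except subprocess.TimeoutExpired:
        return -1.0

# Toy usage:
solution = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5"
print(unit_test_reward(solution, tests))  # 1.0
```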
## FAQ

- **Why do we only decompose certain layers?**
  - Decomposing every layer can be expensive. Often, you decompose only the MLP or attention projection layers. Empirically, even partial-layer SVF yields good gains and reduces overhead.
- **How can I store (\mathbf{z})-vectors for multiple tasks?**
  - Either keep separate checkpoint files (e.g., `zparams_math.pt`, `zparams_code.pt`) or store them in a dictionary with keys for each domain (a small save/load sketch follows this FAQ).
- **What if my model is too large for SVD on a single GPU?**
  - Use a GPU-accelerated SVD routine, break the matrix into sub-blocks, run CPU-based SVD with enough RAM, or do distributed SVD. Approximate (randomized) SVD is also common for extremely large weight matrices.
- **Can I use LoRA or any other method in synergy?**
  - Yes. In principle, you can combine LoRA with SVD-based parameterization. Transformer² is about dynamic adaptation; the exact parameterization can be flexible.
- **Where do I place reward modeling?**
  - Reward modeling can be integrated into the `SVF_PPOTrainer.train_on_prompts(...)` function. Instead of naive substring checks, you would do more sophisticated scoring.
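A small sketch of storing and loading per-domain (\mathbf{z})-vectors; the parameter names, shapes, and file names here are made up for illustration:

```python
import torch
import torch.nn as nn

# One ParameterDict of z-vectors per domain (names and sizes are illustrative).
z_math = nn.ParameterDict({"layer0_mlp": nn.Parameter(torch.ones(64))})
z_code = nn.ParameterDict({"layer0_mlp": nn.Parameter(torch.ones(64))})

# Option 1: one checkpoint file per domain.
torch.save(z_math.state_dict(), "zparams_math.pt")
torch.save(z_code.state_dict(), "zparams_code.pt")

# Option 2: a single file keyed by domain name.
torch.save({"math": z_math.state_dict(), "code": z_code.state_dict()}, "zparams_all.pt")

# At inference time, load the chosen expert and patch the weights with it.
z_math.load_state_dict(torch.load("zparams_math.pt"))
```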