This repo is part of the foundation-model-stack organization and holds new features staged for integration into foundation-model-stack. It is the home for extensions, research and in-development work, and fms-based models trained by IBM.
pip install -e .
MLPSpeculator: a lightweight speculator model that can be used alongside a generative model to speed up inference (currently deployed in IBM TGIS, with training in fms-fsdp)
PagedKVCacheManager: an implementation of kv-cache management that provides the inputs a user needs to run paged-attention with their own models (currently deployed in IBM TGIS)
PagedLLaMA: a LLaMA implementation that uses paged-attention in Multi-Head Attention. This model is compilable without graph breaks.
speculative generation: a reference implementation of speculative generation using PagedKVCacheManager and MLPSpeculator
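The draft-then-verify idea behind speculative generation can be illustrated with a minimal, framework-free sketch. This is not the fms-extras implementation: the function names (`speculative_step`, `generate`, `toy_base_next`, `toy_draft_next`) are hypothetical, the draft function is only a stand-in and does not model how MLPSpeculator works internally, decoding is greedy, and verification is written as a loop for readability, whereas a real system would typically score all drafted tokens in one batched forward pass.

```python
from typing import Callable, List

Token = int
# A "model" here is just a function mapping a token sequence to its greedy next token.
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], base_next: NextTokenFn,
                     draft_next: NextTokenFn, n_draft: int) -> List[Token]:
    """Run one greedy draft-then-verify step and return the tokens to append."""
    # 1. Draft: the cheap model proposes n_draft tokens autoregressively.
    drafted: List[Token] = []
    for _ in range(n_draft):
        drafted.append(draft_next(prefix + drafted))

    # 2. Verify: keep drafted tokens while they match the base model's choice.
    accepted: List[Token] = []
    for tok in drafted:
        expected = base_next(prefix + accepted)
        if tok != expected:
            # First mismatch: take the base model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # All drafts accepted: the base model contributes one extra token.
    accepted.append(base_next(prefix + accepted))
    return accepted


def generate(prompt: List[Token], base_next: NextTokenFn, draft_next: NextTokenFn,
             max_new_tokens: int, n_draft: int = 3) -> List[Token]:
    """Generate max_new_tokens tokens by repeating speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, base_next, draft_next, n_draft))
    # A step may overshoot, so trim to the requested length.
    return out[: len(prompt) + max_new_tokens]


# Toy stand-ins (assumptions for illustration only): the base model counts
# upward modulo 10, and the draft model agrees except right after token 4.
def toy_base_next(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


def toy_draft_next(seq: List[Token]) -> Token:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


print(generate([0], toy_base_next, toy_draft_next, max_new_tokens=8))
# [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

With greedy decoding the output matches what the base model alone would produce; the draft model only changes how many base-model steps are needed to get there.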
This repo follows a structure similar to that of foundation-model-stack:
fms_extras/models/ - Pure PyTorch implementations of popular model architectures, without requiring any specific common interface beyond nn.Module. Each model configuration is registered with fms.models.register_model() so that instances can be obtained through fms.models.get_model('architecture', 'variant', '/path/to/data'). Each model can also register sources/formats/versions of data to load (e.g. checkpoints provided by Meta, HF, or trained from this repo).
fms_extras/models/hf/ - Adapters that wrap our native PyTorch FMS model architecture implementations in HF-compatible interfaces. Each FMS model implements an adapter, and adapted instances are obtained via fms.models.hf.to_hf_api(model).
fms_extras/utils/ - Other operators useful in working with LLMs. These include a speculative_generate() function, a PagedKVCacheManager class for easy-to-use kv-cache management with paged attention kernels, etc.
scripts/ - Various scripts for inference (paged generation and speculative generation).
csrc/ - Custom kernels used in fms-extras, currently related to paged-attention.
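The register-then-look-up pattern described for fms_extras/models/ can be sketched in plain Python. This is a toy under stated assumptions, not the actual fms.models code: the names `toy_register_model`, `toy_get_model`, `toy_list_variants`, and `ToyModel` are hypothetical, the signatures are invented for illustration, and checkpoint loading (the '/path/to/data' argument) is left out entirely.

```python
from typing import Any, Callable, Dict, List, Tuple

# Maps (architecture, variant) to a zero-argument factory that builds the model.
_REGISTRY: Dict[Tuple[str, str], Callable[[], Any]] = {}


def toy_register_model(architecture: str, variant: str,
                       factory: Callable[[], Any]) -> None:
    """Register a factory for one architecture/variant pair (hypothetical helper)."""
    key = (architecture, variant)
    if key in _REGISTRY:
        raise KeyError(f"{architecture}/{variant} is already registered")
    _REGISTRY[key] = factory


def toy_get_model(architecture: str, variant: str) -> Any:
    """Build a fresh model instance from a registered factory (hypothetical helper)."""
    try:
        factory = _REGISTRY[(architecture, variant)]
    except KeyError:
        raise KeyError(f"unknown model {architecture}/{variant}") from None
    return factory()


def toy_list_variants(architecture: str) -> List[str]:
    """List the registered variants of an architecture, sorted by name."""
    return sorted(v for (a, v) in _REGISTRY if a == architecture)


class ToyModel:
    """Stand-in for an nn.Module subclass; it only stores its configuration."""

    def __init__(self, n_layers: int, width: int) -> None:
        self.n_layers = n_layers
        self.width = width


# Each configuration is registered once, typically at import time.
toy_register_model("toy_arch", "small", lambda: ToyModel(n_layers=2, width=64))
toy_register_model("toy_arch", "large", lambda: ToyModel(n_layers=8, width=256))

model = toy_get_model("toy_arch", "small")
print(model.n_layers, model.width)    # 2 64
print(toy_list_variants("toy_arch"))  # ['large', 'small']
```

Keying the registry on the architecture and variant names keeps model definitions decoupled from the code that requests them, which is the property the structure above relies on.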