TinyLlama is a relatively small large language model with impressive capabilities for its size. This project aims to provide a simpler implementation of TinyLlama. The only required dependency is PyTorch.
- Install PyTorch.
- Download and extract this repository.
- Run `main.py` to chat with the llama.
- Press CTRL + C to interrupt the response.
- Press CTRL + C again to exit the program.
- CUDA will be used if available, but requires approximately 3 GB of VRAM. If you do not have that much VRAM, you can set the computation device manually in `main.py`.
- Only inference is supported. Training is not supported.
- Chat history is currently not supported.
- This project includes a pure Python implementation of a subset of the SentencePiece tokenizer. It is slower than the C++ implementation, but it is sufficient for this project.
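The device selection mentioned above typically comes down to one line. A minimal sketch of how it might look (the exact variable name used in `main.py` is an assumption):

```python
import torch

# Use the GPU when one is available; otherwise fall back to the CPU.
# If your GPU has less than the ~3 GB of VRAM the model needs,
# hard-code torch.device("cpu") instead.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(device.type)
```

With `device` set this way, the model and input tensors can be moved onto it with `.to(device)`.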
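To give a flavor of what piece-based tokenization involves, here is an illustrative greedy longest-match tokenizer in pure Python. This is only a sketch with a made-up vocabulary, not the project's actual implementation; real SentencePiece models use BPE or a unigram language model rather than pure greedy matching:

```python
def tokenize(text, vocab):
    """Split text into pieces by greedily taking the longest
    vocabulary piece that prefixes the remaining input."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, shrinking until a match.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as a single-character piece.
            pieces.append(text[i])
            i += 1
    return pieces

# Hypothetical vocabulary; "▁" marks a word boundary, as in SentencePiece.
vocab = {"▁", "▁hel", "lo", "▁hello", "▁wor", "ld"}
print(tokenize("▁hello▁world", vocab))  # ['▁hello', '▁wor', 'ld']
```

Because it runs piece matching in an interpreted loop rather than compiled C++, an approach like this is noticeably slower, which matches the trade-off described above.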