A reproduction of Google's PaliGemma (an open vision-language model), written mainly for self-study. The implementation follows Google's official code and the associated research paper, and it includes detailed notes on how tensor shapes change as data moves through the network.
- Focus: Documenting the tensor shape transformations at each layer to make the model structure easier to follow.
- Goal: To provide a clear, readable version of PaliGemma for self-study.
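To illustrate the kind of shape bookkeeping these notes track, here is a minimal sketch in plain Python. It is not taken from this repository: the function name `trace_image_shapes` is hypothetical, and the default numbers (224×224 input, patch size 14, vision width 1152, text width 2048) are assumptions based on the published PaliGemma 3B 224 px configuration. Check them against the `config.json` of the checkpoint you download.

```python
from typing import List, Tuple

Shape = Tuple[int, ...]


def trace_image_shapes(
    batch: int = 1,
    image_size: int = 224,   # assumed input resolution
    patch_size: int = 14,    # assumed ViT patch size
    vision_dim: int = 1152,  # assumed vision encoder width
    text_dim: int = 2048,    # assumed language model width
) -> List[Tuple[str, Shape]]:
    """Return (stage, shape) pairs for an image passing through a
    ViT-style encoder followed by a linear projection into text space."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size   # patches per side
    num_patches = grid * grid         # image tokens per picture
    return [
        ("pixel_values", (batch, 3, image_size, image_size)),
        ("patch conv output", (batch, vision_dim, grid, grid)),
        ("flattened patch embeddings", (batch, num_patches, vision_dim)),
        ("vision encoder output", (batch, num_patches, vision_dim)),
        ("projected image tokens", (batch, num_patches, text_dim)),
    ]


if __name__ == "__main__":
    for stage, shape in trace_image_shapes():
        print(f"{stage:<28} {shape}")
```

With the assumed defaults, each image becomes (224 / 14)² = 256 tokens before it is concatenated with the text prompt embeddings.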
- Download Model Weights: Download the model weights from PaliGemma 3B on Hugging Face.
- Clone the repository and install dependencies:

      git clone git@github.com:CazeroZ/PaliGemma_repro.git
      cd PaliGemma_repro
      pip install -r requirements.txt
- Configure the Inference Script: Open `launch_inference.sh` and modify the following variables as needed:
  - `MODEL_PATH`: Set this to the directory where the downloaded model weights are saved.
  - `PROMPT`: Update with the prompt you want to use for inference.
  - `IMAGE_FILE_PATH`: Set this to the path of the input image.
- Run the Inference:

      sh launch_inference.sh
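Before launching, it can help to confirm that the three values set in `launch_inference.sh` are usable. The following is a minimal sketch of such a check. The helper `check_inference_inputs` is hypothetical and not part of this repository; it only assumes that `MODEL_PATH` should be an existing directory, `IMAGE_FILE_PATH` an existing file, and `PROMPT` a non-empty string.

```python
from pathlib import Path
from typing import List


def check_inference_inputs(model_path: str, image_file_path: str, prompt: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems: List[str] = []
    # The weights are expected in a directory (assumption based on the setup steps).
    if not Path(model_path).is_dir():
        problems.append(f"MODEL_PATH is not a directory: {model_path}")
    # The input image must be a regular file.
    if not Path(image_file_path).is_file():
        problems.append(f"IMAGE_FILE_PATH is not a file: {image_file_path}")
    # A blank prompt is almost certainly a configuration mistake.
    if not prompt.strip():
        problems.append("PROMPT is empty")
    return problems
```

Call it with the same values you put in the script and print any returned messages; an empty list means the paths and prompt look ready for inference.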