PaLI-GEMMA Model Reproduction

A reproduction of Google's PaLI-GEMMA, an open vision-language model, built mainly for self-study. The implementation follows Google's official code and the accompanying research paper, and it includes detailed notes on how tensor shapes change as data flows through the network.

Key Points

Focus: Documenting the tensor shape transformations at each layer, as a way to learn the model structure in depth.
Goal: Providing a clear, readable version of PaLI-GEMMA for self-study.
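Here is a small example of the shape bookkeeping these notes cover. A ViT-style vision encoder cuts the input image into non-overlapping square patches, so the number of image tokens it emits is (image_size / patch_size)². The sketch below assumes the 224×224 input resolution and 14×14 patches of the 224-pixel PaLI-GEMMA 3B checkpoints. The function name is hypothetical and not part of this repository.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): number of image tokens produced
# by a patch-embedding layer that uses non-overlapping square patches.
num_image_tokens() {
  image_size=$1   # input height/width in pixels (square image assumed)
  patch_size=$2   # patch height/width in pixels
  per_side=$(( image_size / patch_size ))   # patches along one side
  echo $(( per_side * per_side ))           # total patches = image tokens
}

num_image_tokens 224 14   # prints 256 (a 16 x 16 grid of patches)
```

Each of those 256 tokens becomes one row of the image-embedding tensor. This means the image part of the input sequence has a fixed length for a given resolution. At 448×448, the same arithmetic gives 1024 tokens.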

Setup

Download Model Weights: Download the PaLI-GEMMA 3B model weights from Hugging Face.

Clone the Repository and Install Dependencies:
```
git clone git@github.com:CazeroZ/PaliGemma_repro.git
cd PaliGemma_repro
pip install -r requirements.txt
```

Configure the Inference Script:
- Open launch_inference.sh and set the following variables:
  - MODEL_PATH: Set this to the directory where the downloaded model weights are saved.
  - PROMPT: Set this to the text prompt to use for inference.
  - IMAGE_FILE_PATH: Set this to the path of the input image.
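Before running, you may want to confirm that the three variables point at real inputs. The sketch below assumes that launch_inference.sh assigns MODEL_PATH, PROMPT and IMAGE_FILE_PATH as plain shell variables before it invokes the model. The example values and the `check_inputs` function are hypothetical. One option is to paste the function into the script after those assignments.

```shell
#!/bin/sh
# Example values only (hypothetical): replace with your own paths and prompt.
MODEL_PATH="$HOME/paligemma-weights"
PROMPT="describe the image"
IMAGE_FILE_PATH="images/example.jpg"

# Hypothetical pre-flight check: prints a message to stderr and returns
# non-zero if any configured input is missing or empty.
check_inputs() {
  if [ ! -d "$MODEL_PATH" ]; then
    echo "MODEL_PATH is not a directory: $MODEL_PATH" >&2
    return 1
  fi
  if [ ! -f "$IMAGE_FILE_PATH" ]; then
    echo "IMAGE_FILE_PATH is not a file: $IMAGE_FILE_PATH" >&2
    return 1
  fi
  if [ -z "$PROMPT" ]; then
    echo "PROMPT is empty" >&2
    return 1
  fi
}
```

Calling `check_inputs || exit 1` right before the model is launched stops the script early with a readable message. Without the check, a bad path would surface later as an error from the inference code.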
Run the Inference:
```
sh launch_inference.sh
```
