PurnaChandraPanda/quantized-model-inference

Quantized model format

  • .gguf

Details

  • Host a quantized gguf-format LLM model on the llama-cpp-python web server package.
  • The llama-cpp-python[server] module is packaged inside the docker image.
  • llama-cpp-python/docker discusses a wide variety of docker image configurations.
  • This sample covers the azureml extension of the openblas docker image.
  • In the azureml environment, one of the azureml base images is leveraged, and the rest of the llama-cpp-python docker files are reused on top of it.
  • The advantage of llama-cpp-python[server] is that it offers OpenAI API compatible web server endpoints (a minimal client sketch follows this list).
  • The azureml scoring script takes care of the compatibility points with both azureml and llama-cpp-python[server].
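
Because the server speaks the OpenAI API, any OpenAI-style HTTP client can call it once a deployment is up. Below is a minimal sketch, assuming llama-cpp-python's default port 8000 and a hypothetical model alias; neither value comes from this repo.

    # Minimal sketch of calling the OpenAI-compatible chat endpoint.
    # The URL and model alias are assumptions for illustration.
    import requests

    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "phi3-gguf",  # hypothetical model alias
            "messages": [{"role": "user", "content": "Say hello in one sentence."}],
            "max_tokens": 64,
        },
    )
    print(resp.json()["choices"][0]["message"]["content"])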

CUDA support in inferencing

  • The docker image needs to be prepped with nvidia/cuda support. On top of it, an azureml base image is used for azureml inferencing support.
  • The current work on docker with CUDA in azureml is an extension of llama-cpp-python/docker/cuda.
  • Note that the critical part for the right CUDA support on A100 machines is:
RUN CMAKE_ARGS="-DGGML_CUDA=on -DGGML_CUDA_FORCE_CUBLAS=on -DLLAVA_BUILD=off -DCMAKE_CUDA_ARCHITECTURES=80" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
  • Here, CMAKE_CUDA_ARCHITECTURES holds importance: the value 80 matches the compute capability (8.0) of A100 GPUs, which is what actually makes the build CUDA compatible.
  • Additionally, on the inferencing script side, validation logic is added to check whether the host is GPU enabled. If yes, another param --n_gpu_layers=-1 is added (on the server end) as per the llama-cpp-python web server docs; a build-verification sketch follows the snippet below.
        # Check in the environment whether CUDA is visible.
        # If yes, load the model on GPUs for inferencing; else, load it on CPUs.
        # Assumes `import os` at the top of the scoring script.
        if "NVIDIA_VISIBLE_DEVICES" in os.environ:
            cmd = ["python", "-m", "llama_cpp.server", "--model", self._model_path, "--n_gpu_layers", "-1"]
        else:
            cmd = ["python", "-m", "llama_cpp.server", "--model", self._model_path]
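
To confirm that the wheel inside the image was actually compiled with CUDA offload, a quick check along the following lines can help. This is a hedged sketch using llama-cpp-python's llama_supports_gpu_offload binding; run it inside the built container.

    # Hedged sketch: verify the llama-cpp-python build supports GPU offload.
    from llama_cpp import llama_supports_gpu_offload

    if llama_supports_gpu_offload():
        print("CUDA offload available: --n_gpu_layers=-1 places all layers on GPU.")
    else:
        print("CPU-only build: the server will fall back to CPU inferencing.")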

Pre-requisites

  • In the azureml compute instance, use the v2 conda env and an updated azure-ai-ml package.
conda activate azureml_py310_sdkv2
pip install -U azure-ai-ml
  • This sample is tested with the python package llama-cpp-python==0.3.2, and also 0.3.5 (the latest at the time of writing); a quick version check follows.
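
A hedged way to confirm which version landed in the environment:

    # Print the installed llama-cpp-python version (expect 0.3.2 or 0.3.5).
    import llama_cpp

    print(llama_cpp.__version__)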

Run

For any model inferencing, the following steps are carried out as base actions (a hedged azure-ai-ml sketch follows the list).

  • Register model asset as custom model
  • Register environment asset as custom environment
  • Create managed online endpoint
  • Create managed online deployment for the endpoint
  • Test the online endpoint
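
For orientation, these steps map onto the azure-ai-ml (v2) SDK roughly as below. This is a minimal sketch, not this repo's exact notebooks: the workspace details, asset names, paths, instance type, and scoring-script location are all placeholder assumptions.

    # Hedged sketch of the base actions with the azure-ai-ml (v2) SDK.
    # Every name/path below is a placeholder, not this repo's exact value.
    from azure.ai.ml import MLClient
    from azure.ai.ml.entities import (
        BuildContext,
        CodeConfiguration,
        Environment,
        ManagedOnlineDeployment,
        ManagedOnlineEndpoint,
        Model,
    )
    from azure.identity import DefaultAzureCredential

    ml_client = MLClient(
        DefaultAzureCredential(),
        subscription_id="<subscription-id>",
        resource_group_name="<resource-group>",
        workspace_name="<workspace>",
    )

    # 1. Register model asset as custom model (folder holding the .gguf file).
    model = ml_client.models.create_or_update(
        Model(name="tinyllama-gguf", path="model/", type="custom_model")
    )

    # 2. Register environment asset as custom environment (docker build context).
    env = ml_client.environments.create_or_update(
        Environment(name="llama-cpp-server", build=BuildContext(path="docker/"))
    )

    # 3. Create managed online endpoint.
    ml_client.online_endpoints.begin_create_or_update(
        ManagedOnlineEndpoint(name="gguf-endpoint", auth_mode="key")
    ).result()

    # 4. Create managed online deployment for the endpoint.
    ml_client.online_deployments.begin_create_or_update(
        ManagedOnlineDeployment(
            name="blue",
            endpoint_name="gguf-endpoint",
            model=model,
            environment=env,
            code_configuration=CodeConfiguration(code="src/", scoring_script="score.py"),
            instance_type="Standard_F4s_v2",
            instance_count=1,
        )
    ).result()

    # 5. Test the online endpoint with a sample request file.
    print(
        ml_client.online_endpoints.invoke(
            endpoint_name="gguf-endpoint", request_file="sample-request.json"
        )
    )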

Inference deepseek-r1-gguf model

Inference phi3 gguf model

phi3-gguf-online-endpoint.ipynb

Inference tinyllama1.1b gguf model

tinyllama-gguf-online-endpoint.ipynb

Inference phi3 gguf model - CUDA support

phi3-gguf-online-endpoint-gpu.ipynb

Inference tinyllama1.1b gguf model - CUDA support

tinyllama-gguf-online-endpoint-gpu.ipynb
