Can I offload the workload on GPU instead of CPU? #1448

Closed
m0sh1x2 opened this issue May 14, 2023 · 2 comments


m0sh1x2 commented May 14, 2023

Hello,

I am testing out the cuBLAS build but at the moment I get 1000% CPU usage and 0% GPU usage:

[screenshot showing ~1000% CPU usage and 0% GPU usage]

Please let me know if there are any other requirements or setup needed to run this. For the initial installation I am following these steps:

  1. Clone the Repo
  2. Install the Nvidia Toolkit
  3. Run
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
  4. Then run the ./main executable with these params:
./main -m ../../models/Wizard-Vicuna-13B-Uncensored.ggml.q4_0.bin -n 1024 -p "Write 10 different ways on how to implement ML with DevOps: 1."

And get this output:

main: build = 547 (601a033)
main: seed  = 1684055753
llama.cpp: loading model from ../../models/Wizard-Vicuna-13B-Uncensored.ggml.q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  90.75 KB
llama_model_load_internal: mem required  = 9807.48 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 0 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 0 MB
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 10 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 1024, n_keep = 0


 Write 10 different ways on how to implement ML with DevOps: 1. Implementing ML models within a containerized environment for faster deployment and scalability
2. Automating the pipeline of building, training, and deploying machine learning models through DevOps tools like Jenkins or Travis CI
3. Integrating ML into continuous integration/continuous delivery (CI/CD) pipelines to ensure accuracy and consistency in model predictions
4. Deploying predictive models into production environments using DevOps practices such as blue-green deployments, canary releases, and rollbacks
5. Using automated testing tools like Selenium or TestCafe to ensure ML models are accurate and reliable before deployment
6. Implementing machine learning algorithms within containerized applications for faster development cycles and improved scalability
7. Integrating machine learning services into infrastructure-as-code (IaC) platforms such as Terraform or CloudFormation for easier management and maintenance
8. Using DevOps tools like Ansible or Puppet to automate the deployment of machine learning models across different environments
9. Implementing machine learning workflows through code using frameworks like TensorFlow, PyTorch, or Scikit-learn within a containerized environment
10. Integrating machine learning libraries into application code for real-time predictions and faster processing times [end of text]

llama_print_timings:        load time =  3343.55 ms
llama_print_timings:      sample time =    90.61 ms /   269 runs   (    0.34 ms per token)
llama_print_timings: prompt eval time =  2363.20 ms /    20 tokens (  118.16 ms per token)
llama_print_timings:        eval time = 104664.79 ms /   268 runs   (  390.54 ms per token)
llama_print_timings:       total time = 108141.89 ms

Any help or advice would be great, unless cuBLAS only offloads some of the memory to the GPU without actually using the GPU for computation.

Thanks

@sevenreasons commented:

-ngl N, --n-gpu-layers N
number of layers to store in VRAM
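
For example, the command from above could be re-run with that flag added (a minimal sketch: -ngl 40 is only an illustration based on the n_layer = 40 shown in the load log above; how many layers actually fit depends on available VRAM):

# offload up to 40 of the model's 40 layers to the GPU; lower -ngl if VRAM runs out
./main -m ../../models/Wizard-Vicuna-13B-Uncensored.ggml.q4_0.bin -n 1024 -ngl 40 -p "Write 10 different ways on how to implement ML with DevOps: 1."

With a nonzero -ngl, the load log should report "[cublas] offloading N layers to GPU" and a nonzero "total VRAM used" instead of the 0 layers / 0 MB shown above.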

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
