Can I offload the workload on GPU instead of CPU? #1448

Closed
m0sh1x2 opened this issue May 14, 2023 · 2 comments


m0sh1x2 commented May 14, 2023

Hello,

I am testing out the cuBLAS build but at the moment I get 1000% CPU usage and 0% GPU usage:

[screenshot showing ~1000% CPU usage and 0% GPU usage]

Please let me know if there are any other requirements or setup needed to run this. For the initial installation I am following these steps:

  1. Clone the Repo
  2. Install the Nvidia Toolkit
  3. Run
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
  4. Then run the ./main executable with these params:
./main -m ../../models/Wizard-Vicuna-13B-Uncensored.ggml.q4_0.bin -n 1024 -p "Write 10 different ways on how to implement ML with DevOps: 1."

And get this output:

main: build = 547 (601a033)
main: seed  = 1684055753
llama.cpp: loading model from ../../models/Wizard-Vicuna-13B-Uncensored.ggml.q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  90.75 KB
llama_model_load_internal: mem required  = 9807.48 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 0 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 0 MB
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 10 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 1024, n_keep = 0


 Write 10 different ways on how to implement ML with DevOps: 1. Implementing ML models within a containerized environment for faster deployment and scalability
2. Automating the pipeline of building, training, and deploying machine learning models through DevOps tools like Jenkins or Travis CI
3. Integrating ML into continuous integration/continuous delivery (CI/CD) pipelines to ensure accuracy and consistency in model predictions
4. Deploying predictive models into production environments using DevOps practices such as blue-green deployments, canary releases, and rollbacks
5. Using automated testing tools like Selenium or TestCafe to ensure ML models are accurate and reliable before deployment
6. Implementing machine learning algorithms within containerized applications for faster development cycles and improved scalability
7. Integrating machine learning services into infrastructure-as-code (IaC) platforms such as Terraform or CloudFormation for easier management and maintenance
8. Using DevOps tools like Ansible or Puppet to automate the deployment of machine learning models across different environments
9. Implementing machine learning workflows through code using frameworks like TensorFlow, PyTorch, or Scikit-learn within a containerized environment
10. Integrating machine learning libraries into application code for real-time predictions and faster processing times [end of text]

llama_print_timings:        load time =  3343.55 ms
llama_print_timings:      sample time =    90.61 ms /   269 runs   (    0.34 ms per token)
llama_print_timings: prompt eval time =  2363.20 ms /    20 tokens (  118.16 ms per token)
llama_print_timings:        eval time = 104664.79 ms /   268 runs   (  390.54 ms per token)
llama_print_timings:       total time = 108141.89 ms

Any help or advice would be great, unless cuBLAS only offloads some of the memory to the GPU without actually using the GPU for computation.

Thanks

@sevenreasons commented:

-ngl N, --n-gpu-layers N
number of layers to store in VRAM
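
For example, the command from above could be re-run with that flag added (a minimal sketch: -ngl 40 is only an illustration based on the n_layer = 40 shown in the load log above; how many layers actually fit depends on available VRAM):

# offload up to 40 of the model's 40 layers to the GPU; lower -ngl if VRAM runs out
./main -m ../../models/Wizard-Vicuna-13B-Uncensored.ggml.q4_0.bin -n 1024 -ngl 40 -p "Write 10 different ways on how to implement ML with DevOps: 1."

With a nonzero -ngl, the load log should report "[cublas] offloading N layers to GPU" and a nonzero "total VRAM used" instead of the 0 layers / 0 MB shown above.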

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
