RTX 5090 Performance on Ubuntu Gemma 3 #9701

Open
MMaturax opened this issue Mar 12, 2025 · 13 comments
Labels
bug Something isn't working

Comments

@MMaturax

MMaturax commented Mar 12, 2025

What is the issue?

I'm getting the following results with the RTX 5090 on Ubuntu. For comparison, I tested similar models, all using the default q4 quantization.

Performance Comparison:
Gemma2:9B = ~150 tokens/s
vs
Gemma3:4B = ~130 tokens/s 🤔

Gemma3:12B = ~78 tokens/s 🤔?? vs
Qwen2.5:14B = ~120 tokens/s

Gemma3:27B = ~50 tokens/s
vs
Gemma2:27B = ~76 tokens/s
Qwen2.5:32B = ~64 tokens/s
DeepSeek-R1:32B = ~64 tokens/s
Mistral-Small:24B = ~93 tokens/s

It seems like something is off—Gemma 3's performance is surprisingly slow even on an RTX 5090. No matter how good the model is, this kind of slowdown is a significant drawback.

The Gemma 2 series is my favorite open model series so far, and I really hope the Gemma 3 performance issue gets addressed soon.

Is this slowness due to the model itself or could there be a different problem? What are your results?
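
(For anyone comparing numbers: a rough sketch of pulling the same eval rate out of Ollama's HTTP API instead of reading the interactive --verbose output. The model tag and prompt are only examples, and jq is assumed to be installed.)

# eval_count is the number of generated tokens and eval_duration is in
# nanoseconds, so tokens/s = eval_count / eval_duration * 1e9.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "gemma3:4b", "prompt": "hello", "stream": false}' |
  jq '.eval_count / .eval_duration * 1e9'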

Debug Log

debug-log.txt

Prompt

zemin@ai-server:~$ ollama run gemma3:4b --verbose
>>> hello
Hello there! How’s your day going? Is there anything I can help you with today? 😊 

Do you want to:

*   Chat about a topic?
*   Get help with something (like writing, research, or calculations)?
*   Just have a friendly conversation?

total duration:       501.89252ms
load duration:        33.09909ms
prompt eval count:    10 token(s)
prompt eval duration: 33ms
prompt eval rate:     303.03 tokens/s
eval count:           62 token(s)
eval duration:        434ms
eval rate:            142.86 tokens/s
>>> 5+8/16=?
Okay, let's solve that!

5 + 8 / 16 = 5 + 0.5 = 5.5

So the answer is **5.5** 

Do you want to try another math problem?

total duration:       474.17708ms
load duration:        32.674048ms
prompt eval count:    88 token(s)
prompt eval duration: 14ms
prompt eval rate:     6285.71 tokens/s
eval count:           52 token(s)
eval duration:        409ms
eval rate:            127.14 tokens/s
>>> 
zemin@ai-server:~$ ollama --version
ollama version is 0.6.0
zemin@ai-server:~$ hostnamectl
 Static hostname: ai-server
       Icon name: computer-desktop
         Chassis: desktop 🖥️
Operating System: Ubuntu 24.04.2 LTS              
          Kernel: Linux 6.8.0-55-generic
    Architecture: x86-64
 Hardware Vendor: ASUS
  Hardware Model: ROG STRIX B850-E GAMING WIFI
Firmware Version: 0825
   Firmware Date: Fri 2024-11-29
    Firmware Age: 3month 1w 5d

OS

Ubuntu 24.04.2 LTS

GPU

Nvidia RTX 5090

NVIDIA-SMI 570.86.16
Driver Version: 570.86.16
CUDA Version: 12.8

CPU

AMD 7950X3D

Ollama version

0.6.0

Ollama Settings

Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_ORIGINS=*"

Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=f16"
Environment="CUDA_VISIBLE_DEVICES=0"

Environment="OLLAMA_DEBUG=1"
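
(For reference, a minimal sketch of how to change and apply these variables, assuming the standard systemd service install on Linux:)

sudo systemctl edit ollama          # add or change Environment= lines under [Service]
sudo systemctl daemon-reload
sudo systemctl restart ollama
systemctl show ollama --property=Environment   # confirm the variables were picked up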
MMaturax added the bug label Mar 12, 2025
@MMaturax
Author

MMaturax commented Mar 12, 2025

If there's an issue, I hope the information I provided helps. What I'm curious about is whether this slow performance indicates a problem or if it's just the model's standard speed. I would expect at least 200 tokens/s for 4B Q4.

LLaMA 3.1: Even 8B Q4 runs at 190+ tokens/s.

@rick-github
Collaborator

There's a performance hit for gemma3 caused by flash attention. It's particularly noticeable with OLLAMA_KV_CACHE_TYPE of q4_0 or q8_0.
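
(A quick way to A/B this on a single prompt, sketched below. It assumes no other ollama instance is already listening on port 11434 and uses the one-shot form of ollama run; the model tag is an example.)

# Run the server twice, once with flash attention off and once with it on,
# and compare the reported eval rates for the same prompt.
for fa in 0 1; do
  OLLAMA_FLASH_ATTENTION=$fa OLLAMA_KV_CACHE_TYPE=f16 ollama serve >/dev/null 2>&1 &
  srv=$!
  sleep 5
  echo "OLLAMA_FLASH_ATTENTION=$fa"
  ollama run gemma3:4b --verbose "hello" 2>&1 | grep "eval rate"
  kill $srv; wait $srv 2>/dev/null
done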

@MMaturax
Author

MMaturax commented Mar 12, 2025

I'm still testing, and the response quality of the models is very good compared to competitors. However, I hope this performance issue can be resolved. It doesn't seem normal for a 4B model to run slower than an 8B model.

Image

Image

@MMaturax
Author

> There's a performance hit for gemma3 caused by flash attention. It's particularly noticeable with OLLAMA_KV_CACHE_TYPE of q4_0 or q8_0.

When I set OLLAMA_FLASH_ATTENTION=0, the inference speed increased. Is this the issue you were referring to? Normally, disabling Flash Attention should make it slower, but the opposite happened.

Image

@MMaturax
Author

MMaturax commented Mar 12, 2025

Even after disabling Flash Attention, there was no performance drop in Gemma 2, LLaMA 3.1, and all other models. Either the performance remained the same or actually improved.

Could the issue be caused by Ollama's own Flash Attention implementation rather than the models themselves?


There was a slight performance improvement.
Image

There was no change.
Image

There was no change.
Image

There was a slight performance improvement.
Image

There was no change.
Image

There was no change.
Image

There was no change.
Image

Note:
In the tests shared by the Reddit user, Gemma 3:27B runs at 22 tokens/s, while QWQ:32B runs at 17.90 tokens/s.
https://www.reddit.com/r/ollama/comments/1j9uxlr/new_google_gemma3_inference_speeds_on_macbook_pro/

However, on my system, it's the opposite—QWQ is faster:
Gemma 3:27B → 53 tokens/s
QWQ:32B → 62 tokens/s

Update 2:

I ran similar tests using llama.cpp, and the results were the same as with Ollama when Flash Attention was disabled.

For example, with Gemma 3 12B, I got 85.48 tokens per second.

However, enabling Flash Attention actually slows it down, and disabling it speeds it up. Both Ollama and llama.cpp behave exactly the same way.

Has Flash Attention become useless? Just a few months ago, it provided a significant performance boost even with Q4 models, but now it seems to be doing the opposite.

(base) zemin@maturax:~/llama3/bin$ ./llama-cli -m ../models/gemma-3-12b-it-Q4_K_M.gguf -ngl 999 -p "5+8/16=?"  -ptc 1 -st
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | matrix cores: KHR_coopmat
build: 4879 (f08f4b31) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (NVIDIA GeForce RTX 5090) - 32607 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 626 tensors from ../models/gemma-3-12b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 3
llama_model_loader: - kv   3:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   4:                         general.size_label str              = 12B
llama_model_loader: - kv   5:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   6:                      gemma3.context_length u32              = 131072
llama_model_loader: - kv   7:                    gemma3.embedding_length u32              = 3840
llama_model_loader: - kv   8:                         gemma3.block_count u32              = 48
llama_model_loader: - kv   9:                 gemma3.feed_forward_length u32              = 15360
llama_model_loader: - kv  10:                gemma3.attention.head_count u32              = 16
llama_model_loader: - kv  11:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  12:                gemma3.attention.key_length u32              = 256
llama_model_loader: - kv  13:              gemma3.attention.value_length u32              = 256
llama_model_loader: - kv  14:                      gemma3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  15:            gemma3.attention.sliding_window u32              = 1024
llama_model_loader: - kv  16:             gemma3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  17:                   gemma3.rope.scaling.type str              = linear
llama_model_loader: - kv  18:                 gemma3.rope.scaling.factor f32              = 8.000000
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,262208]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  22:                      tokenizer.ggml.scores arr[f32,262208]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,262208]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 106
llama_model_loader: - kv  26:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  28:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  29:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  31:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  289 tensors
llama_model_loader: - type q4_K:  288 tensors
llama_model_loader: - type q6_K:   49 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 6.79 GiB (4.96 BPW) 
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3840
print_info: n_layer          = 48
print_info: n_head           = 16
print_info: n_head_kv        = 8
print_info: n_rot            = 256
print_info: n_swa            = 1024
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 2048
print_info: n_embd_v_gqa     = 2048
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 6.2e-02
print_info: n_ff             = 15360
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 12B
print_info: model params     = 11.77 B
print_info: general.name     = Gemma 3
print_info: vocab type       = SPM
print_info: n_vocab          = 262208
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 106 '<end_of_turn>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:      Vulkan0 model buffer size =  6956.32 MiB
load_tensors:   CPU_Mapped model buffer size =   787.69 MiB
.................................................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 1000000.0
llama_init_from_model: freq_scale    = 0.125
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
llama_kv_cache_init:    Vulkan0 KV buffer size =  1536.00 MiB
llama_init_from_model: KV self size  = 1536.00 MiB, K (f16):  768.00 MiB, V (f16):  768.00 MiB
llama_init_from_model: Vulkan_Host  output buffer size =     1.00 MiB
llama_init_from_model:    Vulkan0 compute buffer size =   519.62 MiB
llama_init_from_model: Vulkan_Host compute buffer size =    23.51 MiB
llama_init_from_model: graph nodes  = 1927
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<start_of_turn>user
You are a helpful assistant

Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model


system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

sampler seed: 2736984370
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

user
5+8/16=?
model
To solve this, we need to follow the order of operations (PEMDAS/BODMAS):

1.  **Division:** 8 / 16 = 0.5
2.  **Addition:** 5 + 0.5 = 5.5

Therefore, 5 + 8/16 = **5.5**
 [end of text]


llama_perf_sampler_print:    sampling time =       6.09 ms /    92 runs   (    0.07 ms per token, 15094.34 tokens per second)
llama_perf_context_print:        load time =    1364.86 ms
llama_perf_context_print: prompt eval time =      76.30 ms /    16 tokens (    4.77 ms per token,   209.70 tokens per second)
llama_perf_context_print:        eval time =     877.44 ms /    75 runs   (   11.70 ms per token,    85.48 tokens per second)
llama_perf_context_print:       total time =     978.34 ms /    91 tokens
(base) zemin@maturax:~/llama3/bin$ 

@zhangzhongpeng02

On my computer, qwq-32b only uses 20 GB of GPU RAM, but gemma3-27b uses 20 GB of GPU RAM plus 20 GB of system RAM. Why?

@the-hampel

I have the same issue on my Mac M3 Max using ollama 0.6. Other models like qwen2.5-coder:32b use exactly 20 GB of VRAM, but with gemma3:27b I also see something like ~20 GB VRAM + ~20 GB RAM. The model runs okayish fast at ~9 tokens/s, compared to 11 tokens/s for qwen2.5-coder:32b. On my Linux machine with an RTX 4090 I can confirm this behavior: loading qwen2.5-coder:32b does not use any system memory, but loading gemma3:27b uses around ~20 GB of system memory, even though ollama clearly runs the inference only on the GPU and gets around 40 tokens/s! Why is so much system memory used when starting gemma3?
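
(A quick way to compare how ollama reports the split versus what the driver reports, sketched below; the nvidia-smi line applies only to the Linux/RTX machine, not the Mac.)

ollama ps                                                       # CPU/GPU split as ollama sees it
nvidia-smi --query-gpu=memory.used,memory.total --format=csv    # what the driver actually has allocated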

@laosuan

laosuan commented Mar 13, 2025

I tested on a 4090: Gemma 3:27B → 40 tokens/s.

@rick-github
Collaborator

> Has Flash Attention become useless? Just a few months ago, it provided a significant performance boost even with Q4 models, but now it seems to be doing the opposite.

Generally speaking, the version of ollama hasn't affected the performance effect of flash attention. The exception is the falcon3 model, which got a tps increase for q4_0 and q8_0 sometime between 0.5.7 and 0.5.13. For some models, q4_0 and q8_0 KV cache type results in a significant decrease in tps.

Image

The main advantage of FA is to decrease the size of the context cache.

Image
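
(As a concrete number for the 12B case above: the llama.cpp log reports n_layer = 48, n_embd_k_gqa = n_embd_v_gqa = 2048, and kv_size = 4096, so the f16 cache size can be reproduced by hand. The quantized-cache estimates below ignore block overhead.)

# f16 KV cache for gemma3 12B at 4096 context, 2 bytes per element:
echo "K: $(( 48 * 4096 * 2048 * 2 / 1024 / 1024 )) MiB"   # 768, matches the log
echo "V: $(( 48 * 4096 * 2048 * 2 / 1024 / 1024 )) MiB"   # 768, 1536 MiB total
# A q8_0 cache would be roughly half of this and q4_0 roughly a quarter,
# which is where the memory saving becomes meaningful at large contexts.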

@MMaturax
Author

MMaturax commented Mar 13, 2025

> I have the same issue on my Mac M3 Max using ollama 0.6. Other models like qwen2.5-coder:32b use exactly 20 GB of VRAM, but with gemma3:27b I also see something like ~20 GB VRAM + ~20 GB RAM. The model runs okayish fast at ~9 tokens/s, compared to 11 tokens/s for qwen2.5-coder:32b. On my Linux machine with an RTX 4090 I can confirm this behavior: loading qwen2.5-coder:32b does not use any system memory, but loading gemma3:27b uses around ~20 GB of system memory, even though ollama clearly runs the inference only on the GPU and gets around 40 tokens/s! Why is so much system memory used when starting gemma3?

That's really interesting. I observed the same issue with Gemma 3 27B. While VRAM usage is 21GB/32GB, it also uses around 18GB of system RAM at the same time. KV cache is in FP16.

IDLE:
Image

Model is loaded:
Image

@rick-github I know that Flash Attention isn't primarily designed to boost speed, but with KV cache in FP16, it shouldn't be slowing things down either, right?

I think if KV cache is in F16, then FA doesn’t impact memory usage anyway, so it’s better to leave it disabled.

Also, LLaMA 3.1 8B runs at 200 tokens/s, while Gemma 3 4B only reaches 160 tokens/s.
Do you think this is an issue, or could it be a natural result of the model's training process or multimodal nature?

@woshitoutouge

woshitoutouge commented Mar 14, 2025

Same issue with a 4070 Ti Super 16 GB.

Using ollama:
qwen2.5:14b runs at 50+ tokens/s, gemma3:12b runs at 30+ tokens/s.

Using llama.cpp:
gemma3:12b runs at 60 tokens/s.
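
(llama.cpp's llama-bench can run that comparison in one command; a sketch below, with the GGUF path as a placeholder. llama-bench accepts comma-separated values, so -fa 0,1 should run both configurations back to back.)

./llama-bench -m ../models/gemma-3-12b-it-Q4_K_M.gguf -ngl 99 -fa 0,1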

@MMaturax
Author

MMaturax commented Mar 22, 2025

🔧 v0.6.3-rc0 Performance Improvements

I observed significant speed improvements across the board in v0.6.3-rc0 compared to v0.6.2:

📊 Token generation speed (tokens/sec):

Model        v0.6.2   v0.6.3-rc0   Improvement
gemma3:27b   52       68           🔼 +30.8%
gemma3:12b   87       113          🔼 +29.9%
gemma3:4b    150      205          🔼 +36.7%

In addition to better throughput, v0.6.3-rc0 also uses significantly less system RAM.

Great work 🚀

0.6.3-rc0

gemma3:27b

Image

gemma3:12b

Image

gemma3:4b

Image

@coder543

I'm noticing that the GPU offload calculations are very inaccurate for Gemma3.

~> ollama --version
ollama version is 0.6.3-rc0
~> ollama ps
NAME          ID              SIZE     PROCESSOR         UNTIL              
gemma3:27b    30ddded7fba6    25 GB    5%/95% CPU/GPU    4 minutes from now    
~> nvidia-smi
Sat Mar 22 14:39:42 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   55C    P2            120W /  420W |   16513MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   1303816      C   /usr/local/bin/ollama                       16506MiB |
+-----------------------------------------------------------------------------------------+

This is with 9000 context. If I lower it to 8192 context, then it will switch to 100% GPU offload, but... at 9000 context, it is using less than 17GB of the 24GB of VRAM that is available, which is leaving a lot on the table.
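
(If the offload estimate is just conservative, the layer count can be overridden per request; a hedged workaround sketch below, with the option values as examples rather than a recommendation. If the model genuinely doesn't fit, the load will fail instead of falling back.)

# Pin all layers to the GPU regardless of the estimate by setting num_gpu,
# and print the resulting tokens/s (jq assumed installed).
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "hello",
  "options": { "num_ctx": 9000, "num_gpu": 99 },
  "stream": false
}' | jq '.eval_count / .eval_duration * 1e9'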
