ONNX export failure for models invoking SDPA attention #28610

Closed
BowenBao opened this issue Jan 19, 2024 · 6 comments · Fixed by #27931

@BowenBao
Contributor

ValueError: Attention using SDPA can not be traced with torch.jit.trace when no attention_mask is provided. To solve this issue, please either load your model with the argument attn_implementation="eager" or pass an attention_mask input when tracing the model.

There has been some discussion within the ONNX exporter team about possible resolutions. I'd like to post an issue here as well to seek advice and preferences.

  1. Check torch.jit.is_tracing() and fall back to the eager attn implementation if needed.
  2. Create an attention_mask before passing it to SDPA if it is None (a rough sketch of what this could look like follows this list).
  3. Support SDPA tracing w/o attention_mask (not sure how feasible this is).
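
As an illustration of option 2 (a hedged sketch, not the Transformers implementation; causal_attn_mask is a made-up helper): when no attention_mask is supplied, one could materialize an explicit additive causal mask and hand it to SDPA, so the kernel never needs the is_causal branch:

import torch
import torch.nn.functional as F

def causal_attn_mask(q_len, kv_len, dtype=torch.float32):
    # Disallowed positions get a large negative value, allowed ones get 0.
    mask = torch.full((q_len, kv_len), torch.finfo(dtype).min, dtype=dtype)
    # Query i may attend to keys j <= i + (kv_len - q_len), i.e. the mask is
    # bottom-right aligned, which is what decoding with a KV cache expects.
    return torch.triu(mask, diagonal=kv_len - q_len + 1)[None, None]

# Dummy shapes: 1 batch, 8 heads, 3 new queries attending over 5 keys/values.
q = torch.randn(1, 8, 3, 64)
k = torch.randn(1, 8, 5, 64)
v = torch.randn(1, 8, 5, 64)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=causal_attn_mask(3, 5))
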
@amyeroberts
Collaborator

cc @fxmarty

@fxmarty
Contributor

fxmarty commented Jan 22, 2024

Thank you for the ping, and thank you @BowenBao. cc @drisspg, and linking relevant issues as well: pytorch/pytorch#110681 & pytorch/pytorch#108108

Solution 3 (SDPA tracing without attention_mask) is, I think, not possible due to the data-dependent control flow here:

is_causal=self.is_causal and attention_mask is None and q_len > 1,
specifically the q_len > 1 condition. The reason for this control flow is that the attention mask SDPA generates from is_causal is top-left aligned.
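
To illustrate what top-left vs. bottom-right alignment means (a small assumed example, 2 new query tokens over a KV cache of 5 positions, not code from this thread):

import torch

q_len, kv_len = 2, 5

# Top-left aligned (what is_causal=True produces): query i sees keys j <= i.
top_left = torch.tril(torch.ones(q_len, kv_len, dtype=torch.bool))

# Bottom-right aligned (what decoding with a KV cache needs):
# query i sees keys j <= i + (kv_len - q_len).
bottom_right = torch.tril(torch.ones(q_len, kv_len, dtype=torch.bool), diagonal=kv_len - q_len)

print(top_left.int())      # [[1, 0, 0, 0, 0], [1, 1, 0, 0, 0]]
print(bottom_right.int())  # [[1, 1, 1, 1, 0], [1, 1, 1, 1, 1]]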

The same issue exists when tracing SDPA with symbolic_trace or with dynamo + fullgraph=True (https://pytorch.slack.com/archives/C033H6DJSJU/p1702029349053049?thread_ts=1694790001.945579&cid=C033H6DJSJU).

Solution 1 is what the error suggests. I don't think it would be easy to implement (it would need torch.jit.is_tracing() control flow that does magic on the model).

Solution 2 is probably the most doable (we would need to look at pad tokens). Currently we try as much as possible to pass attn_mask=None, since SDPA is able to dispatch to the memory-efficient attention and flash attention paths only in that case. We already avoid setting the attention_mask to None when we are tracing:

elif not is_tracing and torch.all(attention_mask == 1):
    if query_length == 1:
        # For query_length == 1, causal attention and bi-directional attention are the same.
        attention_mask = None
    elif key_value_length == query_length:
        attention_mask = None
    else:
        # Unfortunately, for query_length > 1 and key_value_length != query_length, we cannot
        # generally ignore the attention mask, as SDPA causal mask generation may be wrong.
        # We will set `is_causal=False` in SDPA and rely on Transformers attention_mask instead,
        # hence not setting it to None here.
        # Reference: https://github.com/pytorch/pytorch/issues/108108
        pass
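
As a minimal illustration (not Transformers code) of why the is_tracing guard above is needed: torch.jit.trace evaluates a data-dependent branch once, on the example input, and freezes the taken path into the graph:

import torch

def branchy(x):
    # Toy function with a data-dependent Python branch.
    if bool(torch.all(x == 1)):
        return x + 1  # branch taken while tracing with an all-ones example input
    return x - 1

# Tracing emits a TracerWarning and records only the `+ 1` branch...
traced = torch.jit.trace(branchy, torch.ones(3))
# ...so the traced graph is silently wrong for any other input.
print(traced(torch.zeros(3)))  # tensor([1., 1., 1.]) instead of tensor([-1., -1., -1.])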

@thiagocrepaldi

Hi @fxmarty, why is Solution 1 not easy to implement? I was thinking of something like:

if torch.jit.is_tracing():
    # fall back to the pre-SDPA attention path while tracing
    attn_mask = old_attention()
else:
    attn_mask = new_sdpa_attention()

@BowenBao
Copy link
Contributor Author

BowenBao commented Feb 1, 2024

Thanks for your reply and the context, @fxmarty.

I have a local fix using solution 1 and will put up a PR to unblock the exporter in the short term, while waiting on pytorch/pytorch#108108.
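
For reference, a rough sketch of what a solution-1 style guard could look like inside an SDPA attention forward (an assumption for illustration, not the actual PR): only when torch.jit.trace is running and no mask was provided, materialize an explicit boolean causal mask so the traced graph has no data-dependent control flow:

import torch
import torch.nn.functional as F

def sdpa_forward(q, k, v, attention_mask=None, is_causal=True):
    q_len, kv_len = q.shape[-2], k.shape[-2]
    if torch.jit.is_tracing() and is_causal and attention_mask is None:
        # Explicit bottom-right aligned boolean mask (True = may attend).
        attention_mask = torch.tril(
            torch.ones(q_len, kv_len, dtype=torch.bool, device=q.device),
            diagonal=kv_len - q_len,
        )
    return F.scaled_dot_product_attention(
        q, k, v,
        attn_mask=attention_mask,
        is_causal=is_causal and attention_mask is None and q_len > 1,
    )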

@LoicDagnas

@fxmarty I might be wrong but, installing from the latest source, I still have the same issue when exporting a BART-based model without an attention_mask. Is it something that is planned to be supported?

@fxmarty
Contributor

fxmarty commented Feb 26, 2024

@LoicDagnas Yes, it is expected. If you want to trace the model without an attention_mask input, you should load your model with the argument attn_implementation="eager" passed to from_pretrained, as suggested in the error that should be raised:

raise ValueError(
    'Attention using SDPA can not be traced with torch.jit.trace when no attention_mask is provided. To solve this issue, please either load your model with the argument `attn_implementation="eager"` or pass an attention_mask input when tracing the model.'
)

Note: this is due to the following control flow:

is_causal=self.is_causal and attention_mask is None and tgt_len > 1,

See for reference pytorch/pytorch#110681 & pytorch/pytorch#108108
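
For example, the suggested workaround could look like this (a sketch; facebook/bart-base is just a stand-in checkpoint):

import torch
from transformers import AutoModel, AutoTokenizer

# Load with the eager attention implementation so the model can be traced without
# an attention_mask input; torchscript=True makes the outputs trace-friendly tuples.
model = AutoModel.from_pretrained(
    "facebook/bart-base", attn_implementation="eager", torchscript=True
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
input_ids = tokenizer("Hello world", return_tensors="pt")["input_ids"]

# With attn_implementation="sdpa" this call would raise the ValueError quoted above.
traced = torch.jit.trace(model, (input_ids,), strict=False)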
