VLM Pipeline (Intern,LLava) #256
Conversation
Signed-off-by: Dipankar Sarkar <quic_dipankar@quicinc.com>
return generate_func(**kwargs)

def generate_inputs_intern(self, **kwargs):
    bs: int = constants.ONNX_EXPORT_EXAMPLE_BATCH_SIZE
Please move this inside the modeling file.
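A rough sketch of what the hook could look like once it lives next to the InternVL model code, so the pipeline stays architecture-agnostic. The function name and the seq_len/img_size defaults below are illustrative assumptions, not the existing API:

# Hypothetical hook inside the InternVL modeling file; names and defaults are illustrative.
import torch

def get_dummy_onnx_inputs(batch_size: int = 1, seq_len: int = 128, img_size: int = 448):
    # Example inputs, dynamic axes and output names for ONNX export, defined alongside
    # the model definition instead of in the generic pipeline code.
    inputs = {
        "input_ids": torch.zeros((batch_size, seq_len), dtype=torch.int64),
        "pixel_values": torch.zeros((batch_size, 3, img_size, img_size), dtype=torch.float32),
    }
    dynamic_axes = {
        "input_ids": {0: "batch_size", 1: "seq_len"},
        "pixel_values": {0: "batch_size"},
    }
    output_names = ["logits"]
    return inputs, dynamic_axes, output_names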
return inputs, dynamic_axes, output_names

def generate_inputs_llava(self, **kwargs):
    bs: int = constants.ONNX_EXPORT_EXAMPLE_BATCH_SIZE
Move this inside the modeling file, modelling_llava.py.
# )
# num_logits_to_keep = num_speculative_tokens + 1
# if prefill_seq_len < num_logits_to_keep:
#     raise ValueError(
Remove commented lines.
generation_len = self.ctx_len - input_len.max()  # in standalone this is a tensor
assert generation_len > 0, "generation length should be greater than zero"
generated_ids = np.full((batch_size, generation_len + 1), self.processor.tokenizer.pad_token_id)
# inputs["input_ids"] = torch.nn.functional.pad(inputs["input_ids"], (0, self.seq_len_constant - inputs["input_ids"].size(1)), "constant", self.pad_token_id)
Remove.
PROMPT_LEN = 8
INPUT_STR = ["My name is"]
GB = 2**30
MAX_QPC_LIMIT = 30
MAX_RETRIES = 5  # Maximum number of retry attempts when downloading a model via huggingface_hub snapshot_download
NUM_SPECULATIVE_TOKENS = 2
CTX_LEN_VLM_LLAVA = 1280
IMG_SIZE = 336
Are you using these at the time of export to define shapes?
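If so, the usual pattern would be to feed them directly into the example export inputs. The snippet below is illustrative only; the helper name is hypothetical and the constants simply mirror the values above:

# Illustrative only: consuming the LLaVA shape constants when building export inputs.
import torch

CTX_LEN_VLM_LLAVA = 1280  # mirrors the constant above
IMG_SIZE = 336            # mirrors the constant above

def example_llava_export_inputs(batch_size: int = 1):
    # The context length bounds the text input; the image size fixes the vision-tower input.
    return {
        "input_ids": torch.zeros((batch_size, CTX_LEN_VLM_LLAVA), dtype=torch.int64),
        "pixel_values": torch.zeros((batch_size, 3, IMG_SIZE, IMG_SIZE), dtype=torch.float32),
    }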
# if __name__ == "__main__":
#     # model_name = "OpenGVLab/InternVL2_5-1B"
Remove commented parts.
@@ -251,6 +252,7 @@ def _compile(
        if num_speculative_tokens:
            compile_hash.update(to_hashable({"num_speculative_tokens": num_speculative_tokens}))

        # import ipdb; ipdb.set_trace()
Remove these lines.
if hasattr(module, "__qeff_init__"):
    module.__qeff_init__()
    transformed = True
Can we combine both if conditions?
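Assuming the earlier condition (outside this hunk) guards the same transformed flag, a merged form could look roughly like the sketch below; mapped_class stands in for whatever that first check produces and is not an existing name:

# Sketch only: `mapped_class` is a stand-in for the result of the earlier, unshown check.
def _maybe_qeff_init(module, mapped_class):
    transformed = False
    # Single branch instead of two sequential ifs, assuming both guard the same action.
    if mapped_class is not None and hasattr(module, "__qeff_init__"):
        module.__qeff_init__()
        transformed = True
    return transformed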
input_ids_size = input_ids.shape[1]
# attention_mask = inputs["attention_mask"]
inputs["input_ids"] = torch.nn.functional.pad(
    inputs["input_ids"], (0, 3072 - input_ids_size), "constant", self.processor.tokenizer.pad_token_id
Please avoid hardcoded value.
Make this value generic and fetch it from the QPC session (prefill_seq_len), i.e. whichever value it was compiled for.
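A rough sketch of what that could look like; self._prefill_seq_len is an assumed attribute populated from the prefill length the QPC was compiled with, not an existing field:

# Sketch only: pad to the compiled prefill length instead of a hardcoded 3072.
import torch

def _pad_to_prefill_len(self, inputs):
    prefill_seq_len = self._prefill_seq_len  # assumed to come from the compiled QPC / session
    pad_len = max(prefill_seq_len - inputs["input_ids"].shape[1], 0)
    inputs["input_ids"] = torch.nn.functional.pad(
        inputs["input_ids"], (0, pad_len), "constant", self.processor.tokenizer.pad_token_id
    )
    return inputs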
breakpoint()
self.model.config.use_cache = True
self.processor = processor
self.num_layers = model.config.text_config.num_hidden_layers
Make fetching num_layers generic, along with the padding shape. Please refer to the LLaVA PR and use a similar function that fetches these based on the model architecture.
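Something along these lines would keep the pipeline architecture-agnostic (a sketch; the nested-config attribute names checked here are common HuggingFace conventions, not a verified or exhaustive list):

# Sketch only: a generic lookup for the decoder layer count across VLM configs.
def get_num_hidden_layers(config):
    # Multimodal configs usually nest the decoder config under text_config or a similar attribute.
    for attr in ("text_config", "llm_config", "language_config"):
        sub_config = getattr(config, attr, None)
        if sub_config is not None and hasattr(sub_config, "num_hidden_layers"):
            return sub_config.num_hidden_layers
    # Decoder-only configs expose it at the top level.
    return config.num_hidden_layers

Used as self.num_layers = get_num_hidden_layers(model.config), replacing the direct model.config.text_config.num_hidden_layers access.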
Already addressed in #267.
Added Generic Framework to onboard and run VLMs in QEff