Why do we need triton_max_batch_size and the runtime engine max_batch_size separately? #753
sushilkumar-yadav asked this question in Q&A
I recently tested the batching feature of the Triton Inference Server with the TensorRT-LLM backend. Below are the steps I followed.

I'm wondering about the purpose of the triton_max_batch_size parameter passed to the preprocessing (and other) model configs. I don't see any change in the output when I modify that value. However, when I change --max_batch_size while building the engine (the trtllm-build command), I do observe a difference in latency. Why are the two tracked separately?
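From what I can tell so far, triton_max_batch_size only substitutes into the Triton-level max_batch_size field of each model's config.pbtxt, while the build-time flag is compiled into the engine itself. A quick check against the stock template (path assumed from the tensorrtllm_backend repo layout):

```bash
# Show where the fill_template.py parameter lands in the preprocessing config.
grep -n 'max_batch_size' \
    /tensorrtllm_backend/all_models/inflight_batcher_llm/preprocessing/config.pbtxt
# expected: max_batch_size: ${triton_max_batch_size}
```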
First, I built the engine with a build-time max_batch_size of 2048:

```bash
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
    --remove_input_padding enable \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --gemm_plugin float16 \
    --output_dir ${ENGINE_DIR} \
    --paged_kv_cache enable \
    --max_batch_size 2048
```
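To see what actually got baked into the engine, the build config written next to the engine can be inspected (assuming the default trtllm-build output layout, where a config.json sits in ${ENGINE_DIR}):

```bash
# The build-time limit is recorded in the engine's config.json; this is the
# hard cap the runtime can execute, independent of the Triton-side setting.
grep -o '"max_batch_size": *[0-9]*' ${ENGINE_DIR}/config.json
# expected: "max_batch_size": 2048
```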
A quick sanity check with the standalone runner:

```bash
python3 /tensorrtllm_backend/tensorrt_llm/examples/run.py \
    --engine_dir=/engines/llama3.1-8b-instruct/1-gpu/ \
    --max_output_len 50 \
    --tokenizer_dir /Llama-3.1-8B-Instruct-hf \
    --input_text "What is ML?"
```
Next, I copied the inflight_batcher_llm model repository and filled in the config templates for the preprocessing, postprocessing, ensemble, tensorrt_llm_bls, and tensorrt_llm models:

```bash
cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/.
```
```bash
TOKENIZER_DIR=/Llama-3.1-8B-Instruct-hf/
TOKENIZER_TYPE=auto
ENGINE_DIR=/engines/llama3.1-8b-instruct/1-gpu/
DECOUPLED_MODE=true
MODEL_FOLDER=/opt/tritonserver/inflight_batcher_llm
MAX_BATCH_SIZE=16
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MS=100   # NB: the template field below is max_queue_delay_microseconds, so 100 here means 100 us
TRITON_BACKEND=tensorrtllm
LOGITS_DATATYPE="TYPE_FP32"
FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:${LOGITS_DATATYPE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATATYPE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRITON_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE}
```
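With the repository filled in, I started the server using the launch script shipped in the backend repo (script path and flags assumed from tensorrtllm_backend's scripts/ directory):

```bash
# Launch Triton over the filled-in model repo; world_size 1 for a 1-GPU engine.
python3 /tensorrtllm_backend/scripts/launch_triton_server.py \
    --world_size 1 \
    --model_repo /opt/tritonserver/inflight_batcher_llm
```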
Thanks