Speculative Decoding - Draft Target model approach - Having issue with Triton inference Server #2709
Comments
Hi @sivabreddy, thanks for reporting this issue. Would you please use the latest main branch commit or the 0.16 release to verify the whole process?
Hi @nv-guomingz, thank you for your response. The issue is still the same with Triton Inference Server. I do see a response from the engine when I run the command below:
mpirun -n 1 --allow-run-as-root python3 /data/TensorRT-LLM/examples/run.py \
    --tokenizer_dir /data/llama3-1-8b \
    --draft_engine_dir /data/llama3-1-8b/draft_engine \
    --engine_dir /data/llama3-3-70b/target_engine \
    --draft_target_model_config="[10,[0],[1],False]" \
    --kv_cache_free_gpu_memory_fraction=0.95 \
    --max_output_len=1024 \
    --kv_cache_enable_block_reuse \
    --input_text="what is Newtons third law"
Could you please help verify the model repo config files? Thanks.
Hi @pcastonguay, could you please take a look at this issue?
Hi @pcastonguay, I wanted to follow up on the issue shared here earlier. Have you had a chance to look into it? Are there any solutions or workarounds? Your insights would be greatly appreciated.
Any help with this would be appreciated.
I tried deploying Llama 3.1-8B as the draft model and Llama 3.3-70B as the target model, following all the steps mentioned here:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/speculative-decoding.md
The server does not come up and start serving.
Container image: nvcr.io/nvidia/tritonserver:24.11-trtllm-python-py3
Version of tensorrt-llm: 0.15.0
Version of tensorrt-llm backend: 0.15.0
Here I'm sharing the commands used for building the engines, followed by the log.
Quantize the draft model:
python3 /data/TensorRT-LLM/examples/quantization/quantize.py \
    --model_dir /data/llama3-1-8b \
    --dtype bfloat16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir /data/llama3-1-8b/ckpt_draft \
    --calib_size 512 \
    --tp_size 1
Quantize the target model:
python3 /data/TensorRT-LLM/examples/quantization/quantize.py \
    --model_dir /data/llama3-3-70b \
    --dtype bfloat16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir /data/llama3-3-70b/ckpt_target \
    --calib_size 512 \
    --tp_size 1
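As a quick sanity check before building the engines (my own addition; it assumes quantize.py writes a config.json plus rank*.safetensors shards into --output_dir, which is the layout I'd expect from the 0.15 quantizer):
# Confirm both quantized checkpoints exist and record FP8 as the quantization algorithm:
ls -lh /data/llama3-1-8b/ckpt_draft /data/llama3-3-70b/ckpt_target
grep -H '"quant_algo"' /data/llama3-1-8b/ckpt_draft/config.json /data/llama3-3-70b/ckpt_target/config.json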
Build the engines.
Draft engine:
trtllm-build \
    --checkpoint_dir=/data/llama3-1-8b/ckpt_draft \
    --output_dir=/data/llama3-1-8b/draft_engine \
    --max_batch_size=1 \
    --max_input_len=2048 \
    --max_seq_len=3072 \
    --gpt_attention_plugin=bfloat16 \
    --gemm_plugin=fp8 \
    --remove_input_padding=enable \
    --kv_cache_type=paged \
    --context_fmha=enable \
    --use_paged_context_fmha=enable \
    --gather_generation_logits \
    --use_fp8_context_fmha=enable
Target engine:
trtllm-build \
    --checkpoint_dir=/data/llama3-3-70b/ckpt_target \
    --output_dir=/data/llama3-3-70b/target_engine \
    --max_batch_size=1 \
    --max_input_len=2048 \
    --max_seq_len=3072 \
    --gpt_attention_plugin=bfloat16 \
    --gemm_plugin=fp8 \
    --remove_input_padding=enable \
    --kv_cache_type=paged \
    --context_fmha=enable \
    --use_paged_context_fmha=enable \
    --gather_generation_logits \
    --use_fp8_context_fmha=enable \
    --max_draft_len=10 \
    --speculative_decoding_mode=draft_tokens_external
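To double-check that the builds picked up the speculative-decoding options, here is a small sketch that greps the generated engine configs; the key names (max_draft_len, speculative_decoding_mode, use_paged_context_fmha) are my assumption about the 0.15 engine config.json layout:
# Inspect the built engines; for the target engine I would expect max_draft_len to be 10,
# matching the "TRTGptModel maxDraftLen: 10" line in the launch log further down.
for d in /data/llama3-1-8b/draft_engine /data/llama3-3-70b/target_engine; do
  echo "== $d =="
  grep -o '"max_draft_len":[^,}]*' "$d/config.json"
  grep -o '"speculative_decoding_mode":[^,}]*' "$d/config.json"
  grep -o '"use_paged_context_fmha":[^,}]*' "$d/config.json"
done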
Environment variables used to prepare the Triton model repository:
ACCUMULATE_TOKEN="false"
BACKEND="tensorrtllm"
BATCH_SCHEDULER_POLICY="guaranteed_no_evict"
BATCHING_STRATEGY="inflight_fused_batching"
BLS_INSTANCE_COUNT="1"
DECODING_MODE="top_k_top_p"
DECOUPLED_MODE="False"
DRAFT_GPU_DEVICE_IDS="0"
E2E_MODEL_NAME="ensemble"
ENABLE_KV_CACHE_REUSE="true"
ENGINE_PATH=/data/llama3-3-70b/target_engine
EXCLUDE_INPUT_IN_OUTPUT="false"
KV_CACHE_FREE_GPU_MEM_FRACTION="0.95"
MAX_BEAM_WIDTH="1"
MAX_QUEUE_DELAY_MICROSECONDS="0"
NORMALIZE_LOG_PROBS="true"
POSTPROCESSING_INSTANCE_COUNT="1"
PREPROCESSING_INSTANCE_COUNT="1"
TARGET_GPU_DEVICE_IDS="1"
TENSORRT_LLM_DRAFT_MODEL_NAME="tensorrt_llm_draft"
TENSORRT_LLM_MODEL_NAME="tensorrt_llm"
TOKENIZER_PATH=/data/llama3-1-8b
TOKENIZER_TYPE=llama
TRITON_GRPC_PORT="8001"
TRITON_HTTP_PORT="8000"
TRITON_MAX_BATCH_SIZE="4"
TRITON_METRICS_PORT="8002"
TRITON_REPO="triton_repo"
USE_DRAFT_LOGITS="false"
DRAFT_ENGINE_PATH=/data/llama3-1-8b/draft_engine
ENABLE_CHUNKED_CONTEXT="true"
MAX_TOKENS_IN_KV_CACHE=""
MAX_ATTENTION_WINDOW_SIZE=""
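Before filling the templates below, a small check (my own addition) that the paths assigned above actually exist inside the container:
# Fail fast if any of the engine / tokenizer paths are wrong:
for p in "$ENGINE_PATH" "$DRAFT_ENGINE_PATH" "$TOKENIZER_PATH"; do
  [ -d "$p" ] && echo "OK       $p" || echo "MISSING  $p"
done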
Make a copy of the Triton repo and replace the fields in the configuration files.
Prepare the model repository for the TensorRT-LLM models:
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git checkout v0.15.0
apt-get update && apt-get install -y build-essential cmake git-lfs
pip3 install git-lfs tritonclient grpcio
rm -rf ${TRITON_REPO}
cp -R all_models/inflight_batcher_llm ${TRITON_REPO}
python3 tools/fill_template.py -i ${TRITON_REPO}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE}
python3 tools/fill_template.py -i ${TRITON_REPO}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_PATH},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${PREPROCESSING_INSTANCE_COUNT}
python3 tools/fill_template.py -i ${TRITON_REPO}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_PATH},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${POSTPROCESSING_INSTANCE_COUNT}
python3 tools/fill_template.py -i ${TRITON_REPO}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},accumulate_tokens:${ACCUMULATE_TOKEN},bls_instance_count:${BLS_INSTANCE_COUNT},tensorrt_llm_model_name:${TENSORRT_LLM_MODEL_NAME},tensorrt_llm_draft_model_name:${TENSORRT_LLM_DRAFT_MODEL_NAME}
Make a copy of tensorrt_llm to serve as the configuration for the draft / target models.
cp -R ${TRITON_REPO}/tensorrt_llm ${TRITON_REPO}/tensorrt_llm_draft
sed -i 's/name: "tensorrt_llm"/name: "tensorrt_llm_draft"/g' ${TRITON_REPO}/tensorrt_llm_draft/config.pbtxt
python3 tools/fill_template.py -i ${TRITON_REPO}/tensorrt_llm/config.pbtxt triton_backend:${BACKEND},engine_dir:${ENGINE_PATH},decoupled_mode:${DECOUPLED_MODE},max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},batching_strategy:${BATCHING_STRATEGY},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},max_beam_width:${MAX_BEAM_WIDTH},enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},normalize_log_probs:${NORMALIZE_LOG_PROBS},enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},gpu_device_ids:${TARGET_GPU_DEVICE_IDS},decoding_mode:${DECODING_MODE},encoder_input_features_data_type:TYPE_FP16
python3 tools/fill_template.py -i ${TRITON_REPO}/tensorrt_llm_draft/config.pbtxt triton_backend:${BACKEND},engine_dir:${DRAFT_ENGINE_PATH},decoupled_mode:${DECOUPLED_MODE},max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},batching_strategy:${BATCHING_STRATEGY},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},max_beam_width:${MAX_BEAM_WIDTH},enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},normalize_log_probs:${NORMALIZE_LOG_PROBS},enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},gpu_device_ids:${DRAFT_GPU_DEVICE_IDS},decoding_mode:${DECODING_MODE},encoder_input_features_data_type:TYPE_FP16
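One thing worth checking at this point (suggested because the launch log below warns about ${skip_special_tokens}, ${add_special_tokens} and ${max_num_images} not being set) is whether any template placeholders were left unfilled:
# List every ${...} placeholder that fill_template.py did not substitute in the generated repo:
grep -rn '\${' ${TRITON_REPO}/*/config.pbtxt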
root@triton-spec-decode-64884b776d-k9dnc:/data# python3 /data/tensorrtllm_backend/scripts/launch_triton_server.py \
    --model_repo=/data/tensorrtllm_backend/triton_repo \
    --tensorrt_llm_model_name "tensorrt_llm_draft,tensorrt_llm" \
    --multi-model \
    --log \
    --log-file /data/tensorrtllm_backend/triton_server.log &
[1] 43399
root@triton-spec-decode-64884b776d-k9dnc:/data# [TensorRT-LLM][INFO] Using GPU device ids: 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] cross_kv_cache_fraction is not specified, error if it's encoder-decoder model, otherwise ok
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_chunked_context is set to true, will use context chunking (requires building the model with use_paged_context_fmha).
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][WARNING] multi_block_mode is not specified, will be set to true
[TensorRT-LLM][WARNING] enable_context_fmha_fp32_acc is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_mode is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_cache_size is not specified, will be set to 0
[TensorRT-LLM][INFO] speculative_decoding_fast_logits is not specified, will be set to false
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.15.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Engine version 0.15.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Using user-specified devices: (1)
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Using user-specified devices: (1)
[TensorRT-LLM][INFO] Rank 0 is using GPU 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 3082
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 10
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (3082) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 3072
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 3081 = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 3082 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Using GPU device ids: 0
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] cross_kv_cache_fraction is not specified, error if it's encoder-decoder model, otherwise ok
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_chunked_context is set to true, will use context chunking (requires building the model with use_paged_context_fmha).
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multi_block_mode is not specified, will be set to true
[TensorRT-LLM][WARNING] enable_context_fmha_fp32_acc is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_mode is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_cache_size is not specified, will be set to 0
[TensorRT-LLM][INFO] speculative_decoding_fast_logits is not specified, will be set to false
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.15.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:14: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'. If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source? warn(
[TensorRT-LLM][WARNING] Don't setup 'skip_special_tokens' correctly (set value is ${skip_special_tokens}). Set it as True by default.
[TensorRT-LLM][INFO] Engine version 0.15.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Using user-specified devices: (0)
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Using user-specified devices: (0)
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 3072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (3072) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 3072
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 3071 = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 3072 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][WARNING] 'max_num_images' parameter is not set correctly (value is ${max_num_images}). Will be set to None
[TensorRT-LLM][WARNING] Don't setup 'add_special_tokens' correctly (set value is ${add_special_tokens}). Set it as True by default.
[TensorRT-LLM][INFO] Loaded engine size: 8730 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 294.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 8724 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 4.00 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.32 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.10 GiB, available: 69.15 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 16819
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 48
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 65.70 GiB for max tokens in paged KV cache (1076416).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
[TensorRT-LLM][INFO] Executor instance created by worker
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][INFO] Loaded engine size: 69369 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 540.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 69354 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.69 MB GPU memory for runtime buffers.
[TensorRT-LLM][WARNING] Overwriting decoding mode to external draft token
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 12.40 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.10 GiB, available: 10.21 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 994
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 49
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 9.71 GiB for max tokens in paged KV cache (63616).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
[TensorRT-LLM][INFO] Executor instance created by worker
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
After this there is nothing more on the console, but I can see that the GPUs are loaded with the weights.
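To tell whether the server is hung or simply quiet, this is what I would try next; it is only a sketch, assuming the default HTTP port 8000 from TRITON_HTTP_PORT above and that the Triton generate extension is available in this container (the prompt is the same one used with run.py):
# Follow the log file that launch_triton_server.py was pointed at:
tail -f /data/tensorrtllm_backend/triton_server.log
# Check server and model readiness on the configured HTTP port:
curl -s -o /dev/null -w "server ready:   %{http_code}\n" localhost:8000/v2/health/ready
curl -s -o /dev/null -w "ensemble ready: %{http_code}\n" localhost:8000/v2/models/ensemble/ready
# If both return 200, send a minimal request through the generate endpoint:
curl -s -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "what is Newtons third law", "max_tokens": 64}'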
GPU details:
Every 1.0s: nvidia-smi triton-spec-decode-64884b776d-k9dnc: Sun Jan 19 14:41:47 2025
Sun Jan 19 14:41:47 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:1B:00.0 Off | 0 |
| N/A 35C P0 148W / 700W | 77993MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:29:00.0 Off | 0 |
| N/A 32C P0 143W / 700W | 80508MiB / 81559MiB | 0% Default |
| | | Disabled |