
[PoC] Improve TRTLLM deployment UX #650

Draft: rmccorm4 wants to merge 3 commits into main from rmccormick/ux
Conversation

@rmccorm4 (Contributor) commented Nov 22, 2024

Changes

  • Replace mandatory template values in the configs with sensible defaults
  • Support building a TRTLLM engine on model load via the LLM API if none is found (sketched below)
  • Add env vars to conveniently configure the engine and tokenizer from a single location instead of specifying them in every model config
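
A minimal sketch of what the on-demand build could look like inside the tensorrt_llm model's initialize(); the helper name and the exact LLM API surface used here are assumptions for illustration, not the code in this PR:

# Sketch only: build an engine on first load if TRTLLM_ENGINE_DIR has no engines.
import os
from pathlib import Path

def ensure_engine(default_engine_dir: str) -> str:
    engine_dir = Path(os.environ.get("TRTLLM_ENGINE_DIR", default_engine_dir))
    engine_dir.mkdir(parents=True, exist_ok=True)

    # Pre-built engine workflow: engines already present, nothing to build.
    if any(engine_dir.glob("*.engine")):
        return str(engine_dir)

    # Quickstart workflow: build from the HF model id given by TRTLLM_MODEL.
    model = os.environ.get("TRTLLM_MODEL")
    if not model:
        raise ValueError(f"No engines found in {engine_dir} and TRTLLM_MODEL is not set")

    from tensorrt_llm import LLM  # high-level LLM API; availability in the container assumed
    llm = LLM(model=model)        # downloads weights and builds an engine with default settings
    llm.save(str(engine_dir))     # persist the built engine/config for later runs
    return str(engine_dir)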

Example Usage

Quickstart - no engine, no tokenizer, build on demand

# Launch TRTLLM container
docker run -ti \
    --gpus all \
    --network=host \
    --shm-size=1g \
    --ulimit memlock=-1 \
    -e HF_TOKEN \
    -v ${HOME}:/mnt \
    -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
    nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3

# Clone these changes
git clone -b rmccormick/ux https://github.com/triton-inference-server/tensorrtllm_backend.git

# Specify directory for engines and tokenizer config to either be read from, or written to
export TRTLLM_ENGINE_DIR="/tmp/hackathon"
# Specify model to build if TRTLLM_ENGINE_DIR has no engines
export TRTLLM_MODEL="meta-llama/Meta-Llama-3.1-8B-Instruct"
# Workaround to load the HF tokenizer while the engine is built on demand (avoids
# model-load ordering issues), or when the tokenizer lives in a different location.
export TRTLLM_TOKENIZER="meta-llama/Meta-Llama-3.1-8B-Instruct"

# Start server
tritonserver --model-repository ./tensorrtllm_backend/all_models/inflight_batcher_llm

Pre-built engine + tokenizer already in same location

export TRTLLM_ENGINE_DIR="/tmp/hackathon"

# Start server
tritonserver --model-repository ./tensorrtllm_backend/all_models/inflight_batcher_llm

Pre-built engine + tokenizer in different locations

export TRTLLM_ENGINE_DIR="/tmp/hackathon"
export TRTLLM_TOKENIZER="meta-llama/Meta-Llama-3.1-8B-Instruct"

# Start server
tritonserver --model-repository ./tensorrtllm_backend/all_models/inflight_batcher_llm

Further customization/configuration

Manually tune values in the config.pbtxt files as needed to configure Triton or TRT-LLM runtime fields.

Open Items

  • Ordering: If the engine and tokenizer don't exist yet, and the preprocessing/postprocessing models load before the tensorrt_llm model builds the engine and downloads the tokenizer, they will fail to load because no tokenizer is found.
    • Added the TRTLLM_TOKENIZER env var as a WAR (workaround) for the ordering issue for now; see the sketch after this list.
  • Support building engine from a TRTLLM-generated config.json if config is found but engines are not
  • Support configuring more TRTLLM backend/runtime fields from the engine's config.json
  • Test a multi-GPU engine (e.g., Llama 70B)
  • Re-use common logic around tokenizer / env vars in preprocessing and postprocessing models
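
A minimal sketch of the shared tokenizer resolution the preprocessing/postprocessing models could use for the WAR above; the helper name and fallback order are assumptions for illustration, not this PR's actual code:

# Sketch only: prefer TRTLLM_TOKENIZER so the tokenizer can load even while the
# tensorrt_llm model is still building its engine on demand.
import os
from transformers import AutoTokenizer

def load_tokenizer(tokenizer_dir_from_config: str):
    source = (
        os.environ.get("TRTLLM_TOKENIZER")      # HF model id or local path override
        or os.environ.get("TRTLLM_ENGINE_DIR")  # tokenizer stored alongside the engines
        or tokenizer_dir_from_config            # value templated into config.pbtxt
    )
    return AutoTokenizer.from_pretrained(source)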
  • [Extra] Probably not in scope for this PR, but there is also a Python model shutdown segfault:
[ced35d0-lcedt:2992 :0:2992] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x60)
==== backtrace (tid:   2992) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x00000000000b44d0 triton::backend::python::Metric::SaveToSharedMemory()  :0
 2 0x00000000000b536e triton::backend::python::Metric::Clear()  :0
 3 0x00000000000b9291 triton::backend::python::MetricFamily::~MetricFamily()  :0
 4 0x00000000000677f2 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()  ???:0
 5 0x000000000006822a pybind11::class_<triton::backend::python::MetricFamily, std::shared_ptr<triton::backend::python::MetricFamily> >::dealloc()  ???:0
 6 0x0000000000037b5d pybind11::detail::clear_instance()  :0
 7 0x0000000000038b13 pybind11_object_dealloc()  ???:0
 8 0x000000000011bea5 PyODict_DelItem()  ???:0
 9 0x0000000000144b37 PyType_GenericAlloc()  ???:0
10 0x000000000005142e triton::backend::python::Stub::~Stub()  :0
11 0x0000000000028f53 main()  ???:0
12 0x0000000000029d90 __libc_init_first()  ???:0
13 0x0000000000029e40 __libc_start_main()  ???:0
14 0x0000000000029db5 _start()  ???:0
=================================
I1122 22:08:07.287915 2117 model_lifecycle.cc:624] "successfully unloaded 'tensorrt_llm' version 1"

…TLLM engine on model load if none found, add env vars for conveniently configuring engine and tokenizers from a single location
@rmccorm4 marked this pull request as draft on November 22, 2024 at 21:37
@@ -330,7 +330,7 @@ parameters: {
 
 instance_group [
 {
-count: ${bls_instance_count}
+count: 1
@rmccorm4 (Contributor, Author) commented:

TODO: Should evaluate what reasonable instance count defaults are
