Add all_models/bert as an example for tensorrt-llm classification models #269

Open · wants to merge 1 commit into main
Conversation

@erenup commented Dec 31, 2023

Hi @kaiyux @Shixiaowei02,
First of all, thank you very much for the great tensorrtllm_backend!

Pull Request Topic

This PR adds an example of serving TensorRT-LLM classification models. I understand that TensorRT-LLM is mainly aimed at generation tasks, but the community can also benefit from TensorRT-LLM for classification, which is a fundamental NLP task as well. I hope it is helpful!

Features

  • This feature is closely related to my TensorRT-LLM PR "Add Roberta and a few new tests for Bert". The code in that PR can produce classification TensorRT engines for this tensorrtllm_backend PR.
  • I implemented the example classification models under all_models/bert. It contains three sub-directories: preprocessing, tensorrt_llm, and ensemble.
  • preprocessing is similar to all_models/gpt/preprocessing, but I removed unrelated parameters.
  • tensorrt_llm is similar to all_models/gpt/tensorrt_llm, but I load the engine directly in model.py since classification is simpler than generation.
  • ensemble is similar to all_models/gpt/ensemble, but I removed unrelated parameters.
  • Parameters that need to be modified or mentioned in the README: ${engine_dir} in all_models/bert/tensorrt_llm/config.pbtxt and ${tokenizer_dir} in all_models/bert/preprocessing/config.pbtxt (see the sketch after this list). I did not modify the main readme.md in this repo, since it may be better for you to decide how to organize the tensorrtllm_backend documentation.
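
As a rough illustration of how those two placeholders could be filled in before launching Triton, here is a minimal Python sketch. The script name and the command-line arguments (engine directory and tokenizer directory) are assumptions for illustration only, not part of this PR.

# fill_bert_placeholders.py -- hypothetical helper, not part of this PR.
# Replaces ${engine_dir} and ${tokenizer_dir} in the example config.pbtxt files.
import sys
from pathlib import Path

def fill(config_path: str, placeholder: str, value: str) -> None:
    # Read the config, substitute the placeholder, and write it back in place.
    path = Path(config_path)
    path.write_text(path.read_text().replace(placeholder, value))

if __name__ == "__main__":
    engine_dir, tokenizer_dir = sys.argv[1], sys.argv[2]
    fill("all_models/bert/tensorrt_llm/config.pbtxt", "${engine_dir}", engine_dir)
    fill("all_models/bert/preprocessing/config.pbtxt", "${tokenizer_dir}", tokenizer_dir)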

Tests

  • I tested the backend on my 4080 GPU, and the speed is impressive!
  • My simple speed test results, with the engine built with --use_gemm_plugin float16 --use_bert_attention_plugin float16 --enable_context_fmha, are shown in the screenshot below; my simple test Python script is also provided below.

[image: speed test results]

  • When use_gemm_plugin and use_bert_attention_plugin are disabled, the speed drops to roughly half.

Simple speed test script

import requests
import multiprocessing
import concurrent.futures
import time

# Configuration
SERVER_URL = "http://localhost:8000/v2/models/ensemble/generate"
NUM_REQUESTS = 1000  # Number of requests to send
MAX_WORKERS = multiprocessing.cpu_count()  # Number of concurrent workers
print(f'MAX_WORKERS: {MAX_WORKERS}')

def send_request():
    # Send one classification request to the Triton ensemble endpoint.
    data = '{"text_input": "This is tensorrt-llm for bert and roberta sequence classification models!", "bad_words": "", "stop_words": ""}'
    response = requests.post(SERVER_URL, data=data)
    return response

def main():
    # Fire NUM_REQUESTS requests concurrently and time the whole batch.
    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        start_time = time.time()
        futures = [executor.submit(send_request) for _ in range(NUM_REQUESTS)]
        concurrent.futures.wait(futures)
        end_time = time.time()

    # Calculate and print results
    total_time = end_time - start_time
    print(f"Total time for {NUM_REQUESTS} requests: {total_time} seconds")
    print(f"Average time per request: {total_time / NUM_REQUESTS} seconds")
    print(f"Requests per second: {NUM_REQUESTS / total_time}")

if __name__ == "__main__":
    main()
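
Before running the full load test, a single request can be inspected to check that the server is wired up correctly. This is just a sketch reusing send_request from the script above; the exact output field names in the response depend on the ensemble's config.pbtxt.

resp = send_request()    # reuse the helper defined above
print(resp.status_code)  # expect 200 if the ensemble is loaded
print(resp.json())       # output field names depend on the ensemble's config.pbtxt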

I hope this PR is useful for the community!

Happy New Year!
