A high-throughput and memory-efficient inference and serving engine for LLMs
Topics: amd, cuda, inference, pytorch, transformer, llama, gpt, rocm, model-serving, tpu, hpu, mlops, xpu, llm, inferentia, llmops, llm-serving, trainium
Updated Dec 25, 2024 - Python