
Add ccache to amd #5555

Merged

merged 2 commits into from Jun 15, 2024
Conversation

simon-mo
Collaborator

This should help reduce recompilation on each run, which currently leads to 40+ minute build times.
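For context, here is a minimal sketch of how ccache is typically wired into a Dockerfile-based CMake build. The package install, cache path, and build command below are illustrative assumptions, not the exact change in this PR:

```dockerfile
# Hypothetical fragment, not the exact diff from this PR: package names,
# cache paths, and the build command are illustrative assumptions.
RUN apt-get update && apt-get install -y ccache

# Route compiler invocations through ccache and keep the cache at a fixed
# path so a BuildKit cache mount can persist it across builds.
ENV CCACHE_DIR=/root/.ccache \
    CMAKE_C_COMPILER_LAUNCHER=ccache \
    CMAKE_CXX_COMPILER_LAUNCHER=ccache

# Reuse the ccache directory between docker builds via a cache mount.
RUN --mount=type=cache,target=/root/.ccache \
    python3 setup.py bdist_wheel --dist-dir=dist
```

With the compiler launcher set, rebuilds that touch only a few source files reuse cached object files instead of recompiling every extension from scratch.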

comaniac merged commit bd7efe9 into vllm-project:main on Jun 15, 2024
66 checks passed
robertgshaw2-redhat pushed a commit to neuralmagic/nm-vllm that referenced this pull request Jun 23, 2024
kzawora-intel added a commit to HabanaAI/vllm-fork that referenced this pull request Jul 2, 2024
* [Hardware][Intel] Optimize CPU backend and add more performance tips (vllm-project#4971)

Co-authored-by: Jianan Gu <jianan.gu@intel.com>

* [Docs] Add 4th meetup slides (vllm-project#5509)

* [Misc] Add vLLM version getter to utils (vllm-project#5098)

* [CI/Build] Simplify OpenAI server setup in tests (vllm-project#5100)

* [Doc] Update LLaVA docs (vllm-project#5437)

Co-authored-by: Roger Wang <ywang@roblox.com>

* [Kernel] Factor out epilogues from cutlass kernels (vllm-project#5391)

Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>

* [MISC] Remove FP8 warning (vllm-project#5472)

Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>

* Seperate dev requirements into lint and test (vllm-project#5474)

* Revert "[Core] Remove unnecessary copies in flash attn backend" (vllm-project#5478)

* [misc] fix format.sh (vllm-project#5511)

* [CI/Build] Disable test_fp8.py (vllm-project#5508)

* [Kernel] Disable CUTLASS kernels for fp8 (vllm-project#5505)

* Add `cuda_device_count_stateless` (vllm-project#5473)

* [Hardware][Intel] Support CPU inference with AVX2 ISA (vllm-project#5452)

* [Misc] Fix arg names in quantizer script (vllm-project#5507)

* bump version to v0.5.0.post1 (vllm-project#5522)

* [CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with `perf-benchmarks` label (vllm-project#5073)

Co-authored-by: simon-mo <simon.mo@hey.com>

* [CI/Build] Disable LLaVA-NeXT CPU test (vllm-project#5529)

* [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (vllm-project#5516)

* [Misc] Fix arg names (vllm-project#5524)

* [ Misc ] Rs/compressed tensors cleanup (vllm-project#5432)

Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>

* [Kernel] Suppress mma.sp warning on CUDA 12.5 and later (vllm-project#5401)

* [mis] fix flaky test of test_cuda_device_count_stateless (vllm-project#5546)

* [Core] Remove duplicate processing in async engine (vllm-project#5525)

* [misc][distributed] fix benign error in `is_in_the_same_node` (vllm-project#5512)

* [Docs] Add ZhenFund as a Sponsor (vllm-project#5548)

* [Doc] Update documentation on Tensorizer (vllm-project#5471)

* [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models  (vllm-project#5460)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

* [Bugfix] Fix typo in Pallas backend (vllm-project#5558)

* [Core][Distributed] improve p2p cache generation (vllm-project#5528)

* Add ccache to amd (vllm-project#5555)

* [Core][Bugfix]: fix prefix caching for blockv2 (vllm-project#5364)

Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>

* [mypy] Enable type checking for test directory (vllm-project#5017)

* [CI/Build] Test both text and token IDs in batched OpenAI Completions API (vllm-project#5568)

* [misc] Do not allow to use lora with chunked prefill. (vllm-project#5538)

Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>

* add gptq_marlin test for bug report vllm-project#5088 (vllm-project#5145)

* [BugFix] Don't start a Ray cluster when not using Ray (vllm-project#5570)

* [Fix] Correct OpenAI batch response format (vllm-project#5554)

* Add basic correctness 2 GPU tests to 4 GPU pipeline (vllm-project#5518)

* [CI][BugFix] Flip is_quant_method_supported condition (vllm-project#5577)

* [build][misc] limit numpy version (vllm-project#5582)

* [Doc] add debugging tips for crash and multi-node debugging (vllm-project#5581)

* Fix w8a8 benchmark and add Llama-3-8B (vllm-project#5562)

* [Model] Rename Phi3 rope scaling type (vllm-project#5595)

* Correct alignment in the seq_len diagram. (vllm-project#5592)

Co-authored-by: Liqian Chen <liqian.chen@deeplang.ai>

* [Kernel] `compressed-tensors` marlin 24 support (vllm-project#5435)

* [Misc] use AutoTokenizer for benchmark serving when vLLM not installed (vllm-project#5588)

* [Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (vllm-project#3814)

Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* [CI/BUILD] Support non-AVX512 vLLM building and testing (vllm-project#5574)

* [CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard (vllm-project#5571)

* [bugfix][distributed] fix 16 gpus local rank arrangement (vllm-project#5604)

* [Optimization] use a pool to reuse LogicalTokenBlock.token_ids (vllm-project#5584)

* [Bugfix] Fix KV head calculation for MPT models when using GQA (vllm-project#5142)

* [Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py (vllm-project#5606)

* [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier (vllm-project#5131)

* [Model] Initialize Phi-3-vision support (vllm-project#4986)

* [Kernel] Add punica dimensions for Granite 13b (vllm-project#5559)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

* [misc][typo] fix typo (vllm-project#5620)

* [Misc] Fix typo (vllm-project#5618)

* [CI] Avoid naming different metrics with the same name in performance benchmark (vllm-project#5615)

* [bugfix][distributed] improve p2p capability test (vllm-project#5612)

[bugfix][distributed] do not error if two processes do not agree on p2p capability (vllm-project#5612)

* [Misc] Remove import from transformers logging (vllm-project#5625)

* [CI/Build][Misc] Update Pytest Marker for VLMs (vllm-project#5623)

* [ci] Deprecate original CI template (vllm-project#5624)

Signed-off-by: kevin <kevin@anyscale.com>

* [Misc] Add OpenTelemetry support (vllm-project#4687)

This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.

I've also added a markdown doc with screenshots to guide users on how to use this feature. You can find it here

* [Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (vllm-project#5542)

* [ci] Setup Release pipeline and build release wheels with cache (vllm-project#5610)

Signed-off-by: kevin <kevin@anyscale.com>

* [Model] LoRA support added for command-r (vllm-project#5178)

* [Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties  (vllm-project#5639)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

* [Doc] Added cerebrium as Integration option (vllm-project#5553)

* [Bugfix] Fix CUDA version check for mma warning suppression (vllm-project#5642)

* [Bugfix] Fix w8a8 benchmarks for int8 case (vllm-project#5643)

* [Bugfix] Fix Phi-3 Long RoPE scaling implementation (vllm-project#5628)

* [Bugfix] Added test for sampling repetition penalty bug. (vllm-project#5659)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

* [Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices (vllm-project#5641)

* [misc][distributed] use 127.0.0.1 for single-node (vllm-project#5619)

* [Model] Add FP8 kv cache for Qwen2 (vllm-project#5656)

* [Bugfix] Fix sampling_params passed incorrectly in Phi3v example (vllm-project#5684)

* [Misc]Add param max-model-len in benchmark_latency.py (vllm-project#5629)

* [CI/Build] Add tqdm to dependencies (vllm-project#5680)

* [ci] Add A100 queue into AWS CI template (vllm-project#5648)

Signed-off-by: kevin <kevin@anyscale.com>

* [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py (vllm-project#5688)

* [ci][distributed] add tests for custom allreduce (vllm-project#5689)

* [Bugfix] AsyncLLMEngine hangs with asyncio.run (vllm-project#5654)

* [Doc] Update docker references (vllm-project#5614)

Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>

* [Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (vllm-project#5650)

* [ci] Limit num gpus if specified for A100 (vllm-project#5694)

Signed-off-by: kevin <kevin@anyscale.com>

* [Misc] Improve conftest (vllm-project#5681)

* [Bugfix][Doc] FIx Duplicate Explicit Target Name Errors (vllm-project#5703)

* [Kernel] Update Cutlass int8 kernel configs for SM90 (vllm-project#5514)

Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

* [Model] Port over CLIPVisionModel for VLMs (vllm-project#5591)

* [Kernel] Update Cutlass int8 kernel configs for SM80 (vllm-project#5275)

Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

* [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (vllm-project#5715)

* [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (vllm-project#5718)

* [distributed][misc] use fork by default for mp (vllm-project#5669)

* [Model] MLPSpeculator speculative decoding support (vllm-project#4947)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>

* [Kernel] Add punica dimension for Qwen2 LoRA (vllm-project#5441)

* [BugFix] Fix test_phi3v.py (vllm-project#5725)

* [Bugfix] Add  fully sharded layer for QKVParallelLinearWithLora (vllm-project#5665)

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>

* [Core][Distributed] add shm broadcast (vllm-project#5399)

Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

* [Kernel][CPU] Add Quick `gelu` to CPU (vllm-project#5717)

* [Doc] Documentation on supported hardware for quantization methods (vllm-project#5745)

* [BugFix] exclude version 1.15.0 for modelscope (vllm-project#5668)

* [ci][test] fix ca test in main (vllm-project#5746)

* [LoRA] Add support for pinning lora adapters in the LRU cache (vllm-project#5603)

* [CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline (vllm-project#5616)

* [Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs (vllm-project#5710)

Co-authored-by: Roger Wang <ywang@roblox.com>

* [Misc] Remove vllm-project#4789 workaround left in vllm/entrypoints/openai/run_batch.py (vllm-project#5756)

* [Bugfix] Fix pin_lora error in TPU executor (vllm-project#5760)

* [Docs][TPU] Add installation tip for TPU (vllm-project#5761)

* [core][distributed] improve shared memory broadcast (vllm-project#5754)

* [BugFix] [Kernel] Add Cutlass2x fallback kernels (vllm-project#5744)

Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

* [Distributed] Add send and recv helpers (vllm-project#5719)

* [Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement (vllm-project#5772)

* [doc][faq] add warning to download models for every nodes (vllm-project#5783)

* post-rebase api adjustments

* [Doc] Add "Suggest edit" button to doc pages (vllm-project#5789)

* [Doc] Add Phi-3-medium to list of supported models (vllm-project#5788)

* [Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args (vllm-project#5795)

* [ci] Remove aws template (vllm-project#5757)

Signed-off-by: kevin <kevin@anyscale.com>

* [Doc] Add notice about breaking changes to VLMs (vllm-project#5818)

* [Speculative Decoding] Support draft model on different tensor-parallel size than target model (vllm-project#5414)

* add pin_lora to habana components

* add WA for model loader

* fix api mismatches with ray

* tensor parallel fixes

* workers cpu alignment fix

* [Misc] Remove useless code in cpu_worker (vllm-project#5824)

* prefill/decode metadata fixes

* [Core] Add fault tolerance for `RayTokenizerGroupPool` (vllm-project#5748)

* re-enable attn metadata trimming

* worker_use_ray fix

* [doc][distributed] add both gloo and nccl tests (vllm-project#5834)

* [CI/Build] Add unit testing for FlexibleArgumentParser (vllm-project#5798)

* [Misc] Update `w4a16` `compressed-tensors` support to include `w8a16` (vllm-project#5794)

* [Hardware][TPU] Refactor TPU backend (vllm-project#5831)

* [Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes (vllm-project#5422)

* [Hardware][TPU] Raise errors for unsupported sampling params (vllm-project#5850)

* [CI/Build] Add E2E tests for MLPSpeculator (vllm-project#5791)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

* [Bugfix] Fix assertion in NeuronExecutor (vllm-project#5841)

* [Core] Refactor Worker and ModelRunner to consolidate control plane communication (vllm-project#5408)

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Stephanie <swang@anyscale.com>
Co-authored-by: Stephanie <swang@anyscale.com>

* [Misc][Doc] Add Example of using OpenAI Server with VLM (vllm-project#5832)

* [bugfix][distributed] fix shm broadcast when the queue size is full (vllm-project#5801)

* [Bugfix] Fix embedding to support 2D inputs (vllm-project#5829)

* [Bugfix][TPU] Fix KV cache size calculation (vllm-project#5860)

* [CI/Build] Refactor image test assets (vllm-project#5821)

* [Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (vllm-project#5560)

Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* [Frontend] Add tokenize/detokenize endpoints (vllm-project#5054)

* [Hardware][TPU] Support parallel sampling & Swapping (vllm-project#5855)

* [Bugfix][TPU] Fix CPU cache allocation (vllm-project#5869)

* Support CPU inference with VSX PowerPC ISA (vllm-project#5652)

* [doc] update usage of env var to avoid conflict (vllm-project#5873)

* [Misc] Add example for LLaVA-NeXT (vllm-project#5879)

* [BugFix] Fix cuda graph for MLPSpeculator (vllm-project#5875)

Co-authored-by: Abhinav Goyal <abhinav.goyal@flipkart.com>

* [Doc] Add note about context length in Phi-3-Vision example (vllm-project#5887)

* [VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted properly (vllm-project#5880)

Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>

* [Model] Add base class for LoRA-supported models (vllm-project#5018)

* [Bugfix] Fix img_sizes Parsing in Phi3-Vision (vllm-project#5888)

* [CI/Build] [1/3] Reorganize entrypoints tests (vllm-project#5526)

* add collective crash WA

* add comment to the weird mark_step

* [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (vllm-project#5896)

* [doc][misc] add note for Kubernetes users (vllm-project#5916)

* [BugFix] Fix `MLPSpeculator` handling of `num_speculative_tokens` (vllm-project#5876)

* [BugFix] Fix `min_tokens` behaviour for multiple eos tokens (vllm-project#5849)

* [CI/Build] Fix Args for `_get_logits_warper` in Sampler Test (vllm-project#5922)

* [Model] Add Gemma 2 (vllm-project#5908)

* [core][misc] remove logical block (vllm-project#5882)

* [Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X (vllm-project#5932)

* [Hardware][TPU] Optimize KV cache swapping (vllm-project#5878)

* [VLM][BugFix] Make sure that `multi_modal_kwargs` can broadcast properly with ring buffer. (vllm-project#5905)

Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>

* [Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner (vllm-project#5956)

* [Core] Registry for processing model inputs (vllm-project#5214)

Co-authored-by: ywang96 <ywang@roblox.com>

* Unmark fused_moe config json file as executable (vllm-project#5960)

* [Hardware][Intel] OpenVINO vLLM backend (vllm-project#5379)

* [Bugfix] Better error message for MLPSpeculator when `num_speculative_tokens` is set too high (vllm-project#5894)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

* [CI/Build] [2/3] Reorganize entrypoints tests (vllm-project#5904)

* [Distributed] Make it clear that % should not be in tensor dict keys. (vllm-project#5927)

Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>

* [Spec Decode] Introduce DraftModelRunner (vllm-project#5799)

* [Bugfix] Fix compute datatype for cutlass 3.x epilogues (vllm-project#5931)

* [ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Simplify Weight Loading) (vllm-project#5928)

Co-authored-by: Robert Shaw <rshaw@neuralmagic>

* [ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 (vllm-project#5921)

Co-authored-by: Robert Shaw <rshaw@neuralmagic>

* Support Deepseek-V2 (vllm-project#4650)

Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>

* [Bugfix] Only add `Attention.kv_scale` if kv cache quantization is enabled (vllm-project#5936)

* Unmark more files as executable (vllm-project#5962)

* [Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadError (vllm-project#5963)

Co-authored-by: Robert Shaw <rshaw@neuralmagic>

* [Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode (vllm-project#4628)

Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
Co-authored-by: bong-furiosa <bongwon.jang@furiosa.ai>

* [Bugfix][TPU] Fix TPU sampler output (vllm-project#5978)

* [Bugfix][TPU] Fix pad slot id (vllm-project#5977)

* [Bugfix] fix missing last itl in openai completions benchmark (vllm-project#5926)

* [Misc] Extend vLLM Metrics logging API (vllm-project#5925)

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>

* [Kernel] Add punica dimensions for Granite 3b and 8b (vllm-project#5930)

Signed-off-by: Joe Runde <joe@joerun.de>

* [Bugfix] Fix precisions in Gemma 1 (vllm-project#5913)

* [Misc] Update Phi-3-Vision Example (vllm-project#5981)

Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

* [Bugfix] Support `eos_token_id` from `config.json` (vllm-project#5954)

* [Core] Optimize `SequenceStatus.is_finished` by switching to IntEnum (vllm-project#5974)

* [Kernel] Raise an exception in MoE kernel if the batch size is larger then 65k (vllm-project#5939)

* [ CI/Build ] Added E2E Test For Compressed Tensors (vllm-project#5839)

Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic>

* [CI/Build] Add TP test for vision models (vllm-project#5892)

* [ CI/Build ] LM Eval Harness Based CI Testing (vllm-project#5838)

Co-authored-by: Robert Shaw <rshaw@neuralmagic>

* [Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests (vllm-project#5949)

* [CI/Build] Temporarily Remove Phi3-Vision from TP Test (vllm-project#5989)

* [CI/Build] Reuse code for checking output consistency (vllm-project#5988)

* [CI/Build] [3/3] Reorganize entrypoints tests (vllm-project#5966)

* [ci][distributed] fix device count call

[ci][distributed] fix some cuda init that makes it necessary to use spawn (vllm-project#5991)

* [Frontend]: Support base64 embedding (vllm-project#5935)

Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

* [Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules.  (vllm-project#5909)

Co-authored-by: sang <sangcho@anyscale.com>

* [ CI ] Temporarily Disable Large LM-Eval Tests (vllm-project#6005)

Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic>

* [Misc] Fix `get_min_capability` (vllm-project#5971)

* [ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) (vllm-project#5940)

Co-authored-by: Robert Shaw <rshaw@neuralmagic>

* [misc][cuda] use nvml to avoid accidentally cuda initialization (vllm-project#6007)

* [Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (vllm-project#5348)

* Revert test changes

* cleanup

* llm engine cleanup

* utils.py cleanup

* custom ops refactor

* move xops to ops

* remove vllm/hpu/attn_bias.py

* whitespace fix

* revert accidental changes in rmsnorm

* Fix hpugraph hashing

* add trim_attn_metadata comment

* fix prompt bucketing:

* [ CI ] Re-enable Large Model LM Eval (vllm-project#6031)

* [doc][misc] remove deprecated api server in doc (vllm-project#6037)

* [Misc] update benchmark backend for scalellm (vllm-project#6018)

* [doc][misc] further lower visibility of simple api server (vllm-project#6041)

Co-authored-by: Simon Mo <simon.mo@hey.com>

* [Bugfix] Use RayActorError for older versions of Ray in  RayTokenizerGroupPool (vllm-project#6039)

* [Bugfix] adding chunking mechanism to fused_moe to handle large inputs (vllm-project#6029)

* add FAQ doc under 'serving' (vllm-project#5946)

* [Bugfix][Doc] Fix Doc Formatting (vllm-project#6048)

* [Bugfix] Add explicit `end_forward` calls to flashinfer (vllm-project#6044)

* [BugFix] Ensure worker model loop is always stopped at the right time (vllm-project#5987)

* [Frontend] Relax api url assertion for openai benchmarking (vllm-project#6046)

* [Model] Changes to MLPSpeculator to support tie_weights and input_scale (vllm-project#5965)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Joshua Rosenkranz <jmrosenk@us.ibm.com>

* [Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default)  (vllm-project#5602)

* [Frontend] Add template related params to request (vllm-project#5709)

* [VLM] Remove `image_input_type` from VLM config (vllm-project#5852)

Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>

* [Doc] Reinstate doc dependencies (vllm-project#6061)

* guard model loader wa for hpu

---------

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Stephanie <swang@anyscale.com>
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Signed-off-by: Joe Runde <joe@joerun.de>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Co-authored-by: Jianan Gu <jianan.gu@intel.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Jie Fu (傅杰) <jiefu@tencent.com>
Co-authored-by: Allen.Dou <allen.dou@hotmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Kuntai Du <kuntai@uchicago.edu>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: Sanger Steel <sangersteel@gmail.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: leiwen83 <leiwen83@users.noreply.github.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Amit Garg <gargamit@microsoft.com>
Co-authored-by: Charles Riggins <liqianchen123@foxmail.com>
Co-authored-by: Liqian Chen <liqian.chen@deeplang.ai>
Co-authored-by: zhyncs <me@zhyncs.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
Co-authored-by: Bruce Fontaine <bruce@2.7182.net>
Co-authored-by: zifeitong <zifeitong@gmail.com>
Co-authored-by: sroy745 <142070531+sroy745@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
Co-authored-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Co-authored-by: sergey-tinkoff <167607910+sergey-tinkoff@users.noreply.github.com>
Co-authored-by: milo157 <43028253+milo157@users.noreply.github.com>
Co-authored-by: Shukant Pal <SukantK2002@outlook.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: DearPlanet <junsong.zhang2021.work@outlook.com>
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Joshua Rosenkranz <joshua.rosenkranz@gmail.com>
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>
Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Jee Li <pandaleefree@163.com>
Co-authored-by: rohithkrn <rohith.nallamaddi@gmail.com>
Co-authored-by: Murali Andoorveedu <37849411+andoorve@users.noreply.github.com>
Co-authored-by: Woo-Yeon Lee <wooyeonlee0@gmail.com>
Co-authored-by: Matt Wong <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: aws-patlange <90803007+aws-patlange@users.noreply.github.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Stephanie <swang@anyscale.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: sasha0552 <admin@sasha0552.org>
Co-authored-by: Chip Kerchner <49959681+ChipKerchner@users.noreply.github.com>
Co-authored-by: Abhinav Goyal <abhinav.goyal@flipkart.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Ilya Lavrenov <ilya.lavrenov@intel.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Co-authored-by: wangding zeng <155410488+zwd003@users.noreply.github.com>
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
Co-authored-by: bong-furiosa <bongwon.jang@furiosa.ai>
Co-authored-by: mcalman <68564154+mcalman@users.noreply.github.com>
Co-authored-by: William Lin <SolitaryThinker@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: llmpros <10524065+llmpros@users.noreply.github.com>
Co-authored-by: sang <sangcho@anyscale.com>
Co-authored-by: Avshalom Manevich <12231371+avshalomman@users.noreply.github.com>
Co-authored-by: James Whedbee <jamesw@telnyx.com>
Co-authored-by: Joshua Rosenkranz <jmrosenk@us.ibm.com>
Co-authored-by: danieljannai21 <100521221+danieljannai21@users.noreply.github.com>
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 8, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024