[Core] Reduce TTFT with concurrent partial prefills #10235

joerunde · 2024-11-11T21:59:38Z

Replaces #10061, as inspired by @njhill and @comaniac's comments. Co-authored by @prashantgupta24

Context: our customers running large multi-tenanted SaaS deployments of vLLM have a problem where high volumes of small-prompt requests are usually processed smoothly, but quickly pile up in a giant queue when a small number of large-prompt requests are submitted. We see the decoding throughput drop to zero on multiple replicas when this happens.

The current chunked prefill implementation only allows a single sequence to be partially prefilled at a time. This has a few limitations:

Multiple medium-sized prompts must wait to be prefilled serially, increasing TTFT for those at the back of the queue.
A single very large prompt will block all other prompts from prefilling for many iterations. This can eventually starve decoding: for example, a 130k-token prompt with --max-num-batched-tokens=512 will take about 250 iterations to prefill, in which time the currently decoding sequences may all finish. Send a few of these requests at once and very quickly nothing will be decoding.
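
The iteration count quoted above follows from dividing the prompt length by the per-step token budget. A minimal sketch of that arithmetic, assuming the whole budget of every step goes to the one long prompt (in practice decode tokens share the budget, so the real count would be somewhat higher). The function name is illustrative and not part of vLLM:

```python
import math


def prefill_iterations(prompt_tokens: int, max_num_batched_tokens: int) -> int:
    """Scheduler steps needed to prefill one prompt, assuming every step
    spends its full token budget on that prompt (a simplification)."""
    return math.ceil(prompt_tokens / max_num_batched_tokens)


# The example from the description: a 130k-token prompt, 512-token budget.
iterations = prefill_iterations(130_000, 512)
print(iterations)
```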

This PR implements both:

An explicit setting for the number of sequences that can be partially prefilled concurrently. This can be configured with --max-num-partial-prefills=N.
A limit on the number of “very long prompt” sequences that can be prefilled concurrently. This can be configured with:
- --max-long-partial-prefills=N to set the limit on the number of long sequences that can be concurrently prefilled. This defaults to 1 sequence.
- --long-prefill-threshold=x% to set the percentage of the context length above which a sequence is considered "long". This defaults to 4%.
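
To make the "long" classification concrete, here is a small sketch of how a percentage threshold could translate into a token count. It assumes the percentage is applied to the model's maximum context length and compared against the number of prompt tokens still to be prefilled; the function names and the 131072-token context length are illustrative choices, not taken from the PR:

```python
def long_prefill_token_threshold(max_model_len: int,
                                 threshold_pct: float = 4.0) -> int:
    """Token count above which a prefill counts as "long".

    Assumption: the percentage is taken of the maximum context length.
    """
    return int(max_model_len * threshold_pct / 100)


def is_long_prefill(remaining_prompt_tokens: int,
                    max_model_len: int,
                    threshold_pct: float = 4.0) -> bool:
    """True if the request still has more than the threshold left to prefill."""
    threshold = long_prefill_token_threshold(max_model_len, threshold_pct)
    return remaining_prompt_tokens > threshold


# Illustrative values only: a 131072-token context with the default 4%.
print(long_prefill_token_threshold(131072))
print(is_long_prefill(130_000, 131072), is_long_prefill(50, 131072))
```

With these assumed values, only prompts of more than a few thousand tokens are subject to the --max-long-partial-prefills limit; shorter prompts are limited only by --max-num-partial-prefills.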

This is implemented in the v0 scheduler. We’re aware that the v1 implementation is underway and will later become the default, but we need a fix for our customers soon and we hope that what we discover here may help inform a different, better solution in the v1 scheduler.

To test this we created three scenarios, a “medium request” case, a “large request” case, and a “mixed” case.

For the medium request case, we created a subset of the sharegpt dataset with 900 small requests (<50 prompt characters) and 100 of the largest requests (typically between 10k and 20k prompt characters, which we call “medium” sized). We modified the benchmark_serving.py test to not filter out any of the small or large requests, and ran it with this dataset. What we expect to find is similar throughput compared to the main branch, but much lower TTFT on the small requests. Since 10% of the requests are larger than the rest, we should see better TTFT at p90 and below, with comparable TTFT above p90.

For the large request case, we took 990 of the smallest requests from the sharegpt dataset, and then took 10 of the largest requests and duplicated the prompts until they were around 100k characters in length. We ran this in the same way as the medium request case, and here we expect to see smaller TTFT across the board since the small requests will no longer be blocked from prefilling by the few very large requests.

For the mixed case, we used 850 “small”, and 140 “medium” requests, as well as 10 "large" requests where we duplicated the prompts up to 200k characters.
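
For anyone wanting to reproduce a similar dataset, the "large request" case could be assembled roughly as follows. This is a sketch under assumptions, not the script we used: it works on plain prompt strings, the helper names are hypothetical, and it assumes the input has enough prompts that the smallest and largest picks do not overlap.

```python
def grow_prompt(prompt: str, target_chars: int) -> str:
    """Duplicate a prompt until it is at least target_chars long."""
    if not prompt:
        raise ValueError("cannot grow an empty prompt")
    repeats = -(-target_chars // len(prompt))  # ceiling division
    return prompt * repeats


def build_large_request_case(prompts: list[str],
                             num_small: int = 990,
                             num_large: int = 10,
                             target_chars: int = 100_000) -> list[str]:
    """Take the smallest prompts as-is and pad the largest by duplication."""
    by_length = sorted(prompts, key=len)
    small = by_length[:num_small]
    large = [grow_prompt(p, target_chars) for p in by_length[-num_large:]]
    return small + large


# Synthetic stand-in for the real dataset: prompts of 1 to 2000 characters.
fake_prompts = ["a" * n for n in range(1, 2001)]
case = build_large_request_case(fake_prompts)
print(len(case), min(len(p) for p in case[-10:]))
```

The sketch returns the small requests first and the padded ones last; the order in which requests are actually submitted is left to the benchmark.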

All tests were run on a single 80GB A100, with the command:

python benchmarks/benchmark_serving.py --model meta-llama/Meta-Llama-3.1-8B-Instruct --dataset-path ${test_case} --metric-percentiles 80,85,90,95,99 --request-rate 12

We ran the tests against the main branch (commit 874f551b3626321f6bf9a902b8fd9fc1fa7c7f2e), as well as this PR with the new optimization both disabled (--max-num-partial-prefills=1) and enabled (--max-num-partial-prefills=4).

The results are shown here:

The TTFT improvements are easy to see: in the medium case we cut the p90 TTFT in half, and in the large case we cut it by nearly 30x. In both cases we did not measure a throughput drop when run with --max-num-partial-prefills=1, and the throughput drop with --max-num-partial-prefills=4 is minimal.

Surprisingly, along with the massive TTFT improvements in the "mixed" test case, we also see a 4% throughput improvement (3506 tokens/s up from 3368 tokens/s). Based on the fact that ITL still looks a little slower, it seems that the throughput is higher simply because more requests were able to be successfully scheduled at the same time.

cc @rickyyx


comaniac · 2024-11-12T23:47:55Z

vllm/config.py

@@ -1085,6 +1085,7 @@ def __init__(self,
                 max_num_batched_tokens: Optional[int],
                 max_num_seqs: int,
                 max_model_len: int,
+                 num_prefill_slots: int = 1,


Is this actually the "maximum number of prefill sequences in a batch"? If so, could we name it something more informative, like max_num_batched_prefill_seqs?

It's technically only the number of partial prefills allowed in a batch. You could still have, say, 100 sequence groups with 5 prompt tokens each all scheduled in a single step here.

max_num_partial_prefills?

comaniac · 2024-11-12T23:51:26Z

vllm/core/scheduler.py

+        # Requests with more than (4% max context length) tokens to prefill
+        # are "big".


Why this definition and threshold?

The entire goal here is to not allow decode to be starved by the prefill phase blocking on long requests. See this part of the PR description:

A single very large prompt will block all other prompts from prefilling for many iterations. This can eventually starve decoding: for example, a 130k-token prompt with --max-num-batched-tokens=512 will take about 250 iterations to prefill, in which time the currently decoding sequences may all finish. Send a few of these requests at once and very quickly nothing will be decoding.

Just allowing concurrent partial prefills doesn't solve the problem by itself, because multiple long requests could still block up the prefill. So what we do is allow only a single long request to prefill at a time, and pull smaller requests from the waiting queue instead of more long ones.
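
A minimal sketch of that admission rule, for illustration only. It is not the scheduler code from this PR: it ignores the per-step token budget and KV-cache availability, and the Request class and function name are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    prompt_tokens: int


def admit_partial_prefills(waiting: deque,
                           running_prefills: list,
                           max_num_partial_prefills: int,
                           max_long_partial_prefills: int,
                           long_threshold: int) -> list:
    """Pick waiting requests that may start prefilling in this step.

    Long requests beyond the long-prefill limit are skipped but keep their
    place in the queue, so shorter requests behind them can still be admitted.
    """
    free_slots = max(0, max_num_partial_prefills - len(running_prefills))
    long_in_flight = sum(r.prompt_tokens > long_threshold
                         for r in running_prefills)
    admitted = []
    for request in list(waiting):  # iterate over a snapshot in FIFO order
        if len(admitted) >= free_slots:
            break
        is_long = request.prompt_tokens > long_threshold
        if is_long and long_in_flight >= max_long_partial_prefills:
            continue  # leave it waiting; try the next, possibly shorter, one
        admitted.append(request)
        long_in_flight += int(is_long)
    for request in admitted:
        waiting.remove(request)
    return admitted


waiting = deque([
    Request("long-1", 100_000),
    Request("long-2", 120_000),
    Request("small-1", 20),
    Request("small-2", 30),
])
admitted = admit_partial_prefills(waiting, [], 4, 1, 5_242)
print([r.request_id for r in admitted], [r.request_id for r in waiting])
```

In this toy run the second long request stays in the queue while both small requests start prefilling alongside the first long one, which is the behavior described above.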

joerunde · 2024-11-14T20:37:58Z

tests/core/test_chunked_prefill_scheduler.py

+
+@pytest.mark.parametrize("model", ["facebook/opt-125m"])
+@pytest.mark.parametrize("max_num_partial_prefills", [2, 4, 8])
+def test_chunked_prefill_with_actual_engine(model: str,


cc @rickyyx here's what we tried to do to test that the sampler doesn't throw any assertions: we put multiple prompts into an engine and manually step it forward with them all partially prefilled.

mergify · 2024-11-20T10:59:43Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @joerunde.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Joe Runde <[email protected]>

Signed-off-by: Prashant Gupta <[email protected]>

joerunde · 2025-01-13T19:32:48Z

Thanks @schoennenbeck!

Our teams are also now testing this after getting back from the holidays. I'm thinking I might move the configs to an environment variable like VLLM_V0_MAX_PARTIAL_PREFILLS, since this change doesn't affect the incoming v1 scheduler and I don't want to cause confusion by creating new CLI args that are immediately deprecated.

joerunde · 2025-01-17T23:26:10Z

Circling back on this: I've re-run the Large Requests benchmarks with this PR rebased on current main, and have also compared against the V1 scheduler.

The numbers have generally improved since this was first benchmarked, but this change still handily beats both main and V1 in TTFT and E2E latency. I might lean towards keeping these CLI args and adding this logic into the V1 scheduler- but of course will need more input on that as I know we're trying to keep V1 as simple as possible.

cc @comaniac @njhill, do y'all have thoughts on getting this in?

njhill · 2025-02-05T01:12:22Z

@comaniac WDYT about getting this PR in for v0? It makes a big difference in some of the IBM mixed production workloads, and per other comments above helps many others seeing similar problems.

We should probably extend our performance tests to incorporate these kinds of mixed workloads.

comaniac · 2025-02-05T01:20:01Z

No objection given the updated benchmark results. The v1 scheduler also supports multiple partial prefills as of today (#12674), although it doesn't support the configuration options introduced by this PR. I'll take another pass tomorrow.

njhill · 2025-02-05T02:31:24Z

Thanks @comaniac! Yes that's great re V1, I'm hoping that may allow us to incorporate similar logic there in a less invasive way.

comaniac

Otherwise LGTM

joerunde · 2025-02-06T22:05:56Z

Thanks for the next pass @comaniac! I'll get on these as soon as I can, though it might be Monday since I'm about to head out for a long weekend 🍹🍹🍹

comaniac

LGTM. We could merge once the CI is green

joerunde · 2025-02-13T16:33:34Z

Thanks @comaniac!

I'll also run some final benchmarks to make sure it's still fast before merging

joerunde · 2025-02-14T23:28:22Z

Quick final check on e2e latency on the "large sharegpt" benchmark:

| E2E latency | 0.7.2 | This PR | This PR @4 concurrent prefills |
|---|---|---|---|
| median (ms) | 218 | 218 | 222 |
| p90 (ms) | 3683 | 3691 | 1576 |
| p95 (ms) | 5133 | 5144 | 1747 |
| p99 (ms) | 6449 | 6457 | 2522 |
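
For a quick read of the table, the ratio between the 0.7.2 column and the 4-concurrent-prefills column can be computed directly from the reported numbers. This is just arithmetic on the values above, with no new measurements:

```python
# E2E latency in milliseconds, copied from the table above.
release_0_7_2 = {"median": 218, "p90": 3683, "p95": 5133, "p99": 6449}
concurrent_4 = {"median": 222, "p90": 1576, "p95": 1747, "p99": 2522}

# Ratio > 1 means the 4-concurrent-prefills run was faster than 0.7.2.
speedup = {name: release_0_7_2[name] / concurrent_4[name]
           for name in release_0_7_2}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.2f}x")
```

The median is essentially unchanged, while the tail percentiles improve by a factor of roughly two to three.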

Looking ⚡⚡⚡!

njhill · 2025-02-14T23:29:48Z

Thanks @joerunde @comaniac!!!

XHF tunings Signed-off-by: Randall Smith <[email protected]> * remove OAM from file name Signed-off-by: Randall Smith <[email protected]> * rmeove OAM from file names Signed-off-by: Randall Smith <[email protected]> * yapf Signed-off-by: Randall Smith <[email protected]> * update config name Signed-off-by: Randall Smith <[email protected]> * remove benchmark_moe.py changes Signed-off-by: Randall Smith <[email protected]> * remove is_contiguous Signed-off-by: Randall Smith <[email protected]> * use more recent fp8_utils.py Signed-off-by: Randall Smith <[email protected]> * remove is_contiguous Signed-off-by: Randall Smith <[email protected]> --------- Signed-off-by: Randall Smith <[email protected]> Co-authored-by: qli88 <[email protected]> * Optimize moe_align_block_size for deepseek_v3 (vllm-project#12850) Signed-off-by: mgoin <[email protected]> * [Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels (vllm-project#13198) Signed-off-by: Tyler Michael Smith <[email protected]> * Revert "Add label if pre-commit passes" (vllm-project#13242) * [ROCm] Avoid using the default stream on ROCm (vllm-project#13238) Signed-off-by: Gregory Shtrasberg <[email protected]> * [Kernel] Fix awq error when n is not divisable by 128 (vllm-project#13227) * [V1] Consolidate MM cache size to vllm.envs (vllm-project#13239) * [Bugfix/CI] Turn test_compressed_tensors_2of4_sparse back on (vllm-project#13250) * [Bugfix][CI] Inherit codespell settings from pyproject.toml in the pre-commit-config (vllm-project#13237) * [Bugfix] Offline example of disaggregated prefill (vllm-project#13214) * [Misc] Remove redundant statements in scheduler.py (vllm-project#13229) * Consolidate Llama model usage in tests (vllm-project#13094) * Expand MLA to support most types of quantization (vllm-project#13181) * [V1] LoRA - Enable Serving Usecase (vllm-project#12883) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> * [ROCm][V1] Add 
intial ROCm support to V1 (vllm-project#12790) * [Bugfix][V1] GPUModelRunner._update_states should return True when there is a finished request in batch (vllm-project#13126) * [WIP] TPU V1 Support Refactored (vllm-project#13049) * [Frontend] Optionally remove memory buffer used for uploading to URLs in run_batch (vllm-project#12927) Signed-off-by: Pooya Davoodi <[email protected]> * [Bugfix] Fix missing parentheses (vllm-project#13263) * [Misc] Log time consumption of sleep and wake-up (vllm-project#13115) Signed-off-by: Jun Duan <[email protected]> * [VLM] Keep track of whether prompt replacements have been applied (vllm-project#13215) * [V1] Simplify GPUModelRunner._update_states check (vllm-project#13265) * Support logit_bias in v1 Sampler (vllm-project#13079) * [Core] choice-based structured output with xgrammar (vllm-project#12632) * [Hardware][Gaudi][Bugfix] Fix error for guided decoding (vllm-project#12317) * Removing bad config (#425) * The order in the file is important. One needs to be explicitly be added to each following path for their ownership to apply (#427) * [Quant][Perf] Use moe_wna16 kernel by default for MoEs with many experts (vllm-project#13236) Signed-off-by: mgoin <[email protected]> * [Core] Reduce TTFT with concurrent partial prefills (vllm-project#10235) Signed-off-by: Joe Runde <[email protected]> Signed-off-by: Prashant Gupta <[email protected]> Co-authored-by: Prashant Gupta <[email protected]> Co-authored-by: Cody Yu <[email protected]> * [V1][Core] min_p sampling support (vllm-project#13191) Signed-off-by: Aoyu <[email protected]> Co-authored-by: Aoyu <[email protected]> * [V1][CI] Fix failed v1-test because of min_p (vllm-project#13316) Signed-off-by: Woosuk Kwon <[email protected]> * [V1][Sampler] Don't apply temp for greedy-only (vllm-project#13311) Signed-off-by: Nick Hill <[email protected]> * [V1][PP] Fix memory profiling in PP (vllm-project#13315) Signed-off-by: Woosuk Kwon <[email protected]> * [Bugfix][AMD] Update 
torch_bindings so that scaled_fp4_quant isn't build on ROCm (vllm-project#13235) * [Bugfix][Docs] Fix offline Whisper (vllm-project#13274) * [Bugfix] Massage MLA's usage of flash attn for RoCM (vllm-project#13310) * [BugFix] Don't scan entire cache dir when loading model (vllm-project#13302) * [Bugfix]Fix search start_index of stop_checker (vllm-project#13280) * [Bugfix] Fix qwen2.5-vl image processor (vllm-project#13286) * [V1][Metrics] Add iteration_tokens_total histogram from V0 (vllm-project#13288) * [AMD] [Model] DeepSeek tunings (vllm-project#13199) * [V1][PP] Run engine busy loop with batch queue (vllm-project#13064) * [ci/build] update flashinfer (vllm-project#13323) * [Doc] [2/N] Add Fuyu E2E example for multimodal processor (vllm-project#13331) * [V1][Spec Decode] Ngram Spec Decode (vllm-project#12193) Signed-off-by: LiuXiaoxuanPKU <[email protected]> * [Quant] Add `SupportsQuant` to phi3 and clip (vllm-project#13104) * [Bugfix] Pin xgrammar to 0.1.11 (vllm-project#13338) * avoid calling hf_list_repo_files for local model Signed-off-by: isotr0py <[email protected]> * annotation Signed-off-by: isotr0py <[email protected]> * [BugFix] Enhance test_pos_encoding to support execution on multi-devices (vllm-project#13187) Signed-off-by: wchen61 <[email protected]> * [V1] Update doc and examples for H2O-VL (vllm-project#13349) Signed-off-by: Roger Wang <[email protected]> * [ci] skip failed tests for flashinfer (vllm-project#13352) Signed-off-by: youkaichao <[email protected]> * [platform] add base class for communicators (vllm-project#13208) Signed-off-by: youkaichao <[email protected]> * [Bugfix] Fix 2 Node and Spec Decode tests (vllm-project#13341) Signed-off-by: DarkLight1337 <[email protected]> * [Docs] Change myenv to vllm. 
Update python_env_setup.inc.md (vllm-project#13325) * [V1][BugFix] Add __init__.py to v1/spec_decode/ (vllm-project#13359) Signed-off-by: Woosuk Kwon <[email protected]> * [V1][PP] Cache Intermediate Tensors (vllm-project#13353) Signed-off-by: Woosuk Kwon <[email protected]> * [Bugfix][Platform][CPU] Fix cuda platform detection on CPU backend edge case (vllm-project#13358) Signed-off-by: Isotr0py <[email protected]> * [V1][BugFix] Clean up rejection sampler & Fix warning msg (vllm-project#13362) Signed-off-by: Woosuk Kwon <[email protected]> * [V1][Misc] Avoid unnecessary log output (vllm-project#13289) * [Feature][Spec Decode] Simplify the use of Eagle Spec Decode (vllm-project#12304) Signed-off-by: Shangming Cai <[email protected]> * Fix spelling error in index.md (vllm-project#13369) * Run v1 benchmark and integrate with PyTorch OSS benchmark database (vllm-project#13068) Signed-off-by: Huy Do <[email protected]> * [MISC] tiny fixes (vllm-project#13378) * [VLM] Check required fields before initializing field config in `DictEmbeddingItems` (vllm-project#13380) * [Model] Support Mamba2 (Codestral Mamba) (vllm-project#9292) Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Yu Chin Fabian Lim <[email protected]> * [Bugfix] fix xpu communicator (vllm-project#13368) Signed-off-by: yan ma <[email protected]> * [Bugfix] Fix VLLM_USE_MODELSCOPE issue (vllm-project#13384) * Updating PR template to point people to the upstream repo. Updating codeowners (#431) * Enabling the ROCm-vLLM CI on MI250 machines (#432) * Enabling ROCm CI on MI250 machines: - correct build target - correct queue Signed-off-by: Alexei V. Ivanov <[email protected]> --------- Signed-off-by: Alexei V. 
Ivanov <[email protected]> * Optimization for quantized gemm skinny sizes (#411) * Optimization for quantized gemm skinny sizes * lint fix * Add support for bf16/fp16 * code cleanup * code cleanup * lint fix2 * cleanup * Moved the logic into tuned gemm to preserve API compatibility --------- Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> * Restricting FP8 wvSplitk to MI300x (#439) * Remove mi300a (#440) * Removing gfx940 and gfx941 targets. These have been deprecated in favor of gfx942 for MI300X Signed-off-by: Gregory Shtrasberg <[email protected]> * Remove from custom kernels as well --------- Signed-off-by: Gregory Shtrasberg <[email protected]> * resolve diff for mixtral8x7B configs (#437) Signed-off-by: Divakar Verma <[email protected]> * Torch version bump to fix tunable ops (#442) * Advance torch commit to be past pytorch/pytorch#144942 to fix tunable ops * Make sure to use the submodule commit compatible with the main aiter commit * bugfix: remove unused argument passed to the forward pass of ReplicatedLinear layer Signed-off-by: vllmellm <[email protected]> --------- Signed-off-by: Aleksandr Malyshev <[email protected]> Signed-off-by: Harry Mellor <[email protected]> Signed-off-by: mgoin <[email protected]> Signed-off-by: Kyle Sayers <[email protected]> Signed-off-by: youkaichao <[email protected]> Signed-off-by: Gregory Shtrasberg <[email protected]> Signed-off-by: Lu Fang <[email protected]> Signed-off-by: Lucas Wilkinson <[email protected]> Signed-off-by: EC2 Default User <[email protected]> Signed-off-by: <> Signed-off-by: Varun Sundar Rabindranath <[email protected]> Signed-off-by: Yu Chin Fabian Lim <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: Zhao Ke <[email protected]> Signed-off-by: Zifei Tong <[email protected]> Signed-off-by: Sanju C Sudhakaran <[email protected]> Signed-off-by: Shangming Cai 
<[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Signed-off-by: mgoin <[email protected]> Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Yuan Tang <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <[email protected]> Signed-off-by: Farzad Abdolhosseini <[email protected]> Signed-off-by: kevin <[email protected]> Signed-off-by: simon-mo <[email protected]> Signed-off-by: Florian Greinacher <[email protected]> Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Ce Gao <[email protected]> Signed-off-by: Mengqing Cao <[email protected]> Signed-off-by: YuhongGuo <[email protected]> Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: Mark McLoughlin <[email protected]> Signed-off-by: Hollow Man <[email protected]> Signed-off-by: Keyun Tong <[email protected]> Signed-off-by: Lingfan Yu <[email protected]> Signed-off-by: andoorve <[email protected]> Signed-off-by: Aoyu <[email protected]> Signed-off-by: Randall Smith <[email protected]> Signed-off-by: Pooya Davoodi <[email protected]> Signed-off-by: Jun Duan <[email protected]> Signed-off-by: Joe Runde <[email protected]> Signed-off-by: Prashant Gupta <[email protected]> Signed-off-by: LiuXiaoxuanPKU <[email protected]> Signed-off-by: isotr0py <[email protected]> Signed-off-by: wchen61 <[email protected]> Signed-off-by: Roger Wang <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Huy Do <[email protected]> Signed-off-by: yan ma <[email protected]> Signed-off-by: Alexei V. 
Ivanov <[email protected]> Signed-off-by: Divakar Verma <[email protected]> Signed-off-by: vllmellm <[email protected]> Co-authored-by: Aleksandr Malyshev <[email protected]> Co-authored-by: Aleksandr Malyshev <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Dipika Sikka <[email protected]> Co-authored-by: Kyle Sayers <[email protected]> Co-authored-by: mgoin <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Akash kaothalkar <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Chen Zhang <[email protected]> Co-authored-by: Sanju C Sudhakaran <[email protected]> Co-authored-by: Rahul Tuli <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Russell Bryant <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Lucas Wilkinson <[email protected]> Co-authored-by: Lu Fang <[email protected]> Co-authored-by: Sumit Vij <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: Kevin H. 
Luu <[email protected]> Co-authored-by: EC2 Default User <[email protected]> Co-authored-by: arakowsk-amd <[email protected]> Co-authored-by: Jitse Klomp <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Yu Chin Fabian Lim <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Co-authored-by: ZSL98 <[email protected]> Co-authored-by: zhangshulai <[email protected]> Co-authored-by: Szymon Ożóg <[email protected]> Co-authored-by: Maximilien de Bayser <[email protected]> Co-authored-by: Amit Garg <[email protected]> Co-authored-by: afeldman-nm <[email protected]> Co-authored-by: [email protected] <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Divakar Verma <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> Co-authored-by: Ke Zhao <[email protected]> Co-authored-by: zifeitong <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Shaoting <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Co-authored-by: Jun Duan <[email protected]> Co-authored-by: Liangfu Chen <[email protected]> Co-authored-by: shangmingc <[email protected]> Co-authored-by: Patrick von Platen <[email protected]> Co-authored-by: mgoin <[email protected]> Co-authored-by: Yuan Tang <[email protected]> Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <[email protected]> Co-authored-by: Farzad Abdolhosseini <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Florian Greinacher <[email protected]> Co-authored-by: Ce Gao <[email protected]> Co-authored-by: Cody Yu <[email protected]> Co-authored-by: Mengqing Cao <[email protected]> Co-authored-by: Yuhong Guo <[email protected]> Co-authored-by: Mark McLoughlin <[email protected]> Co-authored-by: Jewon Lee <[email 
protected]> Co-authored-by: MoonRide303 <[email protected]> Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <[email protected]> Co-authored-by: sky0530 <[email protected]> Co-authored-by: Li, Jiang <[email protected]> Co-authored-by: Keyun Tong <[email protected]> Co-authored-by: Christian Pinto <[email protected]> Co-authored-by: Lingfan Yu <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Shiyan Deng <[email protected]> Co-authored-by: bnellnm <[email protected]> Co-authored-by: Rafael Vasquez <[email protected]> Co-authored-by: Adrian Abeyta <[email protected]> Co-authored-by: AdrianAbeyta <[email protected]> Co-authored-by: Qubitium-ModelCloud <[email protected]> Co-authored-by: Yida Wu <[email protected]> Co-authored-by: Murali Andoorveedu <[email protected]> Co-authored-by: Kaixi Hou <[email protected]> Co-authored-by: LikeSundayLikeRain <[email protected]> Co-authored-by: Daniel Han <[email protected]> Co-authored-by: Rui Qiao <[email protected]> Co-authored-by: Aoyu <[email protected]> Co-authored-by: Aoyu <[email protected]> Co-authored-by: 燃 <[email protected]> Co-authored-by: Vaibhav Jain <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]> Co-authored-by: rasmith <[email protected]> Co-authored-by: qli88 <[email protected]> Co-authored-by: Jinzhen Lin <[email protected]> Co-authored-by: XiaobingZhang <[email protected]> Co-authored-by: Wang Ran (汪然) <[email protected]> Co-authored-by: Sage Moore <[email protected]> Co-authored-by: Kero Liang <[email protected]> Co-authored-by: Alexander Matveev <[email protected]> Co-authored-by: Pooya Davoodi <[email protected]> Co-authored-by: Xu Song <[email protected]> Co-authored-by: Yu-Zhou <[email protected]> Co-authored-by: Joe Runde <[email protected]> Co-authored-by: Prashant Gupta <[email protected]> Co-authored-by: Lily Liu <[email protected]> Co-authored-by: isotr0py <[email protected]> Co-authored-by: wchen61 <[email protected]> 
Co-authored-by: 凌 <[email protected]> Co-authored-by: yankooo <[email protected]> Co-authored-by: Huy Do <[email protected]> Co-authored-by: Yu Chin Fabian Lim <[email protected]> Co-authored-by: Yan Ma <[email protected]> Co-authored-by: r.4ntix <[email protected]> Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]> Co-authored-by: Hashem Hashemi <[email protected]> Co-authored-by: vllmellm <[email protected]>

joerunde commented Nov 11, 2024

vllm/model_executor/layers/sampler.py (resolved)

joerunde commented Nov 11, 2024

vllm/core/scheduler.py (outdated, resolved)

prashantgupta24 reviewed Nov 12, 2024

vllm/core/scheduler.py (outdated, resolved)

prashantgupta24 reviewed Nov 12, 2024

vllm/core/scheduler.py (outdated, resolved)

comaniac requested changes Nov 13, 2024

mergify bot added the frontend label Nov 13, 2024

joerunde marked this pull request as ready for review November 14, 2024 20:36

joerunde requested review from WoosukKwon, zhuohan123, youkaichao, alexm-redhat and njhill as code owners November 14, 2024 20:36

joerunde commented Nov 14, 2024

ywang96 added the "ready" label ("ONLY add when PR is ready to merge/full CI is needed") Nov 14, 2024

prashantgupta24 reviewed Nov 15, 2024

tests/core/test_chunked_prefill_scheduler.py (outdated, resolved)

mergify bot added the needs-rebase label Nov 20, 2024

joerunde and others added 12 commits November 20, 2024 10:02

🐛 fix multi-chunked-prefill sampler bug

f97eacf

Signed-off-by: Joe Runde <[email protected]>

🚧 add num_prefill_slots arg

b50a6b8

Signed-off-by: Prashant Gupta <[email protected]>

✨ start to write prefill slot logic

7f23c04

Signed-off-by: Joe Runde <[email protected]>

🎨 format

d271cc9

Signed-off-by: Prashant Gupta <[email protected]>

✨ update num tokens for prefill slots

b2cb96f

Signed-off-by: Joe Runde <[email protected]>

♻️ add schedule_chunked_prefill logic

c349ac0

Signed-off-by: Prashant Gupta <[email protected]>

♻️ change function name

e20518d

Signed-off-by: Prashant Gupta <[email protected]>

✨ reserve incoming prefill slots

6ba0e34

Signed-off-by: Joe Runde <[email protected]>

🎨 fix some typos

a7491cc

Signed-off-by: Prashant Gupta <[email protected]>

⚡ finish awesome scheduler

1ee6fea

Signed-off-by: Joe Runde <[email protected]>

🐛 fix the deadlocks

517915a

Signed-off-by: Joe Runde <[email protected]>

📝 Add more docstrings

ed298c3

Signed-off-by: Joe Runde <[email protected]>

Merge remote-tracking branch 'upstream/main' into prefill-slots

6de9b56

joerunde added 3 commits January 21, 2025 16:34

Merge remote-tracking branch 'upstream/main' into prefill-slots

c1ef186

Merge remote-tracking branch 'upstream/main' into prefill-slots

5977716

🎨 fmt

5fb2196

Signed-off-by: Joe Runde <[email protected]>

njhill mentioned this pull request Feb 5, 2025

[Core] Add dynamic chunk size calculation #10061

Closed

comaniac reviewed Feb 5, 2025

joerunde and others added 5 commits February 12, 2025 10:45

Merge remote-tracking branch 'upstream/main' into prefill-slots

5840994

🔧 update config and tests

ef25a0d

Signed-off-by: Joe Runde <[email protected]>

⏪ revert style change

1d890af

Signed-off-by: Joe Runde <[email protected]>

♻️ cannot -> can

cf33c63

Signed-off-by: Joe Runde <[email protected]>

Update vllm/core/scheduler.py

edb461e

Signed-off-by: Joe Runde <[email protected]> Co-authored-by: Cody Yu <[email protected]>

joerunde force-pushed the prefill-slots branch from 25247e2 to edb461e February 12, 2025 23:31

joerunde added 2 commits February 12, 2025 16:37

🎨 fmt

686f035

Signed-off-by: Joe Runde <[email protected]>

🐛 fixup renamed fn ref

dad07a8

Signed-off-by: Joe Runde <[email protected]>

comaniac approved these changes Feb 13, 2025

comaniac merged commit 3bcb8c7 into vllm-project:main Feb 14, 2025
38 checks passed

joerunde deleted the prefill-slots branch February 14, 2025 23:41

joerunde commented Nov 11, 2024 (edited)

github-actions bot commented Nov 11, 2024

comaniac Nov 12, 2024

joerunde Nov 13, 2024

comaniac Nov 12, 2024

joerunde Nov 13, 2024

joerunde Nov 14, 2024

mergify bot commented Nov 20, 2024

joerunde commented Jan 13, 2025

joerunde commented Jan 17, 2025

njhill commented Feb 5, 2025

comaniac commented Feb 5, 2025 (edited)

njhill commented Feb 5, 2025

comaniac left a comment

joerunde commented Feb 6, 2025

comaniac left a comment

joerunde commented Feb 13, 2025

joerunde commented Feb 14, 2025 (edited)

njhill commented Feb 14, 2025

		# Requests with more than (4% max context length) tokens to prefill
		# are "big".
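
As a rough illustration of the comment above, the sketch below shows how a "big" (long) prefill could be classified and how the two concurrency limits from the PR description might gate admission. It assumes the 4% default and the limit semantics described in the PR description. The names `PartialPrefillLimits`, `is_long_prefill`, and `can_start_prefill` are hypothetical and are not taken from vllm/core/scheduler.py.

```python
from dataclasses import dataclass


@dataclass
class PartialPrefillLimits:
    """Hypothetical container for the settings named in the PR description."""

    max_model_len: int                     # maximum context length, in tokens
    max_num_partial_prefills: int = 1      # sequences that may prefill concurrently
    max_long_partial_prefills: int = 1     # "big" sequences that may prefill concurrently
    long_prefill_fraction: float = 0.04    # assumed default: 4% of the context length

    @property
    def long_prefill_token_threshold(self) -> int:
        # Token count above which a request counts as "big".
        return int(self.max_model_len * self.long_prefill_fraction)


def is_long_prefill(remaining_prompt_tokens: int, limits: PartialPrefillLimits) -> bool:
    # "More than" the threshold, matching the wording of the code comment.
    return remaining_prompt_tokens > limits.long_prefill_token_threshold


def can_start_prefill(
    remaining_prompt_tokens: int,
    active_prefills: int,
    active_long_prefills: int,
    limits: PartialPrefillLimits,
) -> bool:
    # No free prefill slot at all.
    if active_prefills >= limits.max_num_partial_prefills:
        return False
    # A slot is free, but the quota for "big" prompts is already used up.
    if (is_long_prefill(remaining_prompt_tokens, limits)
            and active_long_prefills >= limits.max_long_partial_prefills):
        return False
    return True


limits = PartialPrefillLimits(max_model_len=131072, max_num_partial_prefills=4)
```

Under these assumptions, a second 130k-token prompt would wait while one long prompt is already prefilling, but a 200-token prompt could still take one of the remaining slots, which is the behavior the PR description is aiming for.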
