vLLM, a high-throughput inference engine for large language models, is contending with a critical bug affecting FP8 model configurations while continuing to push performance boundaries through kernel optimizations and enhanced quantization support.
Recent activity highlights significant performance-optimization work, particularly for AMD GPUs and CUDA graphs. However, a critical issue (#8641) affecting FP8 models that use FlashInfer with an FP8 KV cache signals potential disruption for users who rely on that configuration. The development team remains active, with contributions focused on bug fixes, feature enhancements, and infrastructure improvements.
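For context, the configuration implicated in #8641 combines the FlashInfer attention backend with an FP8 checkpoint and an FP8 KV cache. A minimal sketch of that setup follows; the model name is illustrative rather than taken from the issue, and exact reproduction details live in the report itself:

```python
# Sketch of the FP8 + FlashInfer setup that issue #8641 concerns.
# The checkpoint name is illustrative; VLLM_ATTENTION_BACKEND and
# kv_cache_dtype are standard vLLM knobs.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # select the FlashInfer backend

from vllm import LLM, SamplingParams

llm = LLM(
    model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8",  # illustrative FP8 checkpoint
    kv_cache_dtype="fp8",                              # FP8 KV cache, per the issue title
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))
```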
Recent issues reveal ongoing challenges with performance and compatibility; the most notable are listed in the issues section below.
These issues suggest a need for improved stability and documentation to support diverse configurations.
Recently active contributors include:
- ywang96
- njhill
- Isotr0py
- simon-mo
- charlifu
The team is actively enhancing performance through kernel optimizations and addressing bugs to improve user experience.
Timespan | Opened | Closed | Comments | Labeled | Milestones |
---|---|---|---|---|---|
7 Days | 98 | 49 | 275 | 0 | 1 |
14 Days | 215 | 90 | 749 | 0 | 1 |
30 Days | 362 | 173 | 1245 | 0 | 1 |
All Time | 4497 | 2922 | - | - | - |
Like all software activity quantification, these numbers are imperfect but sometimes useful. The Comments, Labeled, and Milestones columns refer to issues opened within the timespan in question.
Developer | Branches | PRs | Commits | Files | Changes |
---|---|---|---|---|---|
Dipika Sikka | 5 | 5/3/1 | 11 | 47 | 6998 |
Cyrus Leung | 4 | 9/8/0 | 20 | 175 | 5154 |
Lucas Wilkinson (LucasWilkinson) | 1 | 1/0/0 | 2 | 29 | 4831 |
Alexander Matveev | 2 | 5/4/0 | 7 | 47 | 3733 |
Mor Zusman (mzusman) | 1 | 1/0/0 | 1 | 20 | 2846 |
Kyle Mistele | 1 | 2/2/0 | 3 | 26 | 2765 |
Nick Hill | 4 | 6/6/0 | 15 | 45 | 2562 |
Tyler Michael Smith | 2 | 2/1/0 | 14 | 24 | 2558 |
Michael Goin | 6 | 3/1/0 | 8 | 21 | 2525 |
Lily Liu | 1 | 1/1/0 | 2 | 21 | 2388 |
Isotr0py | 4 | 9/9/0 | 18 | 37 | 2311 |
Patrick von Platen | 2 | 4/3/0 | 5 | 26 | 2024 |
Peter Salas | 4 | 2/0/0 | 4 | 41 | 1973 |
Charlie Fu | 1 | 2/2/0 | 2 | 8 | 1916 |
Alex Brooks | 1 | 2/2/0 | 3 | 11 | 1774 |
afeldman-nm | 2 | 1/0/0 | 3 | 106 | 1631 |
Yang Fan | 1 | 0/0/0 | 1 | 14 | 1562 |
youkaichao | 4 | 10/10/0 | 22 | 48 | 1504 |
Wenxiang | 1 | 1/1/0 | 2 | 13 | 1339 |
Simon Mo | 2 | 8/8/0 | 12 | 18 | 1235 |
Yangshen⚡Deng | 1 | 0/0/0 | 1 | 21 | 1101 |
Robert Shaw | 6 | 0/0/0 | 15 | 23 | 878 |
Roger Wang | 6 | 6/6/0 | 17 | 21 | 877 |
bnellnm | 2 | 4/1/1 | 4 | 32 | 846 |
Geun, Lim | 1 | 1/1/0 | 1 | 6 | 835 |
Megha Agarwal | 1 | 0/0/0 | 1 | 21 | 834 |
Yohan Na | 1 | 0/0/0 | 1 | 6 | 825 |
sroy745 | 1 | 1/1/0 | 3 | 17 | 797 |
Shawn Tan | 1 | 1/0/0 | 1 | 4 | 792 |
William Lin | 2 | 2/1/0 | 8 | 29 | 781 |
Luka Govedič | 2 | 1/0/0 | 2 | 16 | 749 |
Li, Jiang | 1 | 0/0/0 | 1 | 18 | 729 |
chenqianfzh | 2 | 1/1/0 | 2 | 6 | 725 |
Woosuk Kwon | 6 | 4/3/0 | 17 | 16 | 693 |
Cody Yu | 2 | 2/2/0 | 7 | 18 | 672 |
Antoni Baum | 2 | 0/0/0 | 3 | 20 | 667 |
Pavani Majety | 2 | 0/0/0 | 3 | 11 | 639 |
ElizaWszola | 1 | 0/0/0 | 1 | 12 | 638 |
Jungho Christopher Cho | 1 | 0/0/0 | 1 | 9 | 629 |
rasmith | 2 | 4/1/1 | 2 | 5 | 528 |
Jiaxin Shan | 1 | 1/1/0 | 2 | 11 | 357 |
Harsha vardhan manoj Bikki | 2 | 0/0/0 | 2 | 9 | 329 |
ywfang | 1 | 1/1/0 | 1 | 7 | 318 |
Kunshang Ji | 3 | 3/2/0 | 5 | 15 | 307 |
zifeitong | 1 | 1/0/0 | 1 | 7 | 226 |
Kaunil Dhruv | 1 | 0/0/0 | 1 | 7 | 186 |
Wei-Sheng Chin | 2 | 2/1/0 | 2 | 1 | 159 |
Joe Runde | 1 | 6/4/0 | 5 | 9 | 138 |
Maureen McElaney | 1 | 0/0/0 | 1 | 1 | 128 |
Aaron Pham | 1 | 2/1/0 | 1 | 27 | 127 |
Alexey Kondratiev (AMD) | 2 | 5/4/1 | 7 | 6 | 127 |
manikandan.tm@zucisystems.com | 1 | 0/0/0 | 1 | 6 | 125 |
Prashant Gupta | 1 | 0/0/0 | 1 | 3 | 103 |
Gregory Shtrasberg | 2 | 1/1/0 | 2 | 5 | 100 |
Kyle Sayers | 1 | 0/0/0 | 1 | 4 | 79 |
Richard Liu | 1 | 0/0/0 | 1 | 5 | 78 |
Kevin Lin | 1 | 3/3/0 | 3 | 6 | 72 |
sumitd2 | 1 | 2/1/0 | 2 | 2 | 56 |
omrishiv | 1 | 0/0/0 | 1 | 3 | 55 |
Adam Lugowski | 1 | 0/0/0 | 1 | 1 | 54 |
Kevin H. Luu | 1 | 3/0/2 | 3 | 4 | 51 |
Ronen Schaffer | 1 | 0/0/0 | 1 | 4 | 50 |
Jonathan Berkhahn | 1 | 0/0/0 | 1 | 9 | 49 |
TimWang | 1 | 1/0/0 | 1 | 2 | 48 |
Aarni Koskela | 1 | 1/1/0 | 1 | 3 | 43 |
Rui Qiao | 1 | 1/1/0 | 2 | 5 | 39 |
kushanam | 1 | 0/0/0 | 1 | 1 | 38 |
Pooya Davoodi | 1 | 1/1/0 | 1 | 2 | 35 |
wnma | 1 | 0/0/0 | 1 | 1 | 27 |
Jonas M. Kübler | 1 | 0/0/0 | 1 | 2 | 25 |
Stas Bekman | 1 | 0/0/0 | 2 | 2 | 22 |
wang.yuqi | 1 | 2/0/1 | 1 | 2 | 18 |
Russell Bryant | 1 | 3/1/1 | 1 | 1 | 17 |
Jee Jee Li | 1 | 2/2/0 | 2 | 2 | 16 |
LI MOU | 1 | 0/0/0 | 1 | 2 | 13 |
Elfie Guo | 1 | 0/0/0 | 1 | 2 | 12 |
sasha0552 | 2 | 1/1/0 | 3 | 4 | 11 |
shangmingc | 1 | 1/1/0 | 1 | 1 | 11 |
Kuntai Du | 1 | 3/1/0 | 1 | 1 | 10 |
盏一 | 1 | 1/1/0 | 1 | 1 | 9 |
Vladislav Kruglikov | 1 | 1/1/0 | 1 | 2 | 9 |
Blueyo0 | 1 | 1/1/0 | 1 | 1 | 8 |
WANGWEI | 1 | 1/1/0 | 1 | 1 | 7 |
Daniele | 1 | 3/2/0 | 2 | 2 | 7 |
lewtun | 1 | 1/1/0 | 1 | 1 | 6 |
Brian Li | 1 | 0/0/0 | 1 | 1 | 6 |
tomeras91 | 1 | 1/1/0 | 2 | 2 | 5 |
Ilya Lavrenov | 1 | 0/0/0 | 1 | 1 | 4 |
Nicolò Lucchesi | 1 | 0/0/0 | 1 | 1 | 3 |
Siyuan Liu | 1 | 0/0/0 | 1 | 1 | 2 |
Philipp Schmid | 1 | 0/0/0 | 1 | 1 | 2 |
Avshalom Manevich | 1 | 0/0/0 | 1 | 1 | 2 |
Chris | 1 | 1/1/0 | 1 | 1 | 2 |
Luis Vega | 1 | 1/1/0 | 1 | 1 | 1 |
Philippe Lelièvre (Lap1n) | 0 | 1/0/0 | 0 | 0 | 0 |
Cihan Yalçın (g-hano) | 0 | 1/0/0 | 0 | 0 | 0 |
Jani Monoses (janimo) | 0 | 1/0/0 | 0 | 0 | 0 |
Sungjae Lee (llsj14) | 0 | 2/0/0 | 0 | 0 | 0 |
yulei (yuleil) | 0 | 1/0/0 | 0 | 0 | 0 |
Alec Xiang (alxiang) | 0 | 1/0/1 | 0 | 0 | 0 |
Chih-Chieh Yang (cyang49) | 0 | 1/0/0 | 0 | 0 | 0 |
代君 (sydnash) | 0 | 1/0/0 | 0 | 0 | 0 |
Will Eaton (wseaton) | 0 | 1/0/0 | 0 | 0 | 0 |
xiaoqi (xq25478) | 0 | 1/0/0 | 0 | 0 | 0 |
zyddnys | 0 | 1/0/0 | 0 | 0 | 0 |
DefTruth | 0 | 1/0/0 | 0 | 0 | 0 |
axel7083 | 0 | 1/0/1 | 0 | 0 | 0 |
Chirag Jain (chiragjn) | 0 | 1/0/0 | 0 | 0 | 0 |
Joe Shajrawi (shajrawi) | 0 | 1/0/1 | 0 | 0 | 0 |
Amit Garg (garg-amit) | 0 | 2/0/0 | 0 | 0 | 0 |
Hanzhi Zhou (hanzhi713) | 0 | 1/0/0 | 0 | 0 | 0 |
Wallas Henrique (wallashss) | 0 | 1/0/0 | 0 | 0 | 0 |
Pastel! (Juelianqvq) | 0 | 1/0/0 | 0 | 0 | 0 |
Ed Sealing (drikster80) | 0 | 1/0/1 | 0 | 0 | 0 |
litianjian | 0 | 1/0/0 | 0 | 0 | 0 |
Lu Changqi (zeroorhero) | 0 | 1/0/0 | 0 | 0 | 0 |
zhilong (Bye-legumes) | 0 | 1/0/0 | 0 | 0 | 0 |
Chengyu Zhu (ChengyuZhu6) | 0 | 1/0/0 | 0 | 0 | 0 |
Divakar Verma (divakar-amd) | 0 | 1/0/0 | 0 | 0 | 0 |
Hongxia Yang (hongxiayang) | 0 | 1/0/0 | 0 | 0 | 0 |
jiqing-feng | 0 | 2/0/0 | 0 | 0 | 0 |
kk (kkHuang-amd) | 0 | 1/0/1 | 0 | 0 | 0 |
Maximilien de Bayser (maxdebayser) | 0 | 1/0/0 | 0 | 0 | 0 |
niuzheng168 | 0 | 1/0/1 | 0 | 0 | 0 |
Chenghao (Alan) Yang (yangalan123) | 0 | 1/0/0 | 0 | 0 | 0 |
Le Xu (happyandslow) | 0 | 1/0/1 | 0 | 0 | 0 |
saumya-saran | 0 | 1/0/0 | 0 | 0 | 0 |
tastelikefeet | 0 | 1/0/0 | 0 | 0 | 0 |
congcongchen123 | 0 | 1/0/0 | 0 | 0 | 0 |
Alexei-V-Ivanov-AMD | 0 | 2/0/2 | 0 | 0 | 0 |
Varun Sundar Rabindranath (varun-sundar-rabindranath) | 0 | 2/0/0 | 0 | 0 | 0 |
PRs column: PRs created by that developer, counted as opened/merged/closed-unmerged during the period.
The vLLM project currently has 1,575 open issues on GitHub, indicating a high level of ongoing activity and engagement from the community. Recent issues highlight a variety of challenges, including bugs related to model loading, performance regressions, and feature requests for enhanced functionality.
Notable recent issues include:
- Issue #8641: [Bug]: Using FlashInfer with FP8 model with FP8 KV cache produces an error
- Issue #8639: [Performance]: The accept rate of typical acceptance sampling
- Issue #8638: [Bug]: Loading embedding model intfloat/e5-mistral-7b-instruct results in a bind error
- Issue #8636: [Usage]: Ray + vLLM OpenAI (offline) batch inference
- Issue #8633: [Feature]: OpenAI o1-like chain-of-thought (CoT) inference workflow
- Issue #8629: [Bug]: Memory leak
- Issue #8628: [Bug]: Speculative decoding interferes with CPU-only execution
- Issue #8627: [Bug]: MistralTokenizer detokenization issue
- Issue #8626: [Usage]: Doesn't work on Pascal Tesla P100
- Issue #8625: [Bug]: Wrong "completion_tokens" counts in streaming usage (see the sketch after this list)
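To illustrate how the usage counts in #8625 reach clients, the OpenAI-compatible streaming API returns them on the final chunk when `stream_options` requests usage. A hedged sketch follows; the base URL, API key, and model name are assumptions:

```python
# Minimal sketch of streaming usage reporting against a local vLLM
# OpenAI-compatible server; base_url, api_key, and model are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Say hi"}],
    stream=True,
    stream_options={"include_usage": True},  # final chunk carries a usage block
)
for chunk in stream:
    if chunk.usage is not None:
        # Issue #8625 reports that completion_tokens here can be miscounted
        print(chunk.usage.completion_tokens)
```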
Overall, the recent activity reflects a vibrant community engaged in addressing both technical challenges and feature enhancements, but also points to areas where the project may need to improve stability and usability.
The analysis of the recent pull requests (PRs) for the vLLM project reveals a diverse range of contributions focusing on performance enhancements, bug fixes, and new feature implementations. Notable PRs include optimizations for specific models, improvements in caching mechanisms, and enhancements to the project's infrastructure and documentation.
Key changes include:
- Removal of `atomic_add` from `awq_gemm` (PR #8646), resulting in significant throughput and latency improvements.
- Cleanup of `marlin_moe_ops.cu` (PR #8643), reflecting continuous maintenance efforts.
- Handling of `best_of>1` (PR #8637), preventing unhandled exceptions and improving robustness.
- Support for running InternVL models with `num_scheduler_steps > 1` (PR #8614), addressing a specific use case and enhancing flexibility.

Several PRs focus on optimizing performance for specific hardware configurations or model types. For instance, PR #8646 addresses AMD GPU optimizations by modifying kernel implementations to eliminate bottlenecks caused by atomic operations. Similarly, PR #8645 leverages CUDA graphs to enhance throughput and reduce latency during multi-step chunked prefill operations.
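The combination PR #8645 targets can be reached through standard engine arguments. A minimal sketch, assuming an illustrative model and step count; CUDA graphs are active by default unless `enforce_eager` is set:

```python
# Sketch of the configuration PR #8645 optimizes: multi-step scheduling
# combined with chunked prefill. The model name and step count are assumptions.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    num_scheduler_steps=8,        # multi-step scheduling
    enable_chunked_prefill=True,  # split long prompts across scheduler steps
    enforce_eager=False,          # keep CUDA graphs enabled (the default)
)
```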
Bug fixes are a recurring theme across the PRs. PR #8644 aims to resolve compatibility issues related to quantized models on specific hardware platforms, while PR #8637 addresses unhandled exceptions caused by configuration settings that were not previously accounted for in the logic.
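For reference, the `best_of>1` setting guarded by PR #8637 is reached through `SamplingParams`; a minimal sketch with illustrative values:

```python
# Sketch of the best_of>1 request shape that could previously trigger an
# unhandled exception; all values are illustrative.
from vllm import SamplingParams

params = SamplingParams(
    n=1,        # return the single best candidate...
    best_of=2,  # ...selected from two sampled completions
    temperature=0.8,
)
```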
New features and enhancements are also prominent in the recent PRs. For example, PR #8614 expands the capabilities of InternVL models by allowing them to run with more than one scheduler step, thereby increasing their versatility. PR #8588 introduces support for FP8 MoE with compressed tensors, reflecting ongoing efforts to push the boundaries of model efficiency.
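Checkpoints in that format are typically loaded by pointing vLLM at a compressed-tensors model. A hedged sketch; the checkpoint name is an assumption, and vLLM can also auto-detect the quantization method from the model config:

```python
# Sketch of loading a compressed-tensors FP8 MoE checkpoint of the kind
# PR #8588 enables; the checkpoint name is an assumption.
from vllm import LLM

llm = LLM(
    model="neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8",  # illustrative MoE FP8 checkpoint
    quantization="compressed-tensors",                   # select the compressed-tensors loader
)
```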
Efforts to improve the project's infrastructure are evident in PRs like #8583, which enhances the health check mechanism of MQLLMEngine, making it more resilient under heavy load conditions. Additionally, PRs focused on code cleanup and maintenance, such as PR #8643's removal of redundant code, contribute to the overall health and maintainability of the project.
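While the MQLLMEngine health check itself is internal, its effect surfaces through the OpenAI-compatible server's `/health` endpoint, which deployments can poll. A minimal sketch; host, port, and timeout are assumptions:

```python
# Sketch of polling the OpenAI-compatible server's /health endpoint,
# which sits in front of the engine health check hardened in PR #8583.
# Host, port, and timeout values are assumptions.
import requests

def vllm_is_healthy(base_url="http://localhost:8000", timeout=5.0) -> bool:
    try:
        return requests.get(f"{base_url}/health", timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

print(vllm_is_healthy())
```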
Improvements in documentation and usability are highlighted in PRs like #8512, which adds a compatibility matrix for mutually exclusive features, helping users better understand feature interactions.
The recent pull requests demonstrate a robust development effort aimed at enhancing performance, fixing bugs, expanding features, and improving infrastructure, reflecting the community's commitment to advancing vLLM as a leading tool for large language model inference and serving.
The most active recent contributors include:
- bnellnm
- alexeykondrat
- simon-mo
- Isotr0py
- hidva
- charlifu
- njhill
- jikunshang
- KuntaiDu
- ywang96
Other contributors (sroy745, tlrmchlsmth, joerunde, gshtras, shing100, russellb, afeldman-nm, alexm-neuralmagic, aarnphm, DarkLight1337, Jeffwan, dtrifiro, youkaichao, patrickvonplaten, chenqianfzh, ruisearch42, alex-jw-brooks, kevin314) also made significant contributions focusing on performance optimizations, bug fixes, and feature enhancements.
The development team is highly active with a diverse set of contributions aimed at enhancing the vLLM project’s performance and usability. The collaborative environment fosters rapid iteration on features while ensuring stability through rigorous testing and documentation practices.