The Dispatch

OSS Report: vllm-project/vllm


vLLM Project Faces Critical Bug with FP8 Models Amid Ongoing Performance Optimizations

vLLM, a high-throughput inference engine for large language models, is experiencing a critical bug affecting FP8 model configurations, while continuing to push performance boundaries through kernel optimizations and enhanced quantization support.

Recent activities highlight significant efforts in performance optimization, particularly for AMD GPUs and CUDA graphs. However, a critical issue (#8641) with FP8 models using FlashInfer indicates potential disruptions for users relying on these configurations. The development team remains active, with contributions focused on bug fixes, feature enhancements, and infrastructure improvements.
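
The configuration implicated in #8641 combines the FlashInfer attention backend with an FP8-quantized model and an FP8 KV cache. For reference, the sketch below shows how a user would typically arrive at that combination through vLLM's offline API; the checkpoint name and flag values are illustrative assumptions, not taken from the issue.

```python
import os

# Select the FlashInfer attention backend before vLLM is initialized.
# Assumption: vLLM reads the backend choice from this environment variable.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

# FP8 weights plus an FP8 KV cache -- the combination that issue #8641
# reports as erroring. The checkpoint below is illustrative.
llm = LLM(
    model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8",
    quantization="fp8",
    kv_cache_dtype="fp8",
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=16),
)
print(outputs[0].outputs[0].text)
```

Until a fix lands, users hitting the bug can likely work around it by reverting kv_cache_dtype to "auto" or selecting a different attention backend.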

Recent Activity

Issues and PRs

Recent issues reveal ongoing challenges with performance and compatibility. Key issues include:

  • Issue #8641: FlashInfer combined with an FP8 model and FP8 KV cache produces an error, a configuration many users rely on for throughput.
  • Issue #8639: questions about the accept rate of typical acceptance sampling in speculative decoding.
  • Issue #8638: loading the embedding model intfloat/e5-mistral-7b-instruct results in a bind error.

These issues suggest a need for improved stability and documentation to support diverse configurations.

Development Team Activity

  1. ywang96

    • Commits: 17
    • Changes: 877 across 21 files.
    • Focus: Bug fixes and model enhancements.
  2. njhill

    • Commits: 15
    • Changes: 2562 across 45 files.
    • Focus: Async engine updates and multi-step processing.
  3. Isotr0py

    • Commits: 18
    • Changes: 2311 across 37 files.
    • Focus: Model support and quantization methods.
  4. simon-mo

    • Commits: 12
    • Changes: 1235 across 18 files.
    • Focus: Documentation updates and feature enhancements.
  5. charlifu

    • Commits: 2
    • Changes: 1916 across 8 files.
    • Focus: Custom paged attention kernels for ROCm.

The team is actively enhancing performance through kernel optimizations and addressing bugs to improve user experience.

Of Note

Quantified Reports

Quantify Issues



Recent GitHub Issues Activity

Timespan Opened Closed Comments Labeled Milestones
7 Days 98 49 275 0 1
14 Days 215 90 749 0 1
30 Days 362 173 1245 0 1
All Time 4497 2922 - - -

Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.

Quantify Commits



Quantified Commit Activity Over 30 Days

Developer Branches PRs Commits Files Changes
Dipika Sikka 5 5/3/1 11 47 6998
Cyrus Leung 4 9/8/0 20 175 5154
Lucas Wilkinson (LucasWilkinson) 1 1/0/0 2 29 4831
Alexander Matveev 2 5/4/0 7 47 3733
Mor Zusman (mzusman) 1 1/0/0 1 20 2846
Kyle Mistele 1 2/2/0 3 26 2765
Nick Hill 4 6/6/0 15 45 2562
Tyler Michael Smith 2 2/1/0 14 24 2558
Michael Goin 6 3/1/0 8 21 2525
Lily Liu 1 1/1/0 2 21 2388
Isotr0py 4 9/9/0 18 37 2311
Patrick von Platen 2 4/3/0 5 26 2024
Peter Salas 4 2/0/0 4 41 1973
Charlie Fu 1 2/2/0 2 8 1916
Alex Brooks 1 2/2/0 3 11 1774
afeldman-nm 2 1/0/0 3 106 1631
Yang Fan 1 0/0/0 1 14 1562
youkaichao 4 10/10/0 22 48 1504
Wenxiang 1 1/1/0 2 13 1339
Simon Mo 2 8/8/0 12 18 1235
Yangshen⚡Deng 1 0/0/0 1 21 1101
Robert Shaw 6 0/0/0 15 23 878
Roger Wang 6 6/6/0 17 21 877
bnellnm 2 4/1/1 4 32 846
Geun, Lim 1 1/1/0 1 6 835
Megha Agarwal 1 0/0/0 1 21 834
Yohan Na 1 0/0/0 1 6 825
sroy745 1 1/1/0 3 17 797
Shawn Tan 1 1/0/0 1 4 792
William Lin 2 2/1/0 8 29 781
Luka Govedič 2 1/0/0 2 16 749
Li, Jiang 1 0/0/0 1 18 729
chenqianfzh 2 1/1/0 2 6 725
Woosuk Kwon 6 4/3/0 17 16 693
Cody Yu 2 2/2/0 7 18 672
Antoni Baum 2 0/0/0 3 20 667
Pavani Majety 2 0/0/0 3 11 639
ElizaWszola 1 0/0/0 1 12 638
Jungho Christopher Cho 1 0/0/0 1 9 629
rasmith 2 4/1/1 2 5 528
Jiaxin Shan 1 1/1/0 2 11 357
Harsha vardhan manoj Bikki 2 0/0/0 2 9 329
ywfang 1 1/1/0 1 7 318
Kunshang Ji 3 3/2/0 5 15 307
None (zifeitong) 1 1/0/0 1 7 226
Kaunil Dhruv 1 0/0/0 1 7 186
Wei-Sheng Chin 2 2/1/0 2 1 159
Joe Runde 1 6/4/0 5 9 138
Maureen McElaney 1 0/0/0 1 1 128
Aaron Pham 1 2/1/0 1 27 127
Alexey Kondratiev(AMD) 2 5/4/1 7 6 127
manikandan.tm@zucisystems.com 1 0/0/0 1 6 125
Prashant Gupta 1 0/0/0 1 3 103
Gregory Shtrasberg 2 1/1/0 2 5 100
Kyle Sayers 1 0/0/0 1 4 79
Richard Liu 1 0/0/0 1 5 78
Kevin Lin 1 3/3/0 3 6 72
sumitd2 1 2/1/0 2 2 56
omrishiv 1 0/0/0 1 3 55
Adam Lugowski 1 0/0/0 1 1 54
Kevin H. Luu 1 3/0/2 3 4 51
Ronen Schaffer 1 0/0/0 1 4 50
Jonathan Berkhahn 1 0/0/0 1 9 49
TimWang 1 1/0/0 1 2 48
Aarni Koskela 1 1/1/0 1 3 43
Rui Qiao 1 1/1/0 2 5 39
kushanam 1 0/0/0 1 1 38
Pooya Davoodi 1 1/1/0 1 2 35
wnma 1 0/0/0 1 1 27
Jonas M. Kübler 1 0/0/0 1 2 25
Stas Bekman 1 0/0/0 2 2 22
wang.yuqi 1 2/0/1 1 2 18
Russell Bryant 1 3/1/1 1 1 17
Jee Jee Li 1 2/2/0 2 2 16
LI MOU 1 0/0/0 1 2 13
Elfie Guo 1 0/0/0 1 2 12
sasha0552 2 1/1/0 3 4 11
shangmingc 1 1/1/0 1 1 11
Kuntai Du 1 3/1/0 1 1 10
盏一 1 1/1/0 1 1 9
Vladislav Kruglikov 1 1/1/0 1 2 9
Blueyo0 1 1/1/0 1 1 8
WANGWEI 1 1/1/0 1 1 7
Daniele 1 3/2/0 2 2 7
lewtun 1 1/1/0 1 1 6
Brian Li 1 0/0/0 1 1 6
tomeras91 1 1/1/0 2 2 5
Ilya Lavrenov 1 0/0/0 1 1 4
Nicolò Lucchesi 1 0/0/0 1 1 3
Siyuan Liu 1 0/0/0 1 1 2
Philipp Schmid 1 0/0/0 1 1 2
Avshalom Manevich 1 0/0/0 1 1 2
Chris 1 1/1/0 1 1 2
Luis Vega 1 1/1/0 1 1 1
Philippe Lelièvre (Lap1n) 0 1/0/0 0 0 0
Cihan Yalçın (g-hano) 0 1/0/0 0 0 0
Jani Monoses (janimo) 0 1/0/0 0 0 0
Sungjae Lee (llsj14) 0 2/0/0 0 0 0
yulei (yuleil) 0 1/0/0 0 0 0
Alec Xiang (alxiang) 0 1/0/1 0 0 0
Chih-Chieh Yang (cyang49) 0 1/0/0 0 0 0
代君 (sydnash) 0 1/0/0 0 0 0
Will Eaton (wseaton) 0 1/0/0 0 0 0
xiaoqi (xq25478) 0 1/0/0 0 0 0
None (zyddnys) 0 1/0/0 0 0 0
DefTruth (DefTruth) 0 1/0/0 0 0 0
axel7083 (axel7083) 0 1/0/1 0 0 0
Chirag Jain (chiragjn) 0 1/0/0 0 0 0
Joe Shajrawi (shajrawi) 0 1/0/1 0 0 0
Amit Garg (garg-amit) 0 2/0/0 0 0 0
Hanzhi Zhou (hanzhi713) 0 1/0/0 0 0 0
Wallas Henrique (wallashss) 0 1/0/0 0 0 0
Pastel! (Juelianqvq) 0 1/0/0 0 0 0
Ed Sealing (drikster80) 0 1/0/1 0 0 0
None (litianjian) 0 1/0/0 0 0 0
Lu Changqi (zeroorhero) 0 1/0/0 0 0 0
zhilong (Bye-legumes) 0 1/0/0 0 0 0
Chengyu Zhu (ChengyuZhu6) 0 1/0/0 0 0 0
Divakar Verma (divakar-amd) 0 1/0/0 0 0 0
Hongxia Yang (hongxiayang) 0 1/0/0 0 0 0
None (jiqing-feng) 0 2/0/0 0 0 0
kk (kkHuang-amd) 0 1/0/1 0 0 0
Maximilien de Bayser (maxdebayser) 0 1/0/0 0 0 0
None (niuzheng168) 0 1/0/1 0 0 0
Chenghao (Alan) Yang (yangalan123) 0 1/0/0 0 0 0
Le Xu (happyandslow) 0 1/0/1 0 0 0
None (saumya-saran) 0 1/0/0 0 0 0
tastelikefeet (tastelikefeet) 0 1/0/0 0 0 0
None (congcongchen123) 0 1/0/0 0 0 0
None (Alexei-V-Ivanov-AMD) 0 2/0/2 0 0 0
Varun Sundar Rabindranath (varun-sundar-rabindranath) 0 2/0/0 0 0 0

PRs: opened/merged/closed-unmerged counts for PRs created by that developer during the period

Detailed Reports

Report On: Fetch issues



Recent Activity Analysis

The vLLM project currently has 1,575 open issues on GitHub, indicating a high level of ongoing activity and engagement from the community. Recent issues highlight a variety of challenges, including bugs related to model loading, performance regressions, and feature requests for enhanced functionality.

Notable themes include:

  • Performance Issues: Several users report slowdowns or unexpected behavior with specific configurations, such as speculative decoding or chunked prefill (see the sketch after this list).
  • Compatibility Concerns: Users are experiencing difficulties with certain models and configurations, particularly around quantization and multi-GPU setups.
  • Feature Requests: There is a demand for additional features like improved logging and better support for multi-modal inputs.
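
To make the first two themes concrete, the sketch below shows the engine flags involved, using vLLM's offline API. The flag names follow vLLM's engine arguments from this period; the model choices and values are illustrative assumptions, and the two configurations are shown separately because such features do not always compose (the compatibility matrix in PR #8512, discussed in the pull request report below, exists for exactly this reason).

```python
from vllm import LLM

# Chunked prefill: long prompts are split into smaller chunks so prefill
# work can be interleaved with decode steps. (Illustrative model choice.)
llm_chunked = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_chunked_prefill=True,
)

# Speculative decoding: a small draft model proposes tokens that the target
# model then verifies. Construct one engine or the other; instantiating
# both would contend for GPU memory.
llm_speculative = LLM(
    model="facebook/opt-6.7b",
    speculative_model="facebook/opt-125m",
    num_speculative_tokens=5,
    use_v2_block_manager=True,  # assumption: required for spec decode in this era
)
```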

Issue Details

Most Recently Created Issues

  1. Issue #8641: [Bug]: Using FlashInfer with FP8 model with FP8 KV cache produces an error

    • Priority: High
    • Status: Open
    • Created: 0 days ago
    • Update: No updates yet
  2. Issue #8639: [Performance]: The accept rate of typical acceptance sampling

    • Priority: Medium
    • Status: Open
    • Created: 0 days ago
    • Update: No updates yet
  3. Issue #8638: [Bug]: loading embedding model intfloat/e5-mistral-7b-instruct results in a bind error

    • Priority: High
    • Status: Open
    • Created: 0 days ago
    • Update: No updates yet
  4. Issue #8636: [Usage]: Ray + vLLM OpenAI (offline) Batch Inference

    • Priority: Low
    • Status: Open
    • Created: 0 days ago
    • Update: No updates yet
  5. Issue #8633: [Feature]: OpenAI o1-like Chain-of-thought (CoT) inference workflow

    • Priority: Medium
    • Status: Open
    • Created: 1 day ago
    • Update: No updates yet

Most Recently Updated Issues

  1. Issue #8629: [Bug]: memory leak

    • Priority: High
    • Status: Open
    • Last Updated: 0 days ago
  2. Issue #8628: [Bug]: Speculative decoding interferes with CPU-only execution

    • Priority: Medium
    • Status: Open
    • Last Updated: 0 days ago
  3. Issue #8627: [Bug]: MistralTokenizer Detokenization Issue

    • Priority: Medium
    • Status: Open
    • Last Updated: 0 days ago
  4. Issue #8626: [Usage]: doesn't work on pascal tesla P100

    • Priority: Low
    • Status: Open
    • Last Updated: 0 days ago
  5. Issue #8625: [Bug]: Wrong "completion_tokens" counts in streaming usage

    • Priority: Medium
    • Status: Open
    • Last Updated: 0 days ago

Analysis of Notable Issues

  • The issue regarding the use of FlashInfer with FP8 models (#8641) indicates a critical bug that could affect many users relying on this configuration for performance optimization.
  • Performance-related issues (#8639) suggest that users are facing challenges in achieving expected throughput, which could hinder adoption and user satisfaction.
  • Compatibility issues with specific models (#8638) highlight the need for better documentation and support for various configurations, especially for new users.

Overall, the recent activity reflects a vibrant community engaged in addressing both technical challenges and feature enhancements, but also points to areas where the project may need to improve stability and usability.

Report On: Fetch pull requests



Overview

The analysis of the recent pull requests (PRs) for the vLLM project reveals a diverse range of contributions focusing on performance enhancements, bug fixes, and new feature implementations. Notable PRs include optimizations for specific models, improvements in caching mechanisms, and enhancements to the project's infrastructure and documentation.

Summary of Pull Requests

  1. PR #8646: Optimization for AMD GPUs by removing atomic_add from awq_gemm, resulting in significant throughput and latency improvements.
  2. PR #8645: Enhancements to CUDA graphs for multi-step chunked prefill, improving performance metrics across various configurations.
  3. PR #8644: Draft PR addressing HIPBLAS_STATUS_NOT_SUPPORTED error for specific quantized models, indicating ongoing efforts to improve compatibility and performance.
  4. PR #8643: Minor cleanup by removing unnecessary code in marlin_moe_ops.cu, reflecting continuous maintenance efforts.
  5. PR #8640: Fix for an edge case in the Mistral tokenizer related to non-UTF-8 Unicode tokens, showcasing attention to detail in handling diverse input scenarios.
  6. PR #8637: Disabling multi-step speculation when best_of>1, preventing unhandled exceptions and improving robustness.
  7. PR #8614: Enabling internvl running with num_scheduler_steps > 1, addressing a specific use case and enhancing flexibility.
  8. PR #8588: Support for FP8 MoE with compressed tensors, indicating advancements in model efficiency and performance.
  9. PR #8583: Improvement to MQLLMEngine's health monitoring, replacing request-based health checks with a heartbeat loop to enhance reliability under load.

Analysis of Pull Requests

Performance Optimizations

Several PRs focus on optimizing performance for specific hardware configurations or model types. For instance, PR #8646 addresses AMD GPU optimizations by modifying kernel implementations to eliminate bottlenecks caused by atomic operations. Similarly, PR #8645 leverages CUDA graphs to enhance throughput and reduce latency during multi-step chunked prefill operations.
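
PR #8645's details aren't reproduced here, but the mechanism it builds on is worth illustrating: a CUDA graph records a fixed sequence of kernel launches once, then replays them with minimal CPU launch overhead. Below is a minimal, hypothetical PyTorch illustration of that mechanism (not vLLM's implementation; requires a CUDA-capable GPU).

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
static_in = torch.zeros(8, 1024, device="cuda")

# Warm up on a side stream before capture, as CUDA graphs require.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)

# Replay: copy new data into the static input buffer, then re-run the
# captured kernels without per-kernel launch overhead.
static_in.copy_(torch.randn(8, 1024, device="cuda"))
g.replay()
print(static_out.sum().item())
```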

Bug Fixes and Compatibility Improvements

Bug fixes are a recurring theme across the PRs. PR #8644 aims to resolve compatibility issues related to quantized models on specific hardware platforms, while PR #8637 addresses unhandled exceptions caused by configuration settings that were not previously accounted for in the logic.

Enhancements and New Features

New features and enhancements are also prominent in the recent PRs. For example, PR #8614 expands the capabilities of InternVL models by allowing them to run with more than one scheduler step, increasing their versatility. PR #8588 introduces support for FP8 MoE with compressed tensors, reflecting ongoing efforts to push the boundaries of model efficiency.

Infrastructure and Maintenance

Efforts to improve the project's infrastructure are evident in PRs like #8583, which enhances the health check mechanism of MQLLMEngine, making it more resilient under heavy load conditions. Additionally, PRs focused on code cleanup and maintenance, such as PR #8643's removal of redundant code, contribute to the overall health and maintainability of the project.
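
The heartbeat idea in PR #8583 is a generic liveness pattern: rather than asking a possibly busy engine to answer a health-check request, a background thread periodically publishes a timestamp, and the monitoring side only checks staleness. The sketch below is a hypothetical, simplified version of that pattern, not vLLM's actual code; the interval and timeout values are made-up tuning parameters.

```python
import threading
import time

HEARTBEAT_INTERVAL_S = 1.0  # hypothetical tuning values
HEARTBEAT_TIMEOUT_S = 5.0

class HeartbeatMonitor:
    """Liveness is reported from a daemon thread, so the check does not
    depend on the engine's request loop being free to answer a probe."""

    def __init__(self):
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()

    def start(self):
        threading.Thread(target=self._beat_loop, daemon=True).start()

    def _beat_loop(self):
        while True:
            with self._lock:
                self._last_beat = time.monotonic()
            time.sleep(HEARTBEAT_INTERVAL_S)

    def is_healthy(self) -> bool:
        # A stale timestamp means the process (or its heartbeat thread)
        # has stopped making progress.
        with self._lock:
            return time.monotonic() - self._last_beat < HEARTBEAT_TIMEOUT_S

monitor = HeartbeatMonitor()
monitor.start()
time.sleep(2)
print("healthy:", monitor.is_healthy())
```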

Documentation and Usability

Improvements in documentation and usability are highlighted in PRs like #8512, which adds a compatibility matrix for mutually exclusive features, helping users understand how features interact.

Conclusion

The recent pull requests for the vLLM project demonstrate a robust development effort aimed at enhancing performance, fixing bugs, expanding features, and improving infrastructure. The community's active engagement is evident through these contributions, reflecting a commitment to advancing the capabilities of vLLM as a leading tool for large language model inference and serving.

Report On: Fetch commits



Repo Commits Analysis

Development Team and Recent Activity

Team Members and Their Recent Activities

  1. bnellnm

    • Commits: 4
    • Changes: 846 across 32 files.
    • Branches: 2
    • Notable Work: Focused on kernel optimizations and bug fixes.
  2. alexeykondrat

    • Commits: 7
    • Changes: 127 across 6 files.
    • Branches: 2
    • Notable Work: Contributions to CI/build processes and ROCm support.
  3. simon-mo

    • Commits: 12
    • Changes: 1235 across 18 files.
    • Branches: 2
    • Notable Work: Documentation updates and feature enhancements.
  4. Isotr0py

    • Commits: 18
    • Changes: 2311 across 37 files.
    • Branches: 4
    • Notable Work: Extensive contributions to model support and quantization methods.
  5. hidva

    • Commits: 1
    • Changes: 9 across 1 file.
    • Branches: 1
    • Notable Work: Minor changes related to logits processing.
  6. charlifu

    • Commits: 2
    • Changes: 1916 across 8 files.
    • Branches: 1
    • Notable Work: Major updates to custom paged attention kernels for ROCm.
  7. njhill

    • Commits: 15
    • Changes: 2562 across 45 files.
    • Branches: 4
    • Notable Work: Significant updates to async engine and multi-step processing.
  8. jikunshang

    • Commits: 5
    • Changes: 307 across 15 files.
    • Branches: 3
    • Notable Work: Bug fixes and CI improvements.
  9. KuntaiDu

    • Commits: 1
    • Changes: 10 across 1 file.
    • Branches: 1
    • Notable Work: Minor bug fix related to benchmarks.
  10. ywang96

    • Commits: 17
    • Changes: 877 across 21 files.
    • Branches: 6
    • Notable Work: Various bug fixes and model enhancements.
  11. Other contributors (sroy745, tlrmchlsmth, joerunde, gshtras, shing100, russellb, afeldman-nm, alexm-neuralmagic, aarnphm, DarkLight1337, Jeffwan, dtrifiro, youkaichao, patrickvonplaten, chenqianfzh, ruisearch42, alex-jw-brooks, kevin314) also made notable contributions, focusing on performance optimizations, bug fixes, and feature enhancements.

Patterns and Themes

  • The team is actively working on performance optimization through kernel enhancements and quantization methods.
  • There is a strong focus on improving CI/CD processes with multiple members contributing to build configurations and testing frameworks.
  • Collaboration is evident with several co-authored commits indicating teamwork on complex features or bug fixes.
  • Documentation updates accompany code changes frequently, reflecting a commitment to maintaining clarity for users and contributors alike.
  • Bug fixing is a recurring theme with many developers addressing issues in various components of the project.

Conclusions

The development team is highly active with a diverse set of contributions aimed at enhancing the vLLM project’s performance and usability. The collaborative environment fosters rapid iteration on features while ensuring stability through rigorous testing and documentation practices.