vLLM, a high-throughput inference engine for large language models, is under active development, with current work focused on hardware compatibility and continuous-integration improvements across GPUs, TPUs, and other accelerators.
The project provides an easy-to-use serving framework that integrates with popular models and supports diverse hardware configurations.
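To ground the discussion, here is a minimal sketch of offline inference with vLLM's public Python API; the model name is illustrative, and any supported Hugging Face checkpoint works.

```python
# A minimal sketch of offline inference with vLLM's public API.
# The model name is illustrative; any supported HF checkpoint works.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # small model, quick smoke test
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)
```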
Recent issues and pull requests show a strong emphasis on addressing performance bottlenecks and improving compatibility across hardware environments. Notable issues include memory-management challenges (#7656) and model inference inconsistencies (#7654). Pull requests such as #7696 aim to improve GPU utilization by running prefill and decode attention kernels in parallel.
Developer | Branches | PRs | Commits | Files | Changes |
---|---|---|---|---|---|
Michael Goin | 15 | 10/8/2 | 47 | 77 | 11296 | |
Jee Jee Li | 3 | 4/3/0 | 9 | 59 | 9316 | |
Isotr0py | 4 | 4/4/0 | 10 | 47 | 6641 | |
Cyrus Leung | 7 | 6/5/0 | 18 | 167 | 6344 | |
Lucas Wilkinson | 2 | 4/2/0 | 6 | 36 | 5471 | |
afeldman-nm | 1 | 3/0/1 | 1 | 33 | 4328 | |
Roger Wang | 6 | 3/3/0 | 20 | 62 | 3193 | |
Alexander Matveev | 4 | 2/2/0 | 6 | 33 | 2790 | |
youkaichao | 8 | 16/15/1 | 38 | 63 | 2567 | |
Alphi | 4 | 0/0/0 | 4 | 15 | 2244 | |
HandH1998 | 1 | 0/0/0 | 1 | 15 | 2047 | |
Varun Sundar Rabindranath (varun-sundar-rabindranath) | 2 | 1/0/0 | 4 | 10 | 1960 | |
Dipika Sikka | 1 | 7/4/1 | 7 | 26 | 1825 | |
Woosuk Kwon | 13 | 16/15/0 | 46 | 54 | 1695 | |
William Lin | 2 | 5/3/0 | 6 | 29 | 1412 | |
Luka Govedič | 1 | 2/1/0 | 2 | 12 | 1330 | |
Cody Yu | 4 | 1/1/0 | 5 | 26 | 1121 | |
SangBin Cho | 2 | 3/1/0 | 3 | 39 | 1094 | |
Nick Hill | 3 | 6/2/0 | 9 | 37 | 1067 | |
Antoni Baum | 2 | 2/2/0 | 4 | 23 | 986 | |
Simon Mo | 4 | 3/3/0 | 11 | 15 | 925 | |
Robert Shaw | 5 | 2/0/0 | 7 | 28 | 853 | |
Mor Zusman | 1 | 2/1/0 | 2 | 16 | 820 | |
Peter Salas | 1 | 2/1/0 | 1 | 24 | 721 | |
Jungho Christopher Cho | 1 | 0/0/0 | 1 | 3 | 703 | |
jon-chuang | 1 | 1/1/0 | 2 | 35 | 640 | |
Li, Jiang (bigPYJ1151) | 1 | 1/0/0 | 1 | 14 | 494 | |
Tyler Michael Smith | 5 | 2/0/0 | 12 | 21 | 489 | |
Peng Guanwen | 4 | 0/0/0 | 4 | 10 | 477 | |
Zhanghao Wu | 1 | 0/0/0 | 1 | 1 | 430 | |
Yihuan Bu | 1 | 0/0/0 | 1 | 9 | 364 | |
bnellnm | 1 | 5/4/0 | 4 | 16 | 355 | |
Mahesh Keralapura | 1 | 1/1/0 | 2 | 18 | 315 | |
Thomas Parnell | 4 | 0/0/0 | 4 | 18 | 314 | |
Travis Johnson | 4 | 1/1/0 | 5 | 12 | 275 | |
Kuntai Du | 3 | 3/2/0 | 4 | 13 | 260 | |
Sanger Steel | 2 | 0/0/0 | 2 | 3 | 257 | |
Zach Zheng | 2 | 0/0/0 | 2 | 8 | 250 | |
Cade Daniel | 2 | 1/1/0 | 5 | 9 | 223 | |
Jiaxin Shan | 1 | 0/0/0 | 1 | 11 | 219 | |
Charlie Fu | 1 | 0/0/0 | 1 | 7 | 213 | |
nunjunj | 1 | 0/0/0 | 1 | 4 | 197 | |
Kyle Sayers | 1 | 1/1/0 | 2 | 8 | 196 | |
dongmao zhang (thesues) | 1 | 1/0/1 | 1 | 8 | 182 | |
Abhinav Goyal | 1 | 1/1/0 | 1 | 6 | 180 | |
Pooya Davoodi | 1 | 2/0/0 | 1 | 5 | 162 | |
Evan Z. Liu | 1 | 0/0/0 | 1 | 7 | 157 | |
Grant Pinkert | 1 | 1/1/0 | 1 | 4 | 157 | |
Avshalom Manevich | 1 | 0/0/0 | 1 | 1 | 157 | |
Joe Runde | 1 | 2/0/0 | 1 | 8 | 150 | |
Siyuan Liu | 1 | 0/0/0 | 1 | 4 | 143 | |
zifeitong | 1 | 0/0/0 | 1 | 5 | 136 | |
xuyi | 1 | 0/0/0 | 1 | 3 | 128 | |
Kevin H. Luu | 4 | 9/5/2 | 9 | 6 | 127 | |
Daniele | 2 | 1/0/0 | 4 | 9 | 123 | |
Kunshang Ji | 1 | 4/4/0 | 4 | 7 | 108 | |
Zijian Hu | 1 | 0/0/0 | 1 | 30 | 106 | |
Lily Liu | 2 | 1/1/0 | 4 | 6 | 106 | |
Yehoshua Cohen | 1 | 0/0/0 | 1 | 2 | 90 | |
Joe | 1 | 0/0/0 | 1 | 10 | 86 | |
Rui Qiao | 1 | 4/1/1 | 4 | 7 | 77 | |
Ilya Lavrenov | 2 | 1/1/0 | 3 | 4 | 77 | |
shangmingc | 1 | 1/1/0 | 1 | 3 | 75 | |
Maximilien de Bayser | 1 | 0/0/0 | 1 | 3 | 72 | |
Besher Alkurdi | 1 | 1/1/0 | 1 | 2 | 69 | |
Stas Bekman | 1 | 0/0/0 | 1 | 1 | 66 | |
Murali Andoorveedu | 2 | 0/0/0 | 2 | 4 | 60 | |
Wallas Henrique | 1 | 1/1/0 | 1 | 7 | 59 | |
Alex Brooks | 1 | 1/1/0 | 1 | 3 | 54 | |
Ronen Schaffer | 1 | 0/0/0 | 1 | 4 | 50 | |
tomeras91 | 2 | 1/1/0 | 3 | 5 | 49 | |
Bongwon Jang | 2 | 0/0/0 | 2 | 2 | 49 | |
Aurick Qiao | 1 | 0/0/0 | 1 | 2 | 46 | |
Jeff Fialho | 1 | 0/0/0 | 1 | 2 | 29 | |
Kameshwara Pavan Kumar Mantha | 1 | 0/0/0 | 1 | 2 | 28 | |
Sage Moore | 1 | 0/0/0 | 1 | 8 | 26 | |
Chang Su | 2 | 0/0/0 | 2 | 4 | 24 | |
Elsa Granger | 1 | 0/0/0 | 1 | 1 | 19 | |
Alexei-V-Ivanov-AMD | 3 | 3/1/1 | 3 | 2 | 16 | |
Earthwalker | 1 | 0/0/0 | 1 | 2 | 13 | |
omrishiv | 2 | 1/0/0 | 2 | 5 | 13 | |
zhaotyer | 1 | 0/0/0 | 1 | 1 | 9 | |
fzyzcjy | 2 | 1/1/0 | 2 | 2 | 8 | |
sasha0552 | 1 | 1/1/0 | 1 | 1 | 7 | |
Gordon Wong | 1 | 1/1/0 | 1 | 1 | 7 | |
Hongxia Yang | 1 | 0/0/0 | 1 | 1 | 7 | |
Fei | 1 | 0/0/0 | 1 | 1 | 6 | |
Anthony Platanios | 1 | 0/0/0 | 1 | 1 | 6 | |
Jae-Won Chung | 1 | 0/0/0 | 1 | 2 | 6 | |
Aditya Paliwal | 1 | 0/0/0 | 1 | 1 | 6 | |
Qingquan Song | 1 | 0/0/0 | 1 | 1 | 6 | |
xiaobochen123 | 1 | 0/0/0 | 1 | 1 | 6 | |
AllenDou (AllenDou) | 1 | 1/0/1 | 1 | 2 | 5 | |
PHILO-HE | 1 | 1/0/1 | 1 | 3 | 5 | |
Jacob Schein | 1 | 0/0/0 | 1 | 1 | 5 | |
Cherilyn Buren | 1 | 0/0/0 | 1 | 1 | 5 | |
jack | 1 | 1/1/0 | 1 | 1 | 5 | |
Ali Panahi | 1 | 0/0/0 | 1 | 1 | 4 | |
Rafael Vasquez | 1 | 0/0/0 | 1 | 1 | 4 | |
jianyizh | 1 | 1/1/0 | 1 | 1 | 4 | |
Jie Fu (傅杰) | 1 | 0/0/0 | 1 | 2 | 4 | |
chenqianfzh | 1 | 1/0/0 | 1 | 1 | 4 | |
Andrew Wang | 1 | 1/1/0 | 1 | 2 | 3 | |
Harry Mellor | 1 | 0/0/0 | 1 | 2 | 3 | |
Atilla Akkuş | 1 | 0/0/0 | 1 | 1 | 3 | |
Cheng Li | 1 | 0/0/0 | 1 | 1 | 2 | |
Xander Johnson | 1 | 1/1/0 | 1 | 1 | 2 | |
Noam Gat | 1 | 0/0/0 | 1 | 1 | 2 | |
LF Marques | 1 | 0/0/0 | 1 | 1 | 2 | |
liuyhwangyh | 1 | 0/0/0 | 1 | 1 | 2 | |
Gurpreet Singh Dhami | 1 | 0/0/0 | 1 | 1 | 2 | |
Andrew Song | 1 | 1/1/0 | 1 | 1 | 1 | |
Katarzyna Papis | 1 | 0/0/0 | 1 | 1 | 1 | |
omkar kakarparthi | 1 | 0/0/0 | 1 | 1 | 1 | |
Xiaoyu Zhang (BBuf) | 0 | 1/0/1 | 0 | 0 | 0 | |
Yangshen⚡Deng (TKONIY) | 0 | 1/0/0 | 0 | 0 | 0 | |
Wenxiang (wenxcs) | 0 | 1/0/0 | 0 | 0 | 0 | |
Roy (esmeetu) | 0 | 1/0/0 | 0 | 0 | 0 | |
Gregory Shtrasberg (gshtras) | 0 | 2/0/0 | 0 | 0 | 0 | |
Libin Tang (libinta) | 0 | 1/0/0 | 0 | 0 | 0 | |
rasmith | 0 | 1/0/0 | 0 | 0 | 0 | |
sroy745 | 0 | 2/0/1 | 0 | 0 | 0 | |
Makadamia (alexw994) | 0 | 1/0/0 | 0 | 0 | 0 | |
Tyler Rockwood (rockwotj) | 0 | 1/0/0 | 0 | 0 | 0 | |
Shawn Tan (shawntan) | 0 | 1/0/0 | 0 | 0 | 0 | |
tjandy98 | 0 | 1/0/0 | 0 | 0 | 0 | |
Yuyi Ao (George-ao) | 0 | 1/0/1 | 0 | 0 | 0 | |
WanXiaopei | 0 | 1/0/0 | 0 | 0 | 0 | |
zhrrr (izhuhaoran) | 0 | 1/0/0 | 0 | 0 | 0 | |
LI MOU (learninmou) | 0 | 1/0/0 | 0 | 0 | 0 | |
speggioale | 0 | 1/0/0 | 0 | 0 | 0 | |
lcq (zeroorhero) | 0 | 2/0/1 | 0 | 0 | 0 | |
Nadav Shmayovits (NadavShmayo) | 0 | 1/0/0 | 0 | 0 | 0 | |
niuzheng168 | 0 | 1/0/1 | 0 | 0 | 0 | |
Richard Liu (richardsliu) | 0 | 1/0/0 | 0 | 0 | 0 | |
alexeykondrat | 0 | 1/0/0 | 0 | 0 | 0 | |
alan yang (cassiewilliam) | 0 | 1/0/0 | 0 | 0 | 0 | |
Prashant Gupta (prashantgupta24) | 0 | 1/0/1 | 0 | 0 | 0 | |
Vladislav Kruglikov (vladislavkruglikov) | 0 | 6/0/5 | 0 | 0 | 0 |
PRs column: pull requests created by that developer, counted as opened/merged/closed-unmerged during the period.
Timespan | Opened | Closed | Comments | Labeled | Milestones |
---|---|---|---|---|---|
7 Days | 114 | 38 | 253 | 0 | 1 |
14 Days | 239 | 100 | 568 | 0 | 1 |
30 Days | 392 | 192 | 1122 | 0 | 1 |
All Time | 4034 | 2657 | - | - | - |
Like all attempts to quantify software activity, these numbers are imperfect but sometimes useful. The Comments, Labeled, and Milestones columns count only issues opened within the given timespan.
The vllm-project/vllm repository currently has 1,377 open issues, with notable recent activity including the creation of several new issues and discussions around bugs, usage questions, and feature requests. A significant number of issues are related to model performance, quantization problems, and compatibility with various hardware setups.
A recurring theme in the recent issues is inference performance, particularly memory management and GPU utilization. There are also multiple reports of specific models failing to load or producing incorrect responses, pointing to underlying gaps in model support or configuration.
Issue #7704: [Bug]: errors when loading mixtral 8x7b
Issue #7702: [Usage]: alignment between trl and llm.generate
Issue #7700: [Usage]: how to abort request?
Issue #7699: [Bug]: vLLM inconsistently crashes on startup for multinode cluster
Issue #7697: [RFC]: Enable Memory Tiering for vLLM
Issue #7656: [Misc]: OOM (CUDA Out Of Memory) when running LLMs in WSL using vLLM
Issue #7655: [Misc]: Virtual Office Hours: August 8 and August 21
Issue #7654: [Bug]: Gemma2 models inference using vLLM 0.5.4 produces incorrect responses
Issue #7653: [Bug]: Error happened with Large scale requests based on 0.5.4 vllm
Issue #7652: [Bug]: The error is caused by RuntimeError during inference.
Overall, the current state of issues reflects a community actively seeking solutions to performance bottlenecks and stability concerns, especially in multi-GPU and distributed environments.
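Several of the issues above, #7656 in particular, involve CUDA out-of-memory failures. As a hedged illustration, these are the engine knobs most commonly tuned to mitigate OOM; the model name and values are placeholders to adjust for your hardware.

```python
# Common memory knobs when hitting CUDA OOM (cf. issue #7656).
# Model name and values are illustrative; tune for your hardware.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.85,  # fraction of GPU memory vLLM may reserve
    max_model_len=4096,           # cap context length to shrink the KV cache
    tensor_parallel_size=2,       # shard weights across 2 GPUs if available
    enforce_eager=True,           # skip CUDA graphs to save some memory
)
```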
The dataset covers 83 open pull requests (PRs) from the vLLM project, which provides an efficient inference and serving engine for large language models. The PRs span bug fixes, new features, performance optimizations, and model-compatibility enhancements.
PR #7708: Fix ShardedStateLoader for vllm fp8 quantization - A recent PR aimed at fixing issues with the ShardedStateLoader related to fp8 quantization.
PR #7707: [Core] Pipe worker_class_fn argument in Executor - Fixes an oversight so the worker_class_fn API can be used when subclassing executors.
PR #7706: [Spec Decoding] Use target model max length as default for draft model - Adjusts the default maximum length for draft models to match the target model's length.
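For context on PR #7706, this is a sketch of a speculative-decoding setup as configured in vLLM around the 0.5.x series; model names are illustrative and exact flag names may differ between versions.

```python
# Sketch of a speculative-decoding configuration circa vLLM 0.5.x.
# Model names are illustrative; flag names may vary between versions.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",   # target model (placeholder)
    speculative_model="JackFram/llama-68m",   # small draft model (placeholder)
    num_speculative_tokens=5,                 # tokens proposed per step
    use_v2_block_manager=True,                # required for spec decode in 0.5.x
    # With PR #7706, speculative_max_model_len would default to the target
    # model's max length instead of the draft model's.
)
```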
PR #7705: [NOT-FOR-REVIEW] Refactor Dockerfile - A refactor of the Dockerfile, currently not intended for review.
PR #7703: [multi-step] Raise error if not using async engine - Implements an error raise mechanism if the async engine is not being utilized.
PR #7701: [WIP, Kernel] (2/N) Machete - Integrate into GPTQMarlinLinearMethod and CompressedTensorsWNA16 - Work in progress for integrating Machete into existing methods.
PR #7698: [BugFix] Raise all exception variations in async generator - Addresses exceptions raised in async generators to ensure all variations are caught.
PR #7696: [Kernel] Support prefill and decode attention kernel in parallel - Enhances GPU resource utilization by supporting parallel attention kernels during prefill.
PR #7691: [Model][Kernel][Bugfix] Commits for new MSFT PhiMoE model - Introduces a new model and fixes a bug related to LongRoPE.
PR #7672: [Model][Bugfix] Add glm-4v Model and Fix bnb Quantization Issue - Integrates glm-4v model and resolves quantization issues affecting its performance.
PR #7666: [Frontend][Core] Move logits processor construction to engine - Refactors logits processor construction to improve performance and simplify user experience.
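To illustrate what PR #7666 refactors: today users pass logits-processor callables through SamplingParams, and the PR moves their construction into the engine. A minimal user-side sketch follows, with an arbitrary token id chosen purely for illustration.

```python
# User-side logits processor as passed via SamplingParams today; PR #7666
# moves the construction of such processors into the engine itself.
import torch
from vllm import LLM, SamplingParams

def ban_token(token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
    # Forbid an arbitrary token id at every decoding step (id 42 is
    # purely illustrative).
    logits[42] = float("-inf")
    return logits

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(max_tokens=32, logits_processors=[ban_token])
print(llm.generate(["Hello"], params)[0].outputs[0].text)
```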
PR #7658: [Kernel] Add opcheck tests for punica kernels - Introduces tests for punica custom operations to ensure correctness.
PR #7654: [Frontend] add json_schema support from OpenAI protocol - Adds support for json_schema in OpenAI protocol requests.
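As a hedged example of the request shape PR #7654 targets, here is an OpenAI-protocol chat completion using a response_format of type json_schema against a vLLM server; the base URL, API key, and model name are placeholders, and vLLM honors this payload only once the PR is merged.

```python
# OpenAI-protocol request using a json_schema response_format, the shape
# PR #7654 adds support for. Base URL, key, and model are placeholders;
# vLLM honors this payload only once the PR is merged.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Name a city and its country."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "city_info",
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
            },
        },
    },
)
print(resp.choices[0].message.content)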
PR #7652: [Core] Logprobs support in Multi-step - Introduces logprobs support within multi-step processing.
PR #7648: [Core] Added streaming support to LLM class - Implements streaming capabilities in the LLM class's generate method.
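Since PR #7648 is still open, the following is only a hypothetical sketch of how a streaming generate call might look if the PR lands as described; the stream flag and the incremental outputs are assumptions, not vLLM's current API.

```python
# Hypothetical sketch of the streaming generate proposed in PR #7648.
# The PR is unmerged: the `stream` flag and incremental outputs below are
# guesses at the intent, not vLLM's actual API.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(max_tokens=64)

# Assumed: generate() yields partial RequestOutput objects when streaming.
for partial in llm.generate(["Tell me a story."], params, stream=True):
    print(partial.outputs[0].text, end="", flush=True)
```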
PR #7643: [WIP][SPMD] Support spec decoding - Adds support for speculative decoding with SPMD architecture.
PR #7631: [Encoder decoder][WIP] Add cuda graph support during decoding for encoder-decoder models - Draft PR adding CUDA graph support during decoding phases.
PR #7615: [Model] Add UltravoxModel and UltravoxConfig - Integrates Ultravox speech model into vLLM.
PR #7613: Support vLLM single and multi-host TPUs on GKE - Fixes issues related to TPU usage on GKE with RayServe.
PR #7598: [Build/CI] Empty commit. Testing the present CI state - A placeholder PR intended to test CI functionality.
PR #7597: [Misc] Add logging for engine and executor cleanup - Introduces logging enhancements to monitor engine and executor shutdown processes.
PR #7585: [Kernel][LoRA] Add assertion for punica sgmv kernels - Adds assertions to prevent incorrect calculations in Punica SGMV kernels.
PR #7584: [Ray backend] Better error when pg topology is bad - Improves error handling related to placement group topology issues in Ray backend.
PR #7568: [Prototype] Create and use custom NCCL group for aDAG - Draft PR introducing custom NCCL groups for better DAG handling.
PR #7565: [Tests] Disable retries and use context manager for openai client - Updates testing practices around the OpenAI client library to improve reliability.
PR #7564: [Kernel] Use mutable_data_ptr or const_data_ptr instead of data_ptr - Refines data pointer usage across the codebase to improve clarity regarding mutability.
PR #7563: Varun/multi step chunked prefill - Draft PR focused on implementing multi-step chunked prefill functionality.
PR #7562: [Bugfix] neuron: enable tensor parallelism - Enables tensor parallelism on neuron devices with updated block sizes.
PR #7559: [model] Support for Llava-Next-Video model - Adds support for Llava-Next-Video model integration into vLLM framework.
PR #7549: [WIP] Store speculative states - Draft PR introducing storage capabilities for speculative decoding states within the API server logic.
The current dataset of pull requests reflects a diverse range of ongoing developments within the vLLM project, showcasing active engagement from contributors across various aspects of the codebase, including core functionalities, model integrations, performance optimizations, and bug fixes.
A significant theme among these pull requests is performance, particularly memory management and computational efficiency when serving large language models (LLMs). For example, #7696 optimizes attention by running prefill and decode kernels in parallel, while #7654 and #7648 improve API interactions through JSON-schema support and streaming request handling.
Another recurring theme is the introduction of new models or enhancements to existing ones, as seen in #7691 (PhiMoE) and #7559 (Llava-Next-Video). This reflects a sustained effort to expand vLLM's capabilities by integrating models that can leverage its architecture effectively.
Several pull requests carry "do-not-merge" or "WIP" labels (#7705, #7701, etc.), indicating features that still need refinement or testing before integration into the main branch. This suggests a healthy process in which contributors iterate openly on their work before finalizing it.
Despite the repository's high volume of open pull requests (383), recent merge activity across many of them is limited, which may indicate bottlenecks in the review process or resource constraints among maintainers. The result can be delays in landing features and fixes that are already developed but remain unmerged due to review backlogs or prioritization.
Some pull requests have sparked discussions among contributors regarding implementation details or design choices (e.g., PRs like #7666 regarding logits processor construction). These discussions highlight an active community engaged in collaborative decision-making processes, which is crucial for maintaining code quality and ensuring that changes align with project goals.
Overall, this dataset illustrates a vibrant development environment within the vLLM project where contributors are actively working towards enhancing performance, expanding functionality, and addressing bugs effectively. However, attention should be given to managing merge activities more efficiently to capitalize on this momentum and ensure timely delivery of improvements to users relying on vLLM's capabilities.
Kunshang Ji (jikunshang)
Antoni Baum (Yard1): Worked on the AttentionState abstraction, impacting multiple files in the attention module.
Lucas Wilkinson (LucasWilkinson)
Ronen Schaffer (ronensc)
Isotr0py
Ilya Lavrenov (ilya-lavrenov)
Youkaichao
Jianyizh
Zijian Hu (zijian-hu): Added tie_word_embeddings support across models.
Kevin H. Luu (khluu)
Travis Johnson (tjohnson31415)
Woosuk Kwon (WoosukKwon)
Overall, the development team demonstrates a robust approach to maintaining and enhancing the vLLM project through collaborative efforts across various domains of expertise.