TensorRT-LLM is a project that focuses on optimizing large language models for inference on NVIDIA GPUs. While the project appears to be under active development, it also has open issues and concerns that could affect its overall stability and usefulness.
TensorRT-LLM has potential given its usefulness for running large language models on NVIDIA GPUs. However, its continued success hinges on the developers' efforts to resolve the installation, compatibility, and documentation problems reported by users.
Overall, addressing these issues could enhance TensorRT-LLM's credibility as a reliable toolbox and garner further interest from the open-source community.
The primary themes among the issues reported against this project include:
Installation and Build Issues: The majority of the issues revolve around installation and build challenges. Users report problems building the project from source (#32); build failures, particularly with Docker, and compatibility issues with NVIDIA drivers (#23) also seem commonplace. There are problems building the software in different environments such as AWS (#32), Windows (#18), various Linux distributions, and different Docker containers (#22). One issue reports a specific CUDA version requirement (#45). (A minimal environment pre-flight check is sketched after this list.)
Model and Performance-related Issues: These mostly concern the compatibility, functionality, and performance of various models with TensorRT-LLM. For example, issues have been reported about the compatibility and performance of GPT-2, Mistral 7B, and RWKV, and about speed relative to vLLM (#24, #49, #47, #29, #27). A bug affecting the output of the GPT-2 example was also reported (#53).
Dependency Problems: In particular, in issue #16 a user highlights an outdated pinned dependency (transformers==4.31.0) that is causing problems.
Other Problems: These include unclear guidance on how to run TritonServer (#39), wrong outputs in examples (#53, #37), and requests for new releases or wheels (#49, #18, #52).
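Because many of these reports trace back to mismatched CUDA toolkits, drivers, or Python dependencies, a small pre-flight check can surface problems before a long build fails. The sketch below is illustrative only: the expected CUDA major version and the transformers pin are assumptions drawn from the issue reports above (#45 and #16), not values confirmed by the project.

```python
# Illustrative pre-flight check before attempting a TensorRT-LLM build.
# The expected versions below are assumptions taken from the issue reports
# (#45: a specific CUDA version requirement, #16: transformers==4.31.0),
# not authoritative values from the project.
import importlib.metadata
import shutil
import subprocess

ASSUMED_CUDA_MAJOR = "12"        # hypothetical; adjust to the documented requirement
ASSUMED_TRANSFORMERS = "4.31.0"  # the pin discussed in issue #16


def check_cuda() -> None:
    """Report the CUDA toolkit version found on PATH, if any."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        print("nvcc not found on PATH; the CUDA toolkit may be missing")
        return
    out = subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout
    print("nvcc reports:", out.strip().splitlines()[-1])
    if f"release {ASSUMED_CUDA_MAJOR}." not in out:
        print(f"warning: expected a CUDA {ASSUMED_CUDA_MAJOR}.x toolkit")


def check_transformers() -> None:
    """Compare the installed transformers version against the documented pin."""
    try:
        installed = importlib.metadata.version("transformers")
    except importlib.metadata.PackageNotFoundError:
        print("transformers is not installed")
        return
    print("transformers:", installed)
    if installed != ASSUMED_TRANSFORMERS:
        print(f"warning: the project pins transformers=={ASSUMED_TRANSFORMERS}")


if __name__ == "__main__":
    check_cuda()
    check_transformers()
```

A check like this does not replace the project's own installation instructions, but it makes environment mismatches such as those reported in #23 and #45 visible up front.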
One of the more significant problems appears to be compatibility issues with certain drivers and CUDA versions (#23, #45) and with working across different platforms (#22, #18). The build and installation issues expose areas where more robust testing is needed (#32).
Performance issues (#29, #24) raise concerns about the tool's scalability and efficiency. Inadequate documentation and a lack of clear guidance (#52, #39) suggest the need for better user guides, examples, or tutorials.
Major uncertainties lie in the project's compatibility with the models and software dependencies users rely on, such as PyTorch and different GPT variants (#27, #49, #47, #16); how well various models work and perform with the tool remains unclear.
Another uncertainty is the unknown response time for resolving these issues, particularly those affecting the tool's usability in users' environments.
Worrying anomalies include incorrect output results (#53, #37) and a specific build running out of memory (#29). These issues point to potential defects that will require attention to ensure the software functions as expected.
Another worrying sign is out-of-date dependencies, which implies the software may not be kept current or optimized against the latest libraries (#16).
A newly identified issue concerns a speed comparison with vLLM (#24), indicating a new user requirement, or a comparison baseline, that the current project may not fully cater to.
Comparing open and closed issues, it appears that most issues remain open. The closed issues primarily revolve around build and installation problems, incompatibility with certain GPUs or models, and requests for specific features. The majority of issues center on difficulties with installation and building, dependency compatibility, and support for particular models.
Thus, the core problems seen in resolved issues still recur and have not been fully addressed.
Fix link jump in windows readme.md: This pull request resolves an issue with a hyperlink in the windows readme.md file that jumps to an incorrect location. There is an active discussion, with appreciation from contributors for the fix. The stated intention is to merge it after handling the synchronization differences between the internal repository and the release branch that a direct merge could cause.
fix Forward Compatibility mode is UNAVAILABLE error: There is an issue with the BASH_ENV default variable value being overwritten in the base image. This pull request attempts to fix that problem.
Bump onnx from 1.12.0 to 1.13.0: This pull request seeks to update the onnx dependency from 1.12.0 to 1.13.0. A detailed discussion or response from reviewers is yet to be seen.
Link and Reference Fixes: There is an ongoing theme of updating dead, incorrect, or broken hyperlinks in markdown files. Notably, these changes have targeted README and documentation files, indicating that the project maintainers are working on improving documentation and readability for users.
Update Dependencies: Another theme is updating the versions of key dependencies, as seen in the currently open pull request to bump the onnx version.
The closed pull requests contain many examples of fixes and updates, including documentation improvements such as fixing dead links and other small issues. There are also several instances of updating libraries and dependencies, specifically the aarch64 and batch manager libraries, along with attempts to update the TensorRT-LLM code itself. A key takeaway from the closed pull requests is that the project appears active, with regular updates and attempts at improvement.
There do not appear to be any significant anomalies or major uncertainties within the pull requests. Although the context is limited to the pull requests themselves, they give the sense of a well-managed, organized project with active contributors and effective issue handling. It would be useful to keep an eye on how quickly pull requests are reviewed and merged, as a potential indicator of project health.
TensorRT-LLM is a toolbox for Large Language Models (LLMs), developed to build TensorRT engines that perform inference efficiently on NVIDIA GPUs. The project provides a Python API similar to PyTorch and supports models on a broad range of GPU configurations. It also offers various quantization modes, including INT4 and INT8 weight-only quantization, and an implementation of the SmoothQuant technique.
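To make the quantization remark more concrete, the following NumPy sketch illustrates the core idea behind SmoothQuant as described in the original paper: a per-channel scale migrates quantization difficulty from activations to weights while leaving the matrix product unchanged. It is a conceptual example under assumed tensor shapes and an assumed alpha of 0.5, not TensorRT-LLM's actual implementation.

```python
# Conceptual NumPy sketch of the SmoothQuant scaling idea: per-channel scales
# shift quantization difficulty from activations X to weights W, so that
# (X / s) @ (s * W) == X @ W while the scaled activations become easier to
# quantize. This is an illustration, not TensorRT-LLM's implementation.
import numpy as np

def smoothquant_scales(X: np.ndarray, W: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Per-input-channel scales s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    act_max = np.abs(X).max(axis=0)   # per-channel activation range
    w_max = np.abs(W).max(axis=1)     # per-input-channel weight range
    return (act_max ** alpha) / (w_max ** (1.0 - alpha))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8)) * np.array([1, 1, 50, 1, 1, 1, 1, 1])  # one outlier channel
W = rng.normal(size=(8, 16))

s = smoothquant_scales(X, W)
X_smooth = X / s                  # activations become easier to quantize
W_smooth = (W.T * s).T            # scales folded into the weights

# The smoothed product matches the original up to floating-point error.
assert np.allclose(X @ W, X_smooth @ W_smooth)
print("max |activation| before:", np.abs(X).max(), "after:", np.abs(X_smooth).max())
```

The assertion checks that the smoothed product equals the original, while the printed activation ranges show why the smoothed activations are friendlier to INT8 quantization.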
Key points of attention: the version appears as 9.1.0.4 in the 'Release notes' but as 9.1 in the badges at the start of the project page.