The Dispatch

OSS Report: Lightning-AI/litgpt


LitGPT Development Focuses on Compatibility and Performance Enhancements

LitGPT, a framework for large language models, has seen active development with a focus on improving compatibility across platforms and optimizing performance. The project supports pretraining, finetuning, and deployment of LLMs, emphasizing ease of use and scalability.

Recent Activity

Recent issues and pull requests indicate ongoing efforts to address compatibility challenges, particularly with CUDA errors and multi-GPU setups. The team is also working on documentation improvements for custom datasets and advanced features like LoRA.

Development Team Activity

  1. Sebastian Raschka (rasbt)

    • Added Chainlit Studio (3 days ago).
    • Simplified MPS support (6 days ago).
    • Enabled MPS support for LitGPT (7 days ago).
  2. Motsepe-Jr (challenger)

    • Fixed device error in Decode Stream (10 days ago).
  3. Jirka Borovec (Borda)

    • Minor README update/typos (11 days ago).
  4. apaz-cli (apaz)

    • Added batched_generate_fn() (14 days ago).
  5. Thomas Viehmann (t-vi)

    • Improved testing for macOS compatibility.
  6. Sander Land (sanderland)

    • Updated check_nvlink_connectivity (30 days ago).
  7. Andrei-Aksionov

    • Worked on disabling attention masks.

The team is actively collaborating, with Sebastian Raschka leading major contributions and others focusing on specific areas like testing and documentation.

Of Note

Quantified Reports

Quantify Issues



Recent GitHub Issues Activity

Timespan  Opened  Closed  Comments  Labeled  Milestones
7 Days         6       2        14        0           1
30 Days       27      13        72        0           1
90 Days       78      46       178        2           1
1 Year       388     199      1119      147           2
All Time     746     538         -        -           -

Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.

Quantify Commits



Quantified Commit Activity Over 30 Days

Developer          Branches  PRs      Commits  Files  Changes
Sebastian Raschka         3  14/14/0       22     29     2100
apaz                      2  3/3/1          7      9      942
Thomas Viehmann           2  3/2/0         14      7      196
Sander Land               2  2/2/0          3      4      114
Andrei-Aksionov           1  1/0/1          2      4       69
Jirka Borovec             1  1/1/0          1      1       10
challenger                1  1/1/0          1      2        6

PRs: counts are opened/merged/closed-unmerged, for PRs created by that developer during the period.

Detailed Reports

Report On: Fetch issues



Recent Activity Analysis

The Lightning-AI/litgpt repository currently has 208 open issues, with recent activity indicating a mix of bug reports, feature requests, and user inquiries. Notably, issues related to CUDA errors and model loading problems have been prevalent, suggesting potential challenges with GPU compatibility and model configuration.

Several issues highlight common themes such as difficulties in multi-GPU training, the need for better documentation on custom datasets, and inconsistencies in model performance across different setups. The presence of multiple unresolved queries about quantization and LoRA further indicates that users are actively seeking clarity on advanced features.

Issue Details

Most Recently Created Issues

  1. Issue #1733: cuda error when serve with workers_per_device > 1

    • Priority: Enhancement
    • Status: Open
    • Created: 1 day ago
    • Comments: Discussion about CUDA assertion failures when using multiple workers.
  2. Issue #1729: use initial_checkpoint_dir for continue-pretraining but can't load model correctly

    • Priority: Documentation/Enhancement/Question
    • Status: Open
    • Created: 2 days ago
    • Comments: User reports issues loading models after changing checkpoint directories.
  3. Issue #1727: Question about tie_embeddings

    • Priority: Question
    • Status: Open
    • Created: 6 days ago
    • Comments: User queries about the implementation of tied embeddings in the model.
  4. Issue #1723: Is Support for the DeepSeek v2.5 model on the roadmap?

    • Priority: Checkpoints
    • Status: Open
    • Created: 8 days ago
    • Comments: Inquiry about future support for a specific model.
  5. Issue #1717: Cannot attend to 9904, block size is only 4096

    • Priority: Question
    • Status: Open
    • Created: 9 days ago
    • Comments: User faces an issue with sequence length exceeding model limits.
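Issue #1727 above asks about tie_embeddings. The underlying technique, weight tying, reuses the input embedding matrix as the output projection. A minimal NumPy sketch of the idea (hypothetical names, not LitGPT's actual implementation):

```python
import numpy as np

vocab_size, d_model = 8, 4
rng = np.random.default_rng(0)

# One shared matrix serves both directions when embeddings are tied.
embedding = rng.normal(size=(vocab_size, d_model))

def embed(token_ids):
    # Input side: look up rows of the embedding matrix.
    return embedding[token_ids]

def logits(hidden):
    # Output side: project hidden states against the same matrix, transposed.
    return hidden @ embedding.T

h = embed(np.array([1, 3]))   # shape (2, d_model)
scores = logits(h)            # shape (2, vocab_size)
```

Tying removes one of the two largest weight matrices in the model and couples their gradients, which is why questions about its exact behavior come up during finetuning.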

Most Recently Updated Issues

  1. Issue #1715: llm.generate issue on CPU machines

    • Priority: Bug
    • Status: Open (recently edited)
    • Last Updated: 8 days ago
  2. Issue #1714: llm.generate function does not work on Mac (MPS) devices anymore

    • Priority: Bug
    • Status: Open (recently edited)
    • Last Updated: 8 days ago
  3. Issue #1711: Manual convert_to_litgpt for Phi-3.5-mini-instruct downloaded weights from HF

    • Priority: Bug
    • Status: Open (recently edited)
    • Last Updated: 13 days ago

Analysis of Notable Issues

  • A recurring theme in recent issues is related to CUDA errors and compatibility, particularly when using multiple GPUs or specific configurations like LoRA and quantization.
  • The lack of clear documentation regarding custom datasets has led to confusion among users trying to implement their own data for pretraining or finetuning.
  • Users are actively seeking support for new models and features, indicating a vibrant community interest in expanding the capabilities of LitGPT.

The complexity of integrating various features like quantization and LoRA into existing workflows appears to be a significant pain point for users, highlighting the need for improved guidance and examples in the documentation.
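Since LoRA integration is the recurring pain point, a minimal sketch of the LoRA computation itself may help orient readers. This is the generic low-rank-adapter math, not LitGPT's LoRALinear class:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 16, 4, 8

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # zero-init, so training starts at W

def lora_forward(x):
    # Base path plus scaled low-rank update: y = x W^T + (alpha/r) x A^T B^T
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d_in))
y = lora_forward(x)   # identical to x @ W.T until B is trained away from zero
```

Only A and B (2 * r * d parameters instead of d * d) are trained, which is what makes LoRA attractive for users with limited GPU memory.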

Report On: Fetch pull requests



Overview

The analysis of the pull requests (PRs) for the Lightning-AI/litgpt project reveals a vibrant and active development environment. The project has seen significant contributions in terms of new features, bug fixes, and enhancements aimed at improving performance, usability, and compatibility with various hardware setups. The PRs indicate a strong focus on expanding model support, optimizing training and inference processes, and enhancing the overall user experience.

Summary of Pull Requests

Open Pull Requests

  1. PR #1725: bump macos to m1

    • Status: Open
    • Significance: Addresses compatibility issues with macOS M1 chips, attempting to resolve segfaults and failing tests.
    • Notable: Involves discussions about test failures on different Mac models and potential memory issues.
  2. PR #1538: Do not wrap LoRA layers with FSDP

    • Status: Open
    • Significance: Aims to optimize memory usage by modifying how FSDP wraps layers in the Transformer block.
    • Notable: Discussion on memory usage comparisons before and after the change.
  3. PR #1421: WIP: TensorParallel with new strategy

    • Status: Open
    • Significance: Demonstrates the application of a new ModelParallelStrategy in generating/tp.py.
    • Notable: Highlights potential for applying more parallelism in model training.
  4. PR #1354: Add resume for adapter_v2, enable continued finetuning for adapter

    • Status: Open
    • Significance: Introduces functionality to resume finetuning for adapters, addressing issues faced by users with time-limited GPU access.
    • Notable: Discussion on updating step_count and iteration count during resuming.
  5. PR #1350: Add LongLora for both full and lora fine-tuning

    • Status: Open
    • Significance: Implements LongLoRA for both LoRA-based and full fine-tuning.
    • Notable: Discussion on context length handling and supported options.
  6. PR #1331: example for full finetuning with python code done!

    • Status: Open
    • Significance: Provides a non-CLI Python code example for full finetuning, aimed at helping users get started with LitGPT.
    • Notable: Directly addresses user needs for clearer starting points in experimentation.
  7. PR #1232: Correct an apparent logger output directory bug

    • Status: Open
    • Significance: Fixes a hardcoded logger output directory issue that could confuse users.
    • Notable: Discussion on logger name consistency across different logging methods.
  8. PR #1179: Improved Lora finetuning script

    • Status: Open
    • Significance: Enhances the LoRA finetuning script by checking validation data length against training data length.
    • Notable: Discussion on maintaining consistency in training prompts.
  9. PR #1057: [WIP] Simplified preparation of pretraining datasets

    • Status: Open
    • Significance: Aims to simplify dataset preparation for pretraining tasks, particularly useful for large-scale data.
    • Notable: Blocked by issues with running multiple optimize calls together.
  10. PR #1013: Drop interleave placement in QKV matrix

    • Status: Open
    • Significance: Changes the placement of weights in the QKV matrix to align with newer models' practices.
    • Notable: Discussion on potential breaking changes and performance differences due to this modification.
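The change in PR #1538 can be pictured as a custom FSDP auto-wrap policy that declines to wrap adapter modules. The sketch below shows only the decision function (PyTorch's auto_wrap_policy callables receive a module, a recurse flag, and the unwrapped parameter count); the class names and size threshold here are hypothetical, not LitGPT's actual values:

```python
# Hypothetical class names standing in for LoRA adapter layers.
LORA_CLASS_NAMES = {"LoRALinear", "LoRAQKVLinear"}

def lora_aware_wrap_policy(module, recurse: bool, nonwrapped_numel: int) -> bool:
    """Return True if FSDP should wrap `module` (or keep recursing into it)."""
    if recurse:
        return True  # always descend into children first
    if type(module).__name__ in LORA_CLASS_NAMES:
        return False  # leave small adapter layers unwrapped to save memory
    return nonwrapped_numel >= 100_000  # wrap only reasonably large modules
```

A callable like this is passed as auto_wrap_policy when constructing FSDP; leaving the tiny LoRA matrices unwrapped avoids per-layer sharding overhead on parameters that barely benefit from it, which matches the memory-usage discussion in the PR.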

Closed Pull Requests

  1. PR #1728: Add Chainlit Studio

    • Merged 3 days ago.
    • Adds a tutorial showing how to connect LitGPT to a UI using Chainlit.
  2. PR #1726: Simplify MPS support

    • Merged 6 days ago.
    • Simplifies MPS support based on community suggestions, addressing numerical differences between MPS and CPU operations.
  3. PR #1724: Enable MPS support for LitGPT

    • Merged 7 days ago.
    • Re-enables MPS support with an alternative implementation of index_copy_.
  4. Additional PRs focused on version bumps, bug fixes, and minor enhancements that reflect ongoing maintenance and improvement efforts within the project.
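PR #1724 re-enabled MPS by swapping index_copy_ for an alternative implementation. The sketch below reproduces index_copy_ semantics with plain advanced indexing, written in NumPy for clarity; LitGPT's actual replacement may differ in detail:

```python
import numpy as np

def index_copy(dest, dim, index, src):
    """Copy slices of `src` into `dest` at positions `index` along `dim`,
    mirroring the semantics of torch.Tensor.index_copy_."""
    sl = [slice(None)] * dest.ndim
    sl[dim] = index
    dest[tuple(sl)] = src
    return dest

# Typical KV-cache-style use: write new entries into fixed positions.
cache = np.zeros((2, 4))
new = np.ones((2, 2))
index_copy(cache, 1, np.array([1, 3]), new)   # fill columns 1 and 3
```

Fallbacks like this matter on backends (such as MPS) where a specific in-place op is missing or numerically inconsistent with the CPU implementation.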

Analysis of Pull Requests

The pull requests indicate a strong focus on enhancing compatibility across hardware platforms (e.g., macOS M1 support), optimizing memory usage (e.g., not wrapping LoRA layers with FSDP), and expanding functionality (e.g., adding LongLoRA support). The discussions within these PRs show contributors collaborating on hard problems: memory management during training and inference, consistent behavior across environments, and clearer guidance for users through improved examples and documentation.

The presence of multiple open PRs related to fine-tuning techniques (e.g., LoRA, LongLora) suggests an active interest in advancing model training methodologies within the community. Additionally, the quick turnaround time for merging PRs that fix bugs or improve usability reflects a commitment to maintaining high-quality standards in the project's development lifecycle.

Overall, the analysis of these pull requests showcases a dynamic development environment where contributors are actively working towards enhancing the capabilities, performance, and user experience of LitGPT while also addressing technical challenges associated with large language models.

Report On: Fetch commits



Repo Commits Analysis

Development Team and Recent Activity

Team Members:

  1. Sebastian Raschka (rasbt)

    • Recent Activity:
    • Added Chainlit Studio (3 days ago).
    • Simplified MPS support (6 days ago).
    • Enabled MPS support for LitGPT (7 days ago).
    • Multiple version bumps and minor fixes related to dependencies and precision settings (8 days ago).
    • Active in merging branches and collaborating with other members.
  2. Motsepe-Jr (challenger)

    • Recent Activity:
    • Fixed device error in Decode Stream (10 days ago).
  3. Jirka Borovec (Borda)

    • Recent Activity:
    • Minor README update/typos (11 days ago).
  4. apaz-cli (apaz)

    • Recent Activity:
    • Added batched_generate_fn() (14 days ago).
    • Contributed to various features including batched next token and sampling, with multiple merges and updates in the past month.
  5. Thomas Viehmann (t-vi)

    • Recent Activity:
    • Focused on testing improvements, particularly for macOS compatibility, with several commits in the last week.
  6. Sander Land (sanderland)

    • Recent Activity:
    • Updated check_nvlink_connectivity and contributed to utility functions (30 days ago).
  7. Andrei-Aksionov

    • Recent Activity:
    • Worked on disabling attention masks and merging branches related to this feature (25-30 days ago).

Patterns and Themes:

  • Dominance of Sebastian Raschka: He is the most active contributor, focusing on major features, bug fixes, and version management.
  • Collaboration: Frequent co-authorship among team members indicates a collaborative environment, especially between Raschka and apaz.
  • Testing Focus: Recent activity shows an emphasis on improving tests, particularly for macOS compatibility, indicating a push towards stability.
  • Feature Development: The team is actively developing new features like MPS support and batched generation functions, which are crucial for performance enhancements.
  • Version Management: Regular version bumps suggest ongoing maintenance and readiness for releases.
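The batched-generation work noted above (batched_generate_fn(), batched next token and sampling) ultimately reduces to choosing one token per sequence from a (batch, vocab) logits matrix. A minimal sketch of that step, not the actual LitGPT function:

```python
import numpy as np

def batched_next_token(logits, temperature=0.0, rng=None):
    """Pick the next token id for every sequence in the batch.

    logits: (batch, vocab) array. temperature == 0 means greedy argmax;
    otherwise sample from the softmax distribution at that temperature.
    """
    if temperature == 0.0:
        return logits.argmax(axis=-1)
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    scaled -= scaled.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.array([rng.choice(len(p), p=p) for p in probs])

batch_logits = np.array([[0.1, 2.0, 0.3],
                         [1.5, 0.2, 0.9]])
tokens = batched_next_token(batch_logits)   # greedy: [1, 0]
```

Doing this once per step for the whole batch, instead of looping over sequences, is the core of the throughput gain such functions target.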

Conclusions:

The development team is actively engaged in enhancing the LitGPT framework through collaborative efforts, focusing on both feature development and stability improvements. The leadership of Sebastian Raschka is evident in the volume of contributions, while other team members support specific areas such as testing and documentation. Overall, the team's recent activities reflect a strong commitment to advancing the project effectively.