OSS Report: LLaVA-VL/LLaVA-NeXT

Sept. 22, 2024, 1:30 p.m. UTC. This report was generated by Dispatch AI.

LLaVA-NeXT Development Focuses on Enhancing Video Processing and Documentation

LLaVA-NeXT, a framework for integrating language and vision capabilities, continues to refine its multimodal functionalities with a focus on video processing and user documentation improvements.

Recent Activity

Recent issues and pull requests indicate a strong emphasis on resolving model performance discrepancies (#254) and improving documentation clarity (#256). The development team is actively addressing technical errors related to model loading (#248) and missing dependencies (#249), reflecting ongoing efforts to streamline the user experience.

Development Team and Recent Activity

Li Bo (Luodian)
- Updated README.md (3 days ago).
- Merged PR #205 for video inference logic (5 days ago).
ChunyuanLI
- Updated release dates in README.md (7 days ago).
Tianyi Xiong (tyxiong23)
- Added DPO training scripts to LLaVA_OneVision_Chat.md (7 days ago).
Yuanhan Zhang (ZhangYuanhan-AI)
- Refactored video loading function (3 days ago).
Kaichen Zhang (kcz358)
- Merged PRs related to video processing (19 days ago).
Nguyen-Quang-Trung (ngquangtrung57)
- Contributed safe load tokenizer for llama_3 (23 days ago).
Raushan Turganbay (zucchini-nlp)
- Updated demo files in tutorial notebook (23 days ago).

The team is actively collaborating on documentation and video processing enhancements, indicating a cohesive strategy towards improving multimodal capabilities.

Of Note

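Model Performance Discrepancies: Issue #254 reports that the model outputs "!!!!" with the original attention implementation but performs well with flash_attention_2 or SDPA, so output consistency across attention backends needs attention.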
Documentation Clarity: Frequent updates suggest a concerted effort to enhance user guidance, particularly around new features.
Video Processing Enhancements: Significant contributions towards refining video-related functionalities reflect ongoing improvements.
Collaboration Patterns: Team members frequently collaborate on documentation and feature enhancements, showcasing effective teamwork.
Active Development Cycle: The continuous stream of commits and PRs indicates a vibrant development process focused on iterative improvements.

Quantified Reports

Quantify Issues

Recent GitHub Issues Activity

Timespan	Opened	Closed	Comments	Labeled	Milestones
7 Days	17	4	8	17	1
30 Days	67	17	99	66	1
90 Days	168	42	349	165	1
All Time	228	56	-	-	-

Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.
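
As a rough cross-check of the table above, the sketch below recomputes the closed-to-opened ratio for each timespan and the net difference between opened and closed issues. The counts are copied from the table; the function and variable names are illustrative and not part of any tool used to produce this report. For the All Time row, the net difference matches the 172 open issues cited in the detailed issue report.

```python
# Issue counts copied from the table above: (opened, closed) per timespan.
ISSUE_COUNTS = {
    "7 Days": (17, 4),
    "30 Days": (67, 17),
    "90 Days": (168, 42),
    "All Time": (228, 56),
}


def close_ratio(opened: int, closed: int) -> float:
    """Return closed issues as a fraction of opened issues (0.0 if none opened)."""
    return closed / opened if opened else 0.0


def net_open_change(opened: int, closed: int) -> int:
    """Return opened minus closed, i.e. the net growth in open issues."""
    return opened - closed


if __name__ == "__main__":
    for span, (opened, closed) in ISSUE_COUNTS.items():
        ratio = close_ratio(opened, closed)
        delta = net_open_change(opened, closed)
        print(f"{span}: {ratio:.1%} closed relative to opened, net +{delta} open")
```

In every timespan roughly one issue is closed for every four opened, which is why the open-issue backlog keeps growing.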

Quantify commits

Quantified Commit Activity Over 30 Days

Developer	Branches	PRs	Commits	Files	Changes
Yuanhan Zhang	2	2/2/0	5	9	746
Li Bo	1	1/1/0	12	10	537
Tianyi Xiong	1	4/3/1	19	8	359
ChunyuanLI	1	0/0/0	6	2	26
Kaichen Zhang - NTU	1	1/1/0	2	2	16
Nguyen-Quang-Trung	1	1/1/0	1	1	10
Raushan Turganbay	1	1/1/0	1	1	3
None (NarekN7)	0	1/0/0	0	0	0
None (litianjian)	0	1/0/0	0	0	0
None (TayyibChohan)	0	0/1/0	0	0	0
Xiaodong Wang (Wang-Xiaodong1899)	0	0/0/1	0	0	0

PRs: created by that dev and opened/merged/closed-unmerged during the period.
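
The PRs column packs three counts into one field (opened/merged/closed-unmerged). A minimal sketch of how that field could be parsed and totalled is shown below; the values are copied from the table, while `PRCounts`, `parse_pr_field`, and `total_pr_counts` are hypothetical helper names introduced only for illustration.

```python
from typing import Dict, NamedTuple


class PRCounts(NamedTuple):
    opened: int
    merged: int
    closed_unmerged: int


def parse_pr_field(field: str) -> PRCounts:
    """Parse an 'opened/merged/closed-unmerged' string such as '4/3/1'."""
    parts = field.strip().split("/")
    if len(parts) != 3:
        raise ValueError(f"expected three '/'-separated counts, got {field!r}")
    opened, merged, closed_unmerged = (int(p) for p in parts)
    return PRCounts(opened, merged, closed_unmerged)


# PR fields copied from the table above, keyed by developer name.
PR_FIELDS: Dict[str, str] = {
    "Yuanhan Zhang": "2/2/0",
    "Li Bo": "1/1/0",
    "Tianyi Xiong": "4/3/1",
    "ChunyuanLI": "0/0/0",
    "Kaichen Zhang - NTU": "1/1/0",
    "Nguyen-Quang-Trung": "1/1/0",
    "Raushan Turganbay": "1/1/0",
    "None (NarekN7)": "1/0/0",
    "None (litianjian)": "1/0/0",
    "None (TayyibChohan)": "0/1/0",
    "Xiaodong Wang (Wang-Xiaodong1899)": "0/0/1",
}


def total_pr_counts(fields: Dict[str, str]) -> PRCounts:
    """Sum each of the three counts across all developers."""
    parsed = [parse_pr_field(f) for f in fields.values()]
    return PRCounts(
        opened=sum(p.opened for p in parsed),
        merged=sum(p.merged for p in parsed),
        closed_unmerged=sum(p.closed_unmerged for p in parsed),
    )


if __name__ == "__main__":
    print(total_pr_counts(PR_FIELDS))
```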

Detailed Reports

Report On: Fetch issues

Recent Activity Analysis

The LLaVA-NeXT project currently has 172 open issues, with recent activity indicating a steady stream of inquiries and bug reports. Notably, several issues revolve around model performance discrepancies and configuration challenges, reflecting the complexity of integrating multimodal capabilities.

Common themes include confusion regarding model parameters, particularly in relation to different versions (e.g., 0.5B vs. 7B models), and requests for clarification on training data and evaluation metrics. There is also a significant focus on resolving technical errors related to model loading and inference.

Issue Details

Recent Issues

Issue #257: Does LLaVA-NeXT support 336x336 image inputs, like LLaVA-1.5?
- Priority: Low
- Status: Open
- Created: 0 days ago
- Updated: N/A
Issue #256: What is the purpose of the three sh files in script/interleave since we can evaluate using lmms-eval?
- Priority: Medium
- Status: Open
- Created: 1 day ago
- Updated: N/A
Issue #255: Video/Image Processing (padding, channel order)
- Priority: Medium
- Status: Open
- Created: 2 days ago
- Updated: N/A
Issue #254: Model performs well when using flash_attention_2 or SDPA, but outputs "!!!!" when using the original attention.
- Priority: High
- Status: Open
- Created: 2 days ago
- Updated: 1 day ago
Issue #253: How to merge LoRA fine-tuned model with base model?
- Priority: Medium
- Status: Open
- Created: 3 days ago
- Updated: N/A
Issue #249: dpo_ov7b.sh imports data_processing which is missing.
- Priority: High
- Status: Open
- Created: 5 days ago
- Updated: N/A
Issue #248: Running the eval example script for Llava-next-video reports an error.
- Priority: High
- Status: Open
- Created: 5 days ago
- Updated: N/A
Issue #247: 3 PyTorch allocator cache flushes since last step.
- Priority: Low
- Status: Open
- Created: 5 days ago
- Updated: N/A
Issues #245, #244, #243, #242, #240, #239, #238, #234, #233, #232, #231, #230, #229, #227, #226, #224, #223, #221, #220, #219, #218, #217, #216, #215, #214, #213, #212, #211, #210...: Multiple issues related to community discussions, feature requests, and minor bugs.

Analysis of Notable Issues

Several issues highlight critical areas of concern:

The discrepancy in model output between attention implementations (Issue #254: correct results with flash_attention_2 or SDPA, but "!!!!" with the original attention) suggests a potential underlying bug or configuration mismatch that needs addressing.
The frequent inquiries about the purpose of specific scripts (e.g., Issue #256) indicate a need for clearer documentation regarding the project's structure and usage.
Missing-module errors (e.g., Issue #249, where dpo_ov7b.sh imports a data_processing module that is missing) point to potential gaps in setup instructions or package management.
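
The report does not identify the cause of Issue #254. One generic way to investigate this kind of discrepancy is to run two implementations of the same attention computation on identical inputs and compare the results numerically. The sketch below does this in pure Python for single-head scaled dot-product attention, using a naive softmax and a max-subtracted (numerically stable) softmax as the two variants. It is an illustration of the comparison technique only, not LLaVA-NeXT code, and every name in it is hypothetical.

```python
import math
from typing import Callable, List

Vector = List[float]
Matrix = List[Vector]


def softmax_naive(row: Vector) -> Vector:
    """Textbook softmax; can overflow when scores are very large."""
    exps = [math.exp(x) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_stable(row: Vector) -> Vector:
    """Softmax with the row maximum subtracted first to avoid overflow."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix,
              softmax: Callable[[Vector], Vector]) -> Matrix:
    """Single-head attention: softmax(Q K^T / sqrt(d)) V."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out: Matrix = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return out


def max_abs_diff(a: Matrix, b: Matrix) -> float:
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy inputs chosen only for illustration.
Q: Matrix = [[1.0, 0.0], [0.0, 1.0]]
K: Matrix = [[1.0, 2.0], [3.0, 4.0]]
V: Matrix = [[0.5, 1.5], [2.5, 3.5]]

if __name__ == "__main__":
    reference = attention(Q, K, V, softmax_stable)
    candidate = attention(Q, K, V, softmax_naive)
    print("max abs difference:", max_abs_diff(reference, candidate))
```

On well-scaled inputs the two variants should agree to within floating-point tolerance. In a real model, a large difference or non-finite values from one attention path would narrow the search to that path.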

Conclusion

The ongoing activity within the LLaVA-NeXT repository reflects a vibrant community engaged in troubleshooting and enhancing the multimodal capabilities of the framework. The concentration of issues around model performance and configuration suggests areas for improvement in documentation and user guidance, which could facilitate smoother user experiences moving forward.

Report On: Fetch pull requests

Overview

The LLaVA-NeXT project has a series of active and closed pull requests that reflect ongoing development and maintenance efforts. The open pull requests focus on enhancing functionality, fixing bugs, and improving documentation, while the closed pull requests indicate a history of active contributions and iterative improvements.

Summary of Pull Requests

Open Pull Requests

PR #252: Redesigning prompt
- Focuses on adding inference scripts for LLaVA models.
- Introduces new notebooks for model inference.
PR #250: Fix typos
- A minor fix addressing typographical errors in the codebase.
PR #160: Update README.md
- Updates to the README file, likely for clarity or additional information.
PR #84: Fix prepare inputs labels for multimodal
- Addresses input preparation for multimodal tasks, ensuring correct handling of cases with no images.
PR #73: Make some ad-hoc changes to use the interleave model
- Implements changes to integrate the interleave model into the existing framework.
PR #65: Features update
- Introduces new features or updates existing ones; the PR provides few details.
PR #40: Samples
- Adds sample scripts and a Gradio UI for better demonstration and usability.
PR #34: fix: Conversation.copy()
- Improves the copy() function in conversation handling.
PR #23: Fixed Prompt formatting in conversation.py
- Fixes prompt formatting issues to avoid duplicate tokens.

Closed Pull Requests

PR #241, #237, #236, #235: Documentation updates
- These PRs focus on updating documentation related to LLaVA-OneVision-Chat, including training scripts and contributor lists.
PR #228: Revert "Fix: videos in LLaVa-OV"
- Reverts a previous change related to video handling in tutorials.
PR #205, #198, #195, #183: Video processing updates
- These PRs involve updates to video processing logic, including inference logic and tokenizer loading safety.
PR #180: Update LLaVA OneVision model to lmms-lab/llava-onevision-qwen2-7b-ov
- Updates the model version used within the project.

Analysis of Pull Requests

The open pull requests indicate a strong focus on enhancing functionality and fixing bugs within the LLaVA-NeXT framework. The presence of PRs like #252 and #84 suggests ongoing efforts to improve model inference capabilities and handle multimodal inputs more effectively. PRs addressing typos (#250) and documentation updates (#160) reflect an emphasis on maintaining code quality and providing clear guidance to users.

Closed pull requests reveal a history of active development with a mix of feature enhancements (#205, #198) and maintenance tasks (#241, #237). The reversion in PR #228 shows that changes are rolled back when problems surface after merging. The updates related to video processing (#195, #183) suggest an ongoing effort to refine this aspect of the framework, which is central to its multimodal capabilities.

Overall, the pull request activity shows an active development process focused on incremental improvements to functionality, documentation, and user experience, with contributions from both core team members and outside contributors.

Report On: Fetch commits

Repo Commits Analysis

Development Team and Recent Activity

Team Members

Li Bo (Luodian)
- Recent Activity:
  - Updated README.md (3 days ago).
  - Merged PR #205 to update video inference logic (5 days ago).
  - Contributed to multiple updates in documentation and training scripts related to LLaVA-OneVision-Chat.
  - Collaborated with ChunyuanLI and Tianyi Xiong on documentation updates.
ChunyuanLI
- Recent Activity:
  - Updated release dates in README.md and contributed to the documentation for LLaVA-OneVision-Chat (7 days ago).
  - Collaborated with Li Bo on various documentation updates.
Tianyi Xiong (tyxiong23)
- Recent Activity:
  - Made numerous updates to LLaVA_OneVision_Chat.md, including adding DPO training scripts (7 days ago).
  - Collaborated with Li Bo and ChunyuanLI on documentation improvements.
Yuanhan Zhang (ZhangYuanhan-AI)
- Recent Activity:
  - Refactored video loading function and added new training scripts for video processing (3 days ago).
  - Worked on updating video inference logic and contributed significantly to video-related files.
Kaichen Zhang (kcz358)
- Recent Activity:
  - Merged PRs related to video processing and updated tutorials (19 days ago).
Nguyen-Quang-Trung (ngquangtrung57)
- Recent Activity:
  - Contributed a safe load tokenizer for llama_3 (23 days ago).
Raushan Turganbay (zucchini-nlp)
- Recent Activity:
  - Updated demo files in the tutorial notebook (23 days ago).

Summary of Activities

The team has been actively updating documentation, particularly for the LLaVA-OneVision-Chat feature, which suggests a focus on improving user experience and clarity.
Significant contributions have been made towards enhancing video processing capabilities, indicating an ongoing effort to refine multimodal functionalities.
Collaboration is evident between team members, especially among Li Bo, ChunyuanLI, and Tianyi Xiong, who frequently work together on documentation and feature enhancements.
The recent activities show a strong emphasis on merging pull requests that improve both functionality and documentation, reflecting a cohesive development strategy.

Patterns and Conclusions

Collaboration: There is a clear pattern of collaboration among team members, particularly in documentation efforts and feature development.
Focus Areas: The recent commits highlight a concentrated effort on improving video processing capabilities and enhancing user interaction through better documentation.
Active Development: The frequency of commits indicates an active development cycle with ongoing improvements being made across various components of the project.
Documentation Emphasis: A notable amount of activity is dedicated to updating documentation, which is crucial for user engagement and understanding of new features.

Overall, the development team is engaged in a productive cycle of enhancing both the functionality and usability of the LLaVA-NeXT project.