The Dispatch

OSS Report: LLaVA-VL/LLaVA-NeXT


LLaVA-NeXT Faces Critical Issues with Model Reproducibility and File Size Mismatches

LLaVA-NeXT, an open-source project aimed at advancing multimodal AI models for image, video, and text understanding, is currently grappling with significant technical challenges that threaten its reliability and usability.

Recent Activity

The repository has been active with 123 open issues, highlighting critical concerns such as reproducibility failures (#178) and size mismatch errors in model files (#177). These issues suggest inconsistencies in model documentation or configuration that could undermine user trust. Additionally, the community's interest in forming real-time support groups (#179) indicates a growing user base seeking structured guidance.

Development Team and Recent Activity

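  1. Li Bo (Luodian)

    • Merged pull requests for the LLaVA OneVision model and README updates; refactored image size handling and conversation logic.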
  2. ChunyuanLI

    • Minor updates to README.md for clarity.
  3. Kaichen Zhang (kcz358)

    • Fixed tutorial errors; collaborated on video processing logic.
  4. Yuanhan Zhang (ZhangYuanhan-AI)

    • Enhanced video processing logic in documentation.
  5. Renrui Zhang (ZrrSkywalker)

    • Updated README.md.
  6. Aryeh Hillman (abhillman)

    • Removed redundant dependencies.

Of Note

  1. Reproducibility Concerns: High-priority issue #178 highlights discrepancies in benchmark results, raising questions about the project's reliability.
  2. Model File Errors: Frequent reports of size mismatches (#177) suggest potential documentation or configuration issues.
  3. Community Engagement: The suggestion to create a WeChat group (#179) reflects a need for better real-time support.
  4. Documentation Focus: Recent commits emphasize improving README files, indicating a priority on user guidance.
  5. Collaborative Efforts: Strong collaboration between team members on tutorial content and video processing enhancements suggests a cohesive development approach.

Quantified Reports

Quantify Issues



Recent GitHub Issues Activity

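| Timespan | Opened | Closed | Comments | Labeled | Milestones |
|----------|--------|--------|----------|---------|------------|
| 7 Days   | 21     | 7      | 13       | 21      | 1          |
| 30 Days  | 63     | 14     | 99       | 62      | 1          |
| 90 Days  | 130    | 31     | 268      | 128     | 1          |
| All Time | 161    | 38     | -        | -       | -          |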

Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.
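As a rough illustration of how the table can be read, the sketch below derives the net backlog growth and the closed-to-opened ratio for each timespan. The function names are hypothetical, and the calculation assumes that "Closed" counts issues closed during the same timespan, which the report does not state explicitly.

```python
# Figures copied from the "Recent GitHub Issues Activity" table:
# timespan -> (opened, closed)
ISSUE_ACTIVITY = {
    "7 Days": (21, 7),
    "30 Days": (63, 14),
    "90 Days": (130, 31),
    "All Time": (161, 38),
}


def backlog_growth(opened: int, closed: int) -> int:
    """Net change in open issues: opened minus closed."""
    return opened - closed


def close_ratio(opened: int, closed: int) -> float:
    """Closed issues as a fraction of opened issues (0.0 if none were opened)."""
    return closed / opened if opened else 0.0


for span, (opened, closed) in ISSUE_ACTIVITY.items():
    growth = backlog_growth(opened, closed)
    ratio = close_ratio(opened, closed)
    print(f"{span:>8}: backlog {growth:+d}, close ratio {ratio:.1%}")
```

For the All Time row this gives 161 - 38 = 123, which matches the 123 open issues cited elsewhere in the report.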

Quantify Commits



Quantified Commit Activity Over 30 Days

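| Developer                         | Branches | PRs   | Commits | Files | Changes |
|-----------------------------------|----------|-------|---------|-------|---------|
| Li Bo                             | 1        | 2/2/0 | 32      | 84    | 7655    |
| Yuanhan Zhang                     | 2        | 0/0/0 | 9       | 11    | 460     |
| Kaichen Zhang - NTU               | 1        | 1/1/0 | 7       | 6     | 124     |
| ChunyuanLI                        | 2        | 3/2/0 | 3       | 1     | 29      |
| Aryeh Hillman (abhillman)         | 1        | 1/1/0 | 1       | 1     | 3       |
| Renrui Zhang                      | 1        | 0/0/0 | 1       | 1     | 2       |
| Xiaodong Wang (Wang-Xiaodong1899) | 0        | 1/0/0 | 0       | 0     | 0       |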

PRs: created by that dev and opened/merged/closed-unmerged during the period
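To make the PRs column and the concentration of work easier to see, here is a small sketch that parses the opened/merged/closed-unmerged field and computes each developer's share of the Changes column. The type and function names are illustrative assumptions; the numbers are copied from the table.

```python
from typing import Dict, List, NamedTuple, Tuple


class DevActivity(NamedTuple):
    name: str
    prs: str       # "opened/merged/closed-unmerged", as in the table
    commits: int
    files: int
    changes: int


# Rows copied from the 30-day commit activity table.
ROWS: List[DevActivity] = [
    DevActivity("Li Bo", "2/2/0", 32, 84, 7655),
    DevActivity("Yuanhan Zhang", "0/0/0", 9, 11, 460),
    DevActivity("Kaichen Zhang - NTU", "1/1/0", 7, 6, 124),
    DevActivity("ChunyuanLI", "3/2/0", 3, 1, 29),
    DevActivity("Aryeh Hillman (abhillman)", "1/1/0", 1, 1, 3),
    DevActivity("Renrui Zhang", "0/0/0", 1, 1, 2),
    DevActivity("Xiaodong Wang (Wang-Xiaodong1899)", "1/0/0", 0, 0, 0),
]


def parse_pr_field(field: str) -> Tuple[int, int, int]:
    """Split 'opened/merged/closed-unmerged' into three integers."""
    opened, merged, closed_unmerged = (int(part) for part in field.split("/"))
    return opened, merged, closed_unmerged


def change_shares(rows: List[DevActivity]) -> Dict[str, float]:
    """Each developer's fraction of the summed Changes column."""
    total = sum(row.changes for row in rows)
    return {row.name: (row.changes / total if total else 0.0) for row in rows}


for name, share in change_shares(ROWS).items():
    print(f"{name}: {share:.1%} of changes")
```

On these figures, a single developer (Li Bo) accounts for roughly 92.5% of the Changes column over the 30-day window.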

Detailed Reports

Report On: Fetch issues



Recent Activity Analysis

The LLaVA-NeXT GitHub repository has seen significant activity recently, with 123 open issues and a notable influx of new discussions surrounding model performance and usage. Several issues highlight critical problems, such as failures to reproduce results from the original paper and size mismatch errors in model files. Themes of community engagement are evident, particularly in requests for clarification on model configurations and training data.

A few issues stand out due to their implications for the project's reliability and usability. For instance, the frequent reports of size mismatches and installation errors suggest potential inconsistencies in the model files or documentation. Additionally, the emergence of community-driven discussions about creating chat groups for real-time support indicates a growing user base that may require more structured guidance.

Issue Details

Recently Created Issues

  1. #179: Community Chatting Group

    • Priority: Low
    • Status: Open
    • Created: 0 days ago
    • Comments: Suggests creating a WeChat group for real-time discussions among users.
  2. #178: Failure to reproduce the paper results

    • Priority: High
    • Status: Open
    • Created: 1 day ago
    • Comments: Reports discrepancies in benchmark results after cloning datasets from Huggingface.
  3. #177: Size Mismatch Issue in ‘mm_projector.bin’ for ‘llava-onevision-qwen2-0.5b-ov’ Model

    • Priority: High
    • Status: Open
    • Created: 2 days ago
    • Comments: User encounters runtime errors related to size mismatches when running fine-tuning scripts.
  4. #175: Resources required for training with 72b language models

    • Priority: Medium
    • Status: Open
    • Created: 2 days ago
    • Comments: Inquires about GPU requirements for training with large language models.
  5. #174: Cauldron dataset

    • Priority: Low
    • Status: Open
    • Created: 3 days ago
    • Comments: Seeks clarification on datasets labeled with "cauldron" in the released dataset.
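Size mismatch errors such as the one reported in #177 generally mean that a tensor stored in a checkpoint has a different shape from the parameter the instantiated model expects. The sketch below shows one way to locate such a mismatch before loading. It uses plain tuples in place of real tensors so that it runs without PyTorch, and the parameter names and shapes are hypothetical placeholders, not values taken from the issue.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(expected: Dict[str, Shape],
                          loaded: Dict[str, Shape]) -> List[str]:
    """Compare parameter shapes from a model and a checkpoint.

    Returns one human-readable message per missing, unexpected,
    or differently shaped parameter.
    """
    problems: List[str] = []
    for name, exp_shape in expected.items():
        if name not in loaded:
            problems.append(f"missing in checkpoint: {name}")
        elif loaded[name] != exp_shape:
            problems.append(
                f"size mismatch for {name}: "
                f"checkpoint {loaded[name]} vs model {exp_shape}"
            )
    for name in loaded:
        if name not in expected:
            problems.append(f"unexpected in checkpoint: {name}")
    return problems


# Hypothetical shapes, for illustration only.
model_shapes = {"mm_projector.0.weight": (896, 1152),
                "mm_projector.0.bias": (896,)}
checkpoint_shapes = {"mm_projector.0.weight": (3584, 1152),
                     "mm_projector.0.bias": (896,)}

for message in find_shape_mismatches(model_shapes, checkpoint_shapes):
    print(message)
```

With PyTorch, the two dictionaries could be built from state dicts, for example `{k: tuple(v.shape) for k, v in state_dict.items()}`. A mismatch in the projector's output dimension would typically point to a projector file trained for a language model of a different size, though the issue itself does not confirm the cause.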

Recently Updated Issues

  1. #172: Llava-Next-OV FPS

    • Priority: Medium
    • Status: Open (Edited)
    • Updated: 2 days ago
    • Comments: User finds FPS information in the paper and seeks confirmation on inference settings.
  2. #171: The effectiveness of Stage-1.5

    • Priority: Medium
    • Status: Open (Edited)
    • Updated: 2 days ago
    • Comments: Questions the impact of Stage-1.5 training on final metrics.
  3. #170: Update of the cli.py script

    • Priority: Low
    • Status: Open (Edited)
    • Updated: 2 days ago
    • Comments: Requests updates to the CLI script to reflect new model configurations.
  4. #169: Batch inference for LLaVA One Vision

    • Priority: Medium
    • Status: Open (Edited)
    • Updated: 2 days ago
    • Comments: Inquires about feasible methods for conducting batch inference.
  5. #168: UReader data mismatch with images

    • Priority: High
    • Status: Open (Edited)
    • Updated: 2 days ago
    • Comments: Reports issues with mismatched data between UReader and corresponding images.

Summary

The recent activity on LLaVA-NeXT's GitHub repository indicates a vibrant community grappling with various technical challenges related to model performance and configuration. The issues raised reflect both critical bugs that could hinder usability and broader inquiries into the project's architecture and data handling practices. Addressing these concerns promptly will be essential for maintaining user trust and project momentum.

Report On: Fetch pull requests



Overview

The LLaVA-NeXT project currently has 9 open pull requests and 10 closed pull requests, reflecting ongoing development and community engagement in enhancing its multimodal capabilities. The pull requests cover a range of updates, bug fixes, and documentation improvements.

Summary of Pull Requests

Open Pull Requests

  • PR #160: Update README.md
    Created 6 days ago by Chunyuan LI. This PR adds 12 lines to the README file, improving documentation clarity.

  • PR #136: fix prepare_inputs_labels_for_multimodal in llava_arch
    Created 11 days ago by Xiaodong Wang. This PR addresses a bug that caused repeated additions of image_feature when image_idx was not in video_idx_in_batch.

  • PR #84: Fix prepare inputs labels for multimodal
    Created 56 days ago by Khai Mai. This PR adds assertions to ensure the number of images matches the number of image tokens and fixes a bug related to handling zero images in input.

  • PR #75: Update LLaVA-NeXT.md - Typo (missing letter)
    Created 58 days ago by Tayyib Chohan. A minor typo fix in documentation.

  • PR #73: Make some ad-hoc changes to use the interleave model
    Created 58 days ago by Haruki Sakajo. This PR includes changes to utilize the interleave model, with minor code adjustments.

  • PR #65: Features update
    Created 64 days ago by Socialstranger. This PR introduces several new features, including new HTML files.

  • PR #40: Samples
    Created 89 days ago by Alastair D'Silva. This PR adds sample scripts and a Gradio UI, along with several bug fixes.

  • PR #34: fix: Conversation.copy()
    Created 93 days ago by Yuki Imajuku. This PR improves the copy() function but is noted as potentially redundant.

  • PR #23: Fixed Prompt formatting in conversation.py
    Created 98 days ago by KT313. This PR corrects prompt formatting issues that led to duplicate tokens.
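The kind of check described for PR #84 (the number of images must match the number of image tokens, including the zero-image case) can be sketched as follows. This is not the code from the pull request: the function names and the sentinel value `IMAGE_PLACEHOLDER_ID` are assumptions made for illustration.

```python
from typing import Sequence

# Hypothetical sentinel id marking where an image is spliced into the prompt.
IMAGE_PLACEHOLDER_ID = -200


def count_image_placeholders(input_ids: Sequence[int]) -> int:
    """Count image placeholder tokens in a tokenized prompt."""
    return sum(1 for token in input_ids if token == IMAGE_PLACEHOLDER_ID)


def check_image_token_count(input_ids: Sequence[int],
                            images: Sequence[object]) -> int:
    """Raise ValueError unless placeholders and images match one-to-one.

    A text-only prompt (no placeholders, no images) is valid and returns 0.
    """
    n_placeholders = count_image_placeholders(input_ids)
    if n_placeholders != len(images):
        raise ValueError(
            f"prompt has {n_placeholders} image placeholder(s) "
            f"but {len(images)} image(s) were provided"
        )
    return n_placeholders


# Text-only input passes with zero images.
check_image_token_count([101, 2023, 102], images=[])

# Two placeholders require exactly two images.
check_image_token_count(
    [101, IMAGE_PLACEHOLDER_ID, 2023, IMAGE_PLACEHOLDER_ID],
    images=["img_a", "img_b"],
)
```

Failing early with an explicit message is usually easier to debug than a shape error raised later, when image features are merged into the token embeddings.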

Closed Pull Requests

  • PR #180: Update LLaVA OneVision model to lmms-lab/llava-onevision-qwen2-7b-ov
    Merged recently, this PR updates the OneVision model documentation.

  • PR #163: Update README.md
    Merged recently, this PR includes minor updates to the README file.

  • PR #161: Update README.md
    Merged recently, this PR also contains updates to the README file with more substantial changes than PR #163.

  • PR #152: Provide the correct video processing logic with decord
    Merged recently, this PR enhances video processing logic in tutorials.

  • PR #134: fix imports of missing and deprecated qwen-moe
    Merged recently, this PR resolves import issues related to deprecated models.

  • PR #112: Remove Redundant sentencepiece Dep
    Merged recently, this PR removes unnecessary dependencies from the project configuration.

  • PR #67: add multi-image inference
    Merged recently, this significant update introduces multi-image inference capabilities.

  • PR #29 & PR #27: fix class not found error
    Both were closed without merging due to being deemed unnecessary after discussion among contributors.

  • PR #1: Update README.md
    Closed as trivial; the contributor was flagged for seeking attention through minor edits across various repositories.
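The report does not describe what the video processing logic in PR #152 actually does. As a generic illustration of one common step in video pipelines for multimodal models, the sketch below picks evenly spaced frame indices from a clip. The function name and the sampling strategy are assumptions, not the logic merged in that pull request.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Return up to num_samples evenly spaced frame indices.

    The first and last frames are always included when more than one
    frame is sampled. If the clip has fewer frames than requested,
    every frame index is returned.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        return list(range(total_frames))
    if num_samples == 1:
        return [0]
    # Integer arithmetic keeps the result deterministic and strictly increasing.
    last = total_frames - 1
    return [(i * last) // (num_samples - 1) for i in range(num_samples)]


print(uniform_frame_indices(100, 5))
print(uniform_frame_indices(3, 8))
```

With decord, indices like these could be passed to `VideoReader.get_batch` to decode only the selected frames, with `len(video_reader)` supplying `total_frames`.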

Analysis of Pull Requests

The current state of pull requests for the LLaVA-NeXT project indicates an active development environment with a strong focus on both functionality and documentation. The open pull requests largely revolve around fixing bugs related to multimodal input handling (e.g., PRs #136 and #84), which suggests an ongoing effort to refine the model's capability to process diverse data types effectively. The emphasis on ensuring that inputs are correctly matched with their corresponding features highlights a commitment to robust functionality within the multimodal framework.

Documentation improvements are also prevalent, as seen in multiple recent pull requests (#160, #163, and others) aimed at enhancing clarity and usability for end-users. Such efforts are crucial for community engagement, especially given that LLaVA-NeXT is an open-source project that relies on contributions from users who may not be deeply familiar with its architecture or intended use cases.

Notably, there is a mix of minor corrections (like typos) alongside substantial feature additions (e.g., multi-image inference in PR #67). This blend indicates a healthy balance between maintaining existing code quality and pushing forward with innovative features that could enhance user experience and application versatility.

However, there are some anomalies worth noting. Two pull requests (#29 and #27) were closed without merging after discussion among contributors indicated they were unnecessary. This suggests potential communication gaps within the contributor community about which changes are essential and which are trivial or already addressed elsewhere. Additionally, PR #1 was closed after its author was flagged for making trivial edits across multiple repositories, which raises concerns about genuine contributions versus opportunistic behavior aimed at gaining visibility within popular projects.

Review throughput on community contributions may also be a bottleneck, possibly reflecting resource constraints within the development team. Seven of the nine open pull requests are more than 50 days old, and the oldest (#23) has been open for 98 days. It may be beneficial for maintainers to review, merge, or explicitly close pull requests more systematically to keep momentum going and encourage further contributions from the community.

In conclusion, while LLaVA-NeXT demonstrates robust activity in terms of feature development and community engagement through pull requests, attention should be paid to ensuring effective communication among contributors and maintaining a clear focus on meaningful enhancements rather than trivial edits. Addressing these aspects will be vital for sustaining growth and innovation within this promising multimodal AI project.

Report On: Fetch commits



Repo Commits Analysis

Development Team and Recent Activity

Team Members and Recent Contributions

  1. Li Bo (Luodian)

    • Recent Activity:
      • Merged pull requests related to updates on the LLaVA OneVision model and README documentation.
      • Made significant contributions to the LLaVA_OneVision_Tutorials.ipynb file, including fixing tutorial errors and updating video processing logic.
      • Engaged in refactoring efforts to improve image size handling and conversation logic across various scripts.
      • Active in updating README files across multiple directories, enhancing documentation clarity.
    • Collaboration: Frequently collaborated with Kaichen Zhang on tutorial updates and Yuanhan Zhang on video code improvements.
  2. ChunyuanLI

    • Recent Activity:
      • Contributed minor updates to the README.md, focusing on documentation clarity.
    • Collaboration: Worked alongside Li Bo for merging changes related to README updates.
  3. Kaichen Zhang (kcz358)

    • Recent Activity:
      • Contributed to fixing errors in the tutorials and updating the LLaVA_OneVision.md documentation.
      • Collaborated with Li Bo on providing correct video processing logic in tutorials.
    • Collaboration: Actively worked with Li Bo on tutorial-related changes.
  4. Yuanhan Zhang (ZhangYuanhan-AI)

    • Recent Activity:
      • Made multiple updates to video-related documentation and code, including enhancements in video processing logic.
      • Engaged in significant updates to the README.md files across various branches.
    • Collaboration: Collaborated with Li Bo on video code enhancements and documentation.
  5. Renrui Zhang (ZrrSkywalker)

    • Recent Activity:
      • Contributed a single commit updating the README.md file.
    • Collaboration: Limited collaboration noted; primarily focused on documentation.
  6. Aryeh Hillman (abhillman)

    • Recent Activity:
      • Removed redundant dependencies from the project configuration files.
    • Collaboration: Collaborated with Li Bo on dependency management.

Patterns and Themes

  • Documentation Focus: A significant amount of recent activity revolves around enhancing documentation, particularly the README files, indicating a priority for user guidance and clarity.
  • Collaborative Efforts: There is a strong collaborative environment, especially between Li Bo, Kaichen Zhang, and Yuanhan Zhang, focusing on improving tutorial content and video processing capabilities.
  • Refactoring Initiatives: The team is actively engaged in refactoring efforts aimed at improving code maintainability and performance, particularly concerning image processing and multimodal functionalities.
  • Feature Enhancements: Recent commits reflect ongoing enhancements to multimodal capabilities, particularly in video understanding, aligning with the project's goals of advancing state-of-the-art performance.

Conclusions

The development team is actively engaged in both feature development and maintenance tasks, with a clear emphasis on improving documentation and collaborative efforts. The focus on multimodal capabilities suggests a commitment to advancing the project's objectives within the AI research community.