OSS Report: opendatalab/MinerU

Aug. 23, 2024, 12:30 a.m. UTC. This report was generated by Dispatch AI.

MinerU Project Sees Active Development with Focus on OCR and Docker Support

MinerU, an open-source tool for converting PDFs and other document formats into machine-readable data, has experienced active development over the past month, with significant attention on enhancing Optical Character Recognition (OCR) capabilities and Docker support.

Recent Activity

Recent issues and pull requests (PRs) indicate a strong focus on improving OCR accuracy and table extraction features. Users have reported challenges with table outputs and performance issues during GPU acceleration, suggesting a need for further optimization. Notable issues include #475 related to table recognition bugs and #474 concerning errors with OCR acceleration.

Development Team and Recent Activity

Xiaomeng Zhao (myhloli)
- Authored 96 commits, focusing on OCR processing bugs, Docker deployment, and documentation updates.
Kaiwen Liu (papayalove)
- Contributed 3 commits addressing table recognition bugs and new functionalities.
Aoyang Fang (Lincyaw)
- Added a Dockerfile for deployment; 1 commit.
Focusshang (sfk)
- Updated documentation; 22 commits.
xuchao
- Improved documentation; 18 commits.
icecraft
- Largest change volume of the period (11,714 changes across 45 files in 4 commits), focused on PDF extraction enhancements.
liukaiwen
- Worked on table recognition and OCR improvements; 11 commits.
zuanzuanshao
- Minor contributions; 1 commit.
nutshellfool
- Updated model download instructions; 2 commits.
conghui
- Documentation updates; 1 commit.
dt-yy
- Documentation updates; 1 commit.
eltociear
- Added Japanese README; 1 commit.
徐超
- Logo updates and documentation improvements; 2 commits.
quyuan
- Documentation updates; 1 commit.
yzztin
- Added new files for usage; 1 commit.

Of Note

The proposed addition of a docker-compose.yaml file in PR #467 (still open) would simplify deployment for users unfamiliar with Python environments.
Enhancements in OCR capabilities, as seen in PRs #463 and #458, align with MinerU's goal of high-quality data extraction from scientific documents.
Performance issues during GPU acceleration highlight the need for optimization when handling large or complex documents.
The community's active engagement in discussions about potential enhancements could lead to rapid improvements in future releases.
Localization efforts are evident, with multiple language-specific documentation updates reflecting a focus on internationalization.

The MinerU project is actively evolving, driven by a committed development team focused on refining its core functionalities and expanding its accessibility through enhanced deployment options like Docker.

Quantified Reports

Quantify Issues

Recent GitHub Issues Activity

Timespan	Opened	Closed	Comments	Labeled	Milestones
7 Days	31	18	73	5	1
30 Days	236	148	830	11	1
90 Days	294	184	1074	21	1
All Time	295	184	-	-	-

Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.
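The backlog figures in this table can be checked with simple arithmetic. The following is a minimal Python sketch with the counts copied from the table. The function names are illustrative, and the sketch assumes that the Closed column counts issues closed during each timespan, which the report does not state explicitly. The all-time difference (295 opened minus 184 closed) gives the 111 open issues cited in the issues analysis.

```python
# Issue counts copied from the table: (opened, closed) per timespan.
ISSUE_ACTIVITY = {
    "7 Days": (31, 18),
    "30 Days": (236, 148),
    "90 Days": (294, 184),
    "All Time": (295, 184),
}


def backlog_growth(opened: int, closed: int) -> int:
    """Net number of issues added to the open backlog in a timespan."""
    return opened - closed


def close_ratio(opened: int, closed: int) -> float:
    """Issues closed per issue opened in the timespan.

    This is only a rough throughput indicator: an issue closed in the
    window may have been opened before it (assumption, see lead-in).
    """
    return closed / opened


def summarize(activity: dict) -> dict:
    """Map each timespan to (net backlog growth, close ratio rounded to 2 places)."""
    return {
        span: (backlog_growth(o, c), round(close_ratio(o, c), 2))
        for span, (o, c) in activity.items()
    }


if __name__ == "__main__":
    for span, (net, ratio) in summarize(ISSUE_ACTIVITY).items():
        print(f"{span:>8}: backlog +{net}, close ratio {ratio}")
```

Read this way, roughly six issues were closed for every ten opened in each window, so the backlog grew over the whole period.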

Quantify commits

Quantified Commit Activity Over 30 Days

Developer	Branches	PRs	Commits	Files	Changes
icecraft	2	4/4/0	4	45	11714
xuchao	1	0/0/0	18	11	3516
Xiaomeng Zhao	2	10/10/0	96	34	1849
sfk	1	2/1/1	22	9	873
Ikko Eltociear Ashimine	1	1/1/0	1	3	310
liukaiwen	1	0/0/0	11	14	277
Kaiwen Liu	1	6/6/0	3	15	266
yzz	1	1/1/0	1	1	136
Aoyang Fang (Lincyaw)	1	0/1/0	1	1	45
github-actions[bot]	1	0/0/0	5	1	44
quyuan	1	0/0/0	1	2	33
Conghui He	1	0/0/0	1	1	7
drunkpig	2	2/2/0	3	2	7
yyy	1	1/1/0	1	1	6
徐超	1	0/0/0	2	4	6
Richard Li	1	2/2/0	2	2	6
ZuanZuan	1	1/1/0	1	1	2
Matthijs Zondervan (Matthijz98)	0	1/0/0	0	0	0
邱玉梓 (viket-vista)	0	1/0/0	0	0	0
Arpit Pathak (Thepathakarpit)	0	2/0/2	0	0	0

PRs: created by that dev and opened/merged/closed-unmerged during the period
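The PRs column packs three counts into one cell. As a minimal sketch (the type and function names are hypothetical, and only a few rows are copied from the table), the cell can be split into its parts, and the Changes column can be divided by Commits to show how differently the two measures rank contributors.

```python
from typing import NamedTuple, Optional


class PRCounts(NamedTuple):
    opened: int
    merged: int
    closed_unmerged: int


def parse_pr_field(field: str) -> PRCounts:
    """Parse an 'opened/merged/closed-unmerged' cell such as '2/1/1'.

    Merged can exceed opened (e.g. '0/1/0') when a PR opened before the
    period was merged during it.
    """
    parts = field.strip().split("/")
    if len(parts) != 3:
        raise ValueError(f"expected three '/'-separated counts, got {field!r}")
    opened, merged, closed_unmerged = (int(p) for p in parts)
    return PRCounts(opened, merged, closed_unmerged)


def changes_per_commit(commits: int, changes: int) -> Optional[float]:
    """Average changes per commit, or None when there were no commits."""
    return changes / commits if commits else None


# A few rows copied from the table: (developer, PRs, commits, changes).
ROWS = [
    ("icecraft", "4/4/0", 4, 11714),
    ("Xiaomeng Zhao", "10/10/0", 96, 1849),
    ("sfk", "2/1/1", 22, 873),
    ("Arpit Pathak (Thepathakarpit)", "2/0/2", 0, 0),
]

if __name__ == "__main__":
    for name, prs, commits, changes in ROWS:
        counts = parse_pr_field(prs)
        density = changes_per_commit(commits, changes)
        print(name, counts, density)
```

For example, icecraft averages 2928.5 changes per commit against about 19 for Xiaomeng Zhao, so neither column alone is a reliable measure of contribution.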

Detailed Reports

Report On: Fetch issues

Recent Activity Analysis

The MinerU GitHub repository has seen heavy recent activity: 236 issues were opened in the past 30 days, and 111 issues are currently open. Many of these are bug reports and enhancement requests, indicating ongoing development and user engagement. A recurring theme is the difficulty of accurately extracting tables and text from PDFs, particularly with OCR functionality.

Several issues have been reported regarding the extraction of tables, with users frequently noting that tables are being output as images rather than structured data. Additionally, there are multiple reports of performance issues, including out-of-memory errors and slow processing times when using GPU acceleration. This suggests that while the tool is powerful, it may require further optimization for handling large documents or complex layouts.

Issue Details

Here are some of the most recently created and updated issues:

Issue #476: Create a WeChat discussion group to make communication easier for everyone (Zh) / A Discord community group (En)
- Priority: Low
- Status: Open
- Created: 0 days ago
- Update: N/A
Issue #475: Bug occurs after enabling table recognition
- Priority: High
- Status: Open
- Created: 1 day ago
- Update: N/A
- Description: Bug related to table recognition settings in magic_pdf.json.
Issue #474: Error reported after enabling OCR acceleration
- Priority: High
- Status: Open
- Created: 1 day ago
- Update: N/A
- Description: Error encountered when enabling OCR acceleration.
Issue #473: Request to lift the python=3.10 version restriction, or to support higher versions.
- Priority: Medium
- Status: Open
- Created: 1 day ago
- Update: N/A
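Issue #470: In the provided example, annotations are ignored automatically; in general, annotations should be recognized too, just marked off with a distinguishing symbol.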
- Priority: Medium
- Status: Open
- Created: 2 days ago
- Update: N/A
Issue #469: Request for standalone deployment that can be called through an API
- Priority: Medium
- Status: Open
- Created: 2 days ago
- Update: Edited 1 day ago
Issue #468: [Question] Is this solution best for creating knowledge base for AI/LLM memory in your opinion?
- Priority: Low
- Status: Open
- Created: 2 days ago
- Update: N/A
Issue #466: Deployment in an offline environment fails with: Failed to resolve 'paddleocr.bj.bcebos.com'
- Priority: High
- Status: Open
- Created: 3 days ago
Issue #465: Landscape-oriented PDFs parse poorly; the layout of images and text is scrambled
- Priority: High
- Status: Open
- Created: 3 days ago
Issue #464: After completing local deployment, running the command produces an "illegal instruction" message
- Priority: High
- Status: Open
- Created: 3 days ago
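To make the priority mix of these ten issues easier to see, here is a small Python sketch that tallies and orders them. The issue numbers and priorities are taken from the list above as given in this report; the ordering rule (High first, then newest issue number first) is an illustrative assumption, not the project's triage policy.

```python
from collections import Counter

# (issue number, priority) pairs as listed in this report.
ISSUES = [
    (476, "Low"), (475, "High"), (474, "High"), (473, "Medium"),
    (470, "Medium"), (469, "Medium"), (468, "Low"),
    (466, "High"), (465, "High"), (464, "High"),
]

# Lower rank sorts first.
PRIORITY_RANK = {"High": 0, "Medium": 1, "Low": 2}


def tally(issues):
    """Count how many issues fall under each priority."""
    return Counter(priority for _, priority in issues)


def triage(issues):
    """Order issues by priority (High first), then by issue number, newest first."""
    return sorted(issues, key=lambda item: (PRIORITY_RANK[item[1]], -item[0]))


if __name__ == "__main__":
    print(dict(tally(ISSUES)))
    for number, priority in triage(ISSUES):
        print(f"#{number}: {priority}")
```

Half of the listed issues are rated High, and they cover table recognition, OCR acceleration, offline deployment, landscape PDFs, and an illegal-instruction failure.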

Important Observations

There is a clear focus on improving the OCR capabilities and table extraction features within the project, as evidenced by numerous bug reports and enhancement requests.
Users are experiencing significant performance issues when processing larger documents or utilizing GPU acceleration, indicating a need for optimization.
The community is actively engaged in discussions about potential enhancements and bug fixes, which could lead to rapid improvements in future releases.
The presence of multiple language-specific issues suggests that localization and internationalization may be areas for further development.

This analysis highlights both the strengths and challenges faced by the MinerU project as it continues to evolve in response to user needs and technological advancements.

Report On: Fetch pull requests

Overview

The analysis of the pull requests (PRs) for the MinerU project reveals a mix of ongoing development efforts, feature enhancements, and bug fixes. There are currently two open PRs, while numerous others have been closed, indicating active engagement from contributors.

Summary of Pull Requests

Open Pull Requests

PR #467: Add Docker compose file and add docker to the readme.md
Created 2 days ago, this PR aims to facilitate Docker usage by adding a docker-compose.yaml file and updating the README with usage instructions. It addresses user needs for easier deployment alternatives compared to Anaconda.
PR #314: Add direct_ml mode
Created 19 days ago, this PR introduces a new device option for direct machine learning (direct_ml) in the PDF extraction process. It was edited recently but remains open.

Closed Pull Requests

PR #471: build(docker): update docker build step
Closed 1 day ago, this PR updated the Docker build process to switch from CPU to CUDA by default and improved model file downloading logic during builds.
PR #463: feat: add tablemaster_paddle
Merged 3 days ago, this PR added a new table recognition model to enhance the tool's capabilities in processing tables within documents.
PR #458: fix(ocr_mkcontent): improve language detection and content formatting
Merged 3 days ago, this PR optimized language detection logic for better formatting across multiple languages, particularly Asian languages.
PR #447: fix(pdf-extract): adjust box threshold for OCR detection to fix issue about OCR mode lost some line
Merged 3 days ago, this PR adjusted the detection box threshold so that OCR mode no longer drops some lines.
PR #418: Update MinerU_CLA.md
Closed without merging, this PR aimed to update the Contributor License Agreement document but lacked sufficient detail in its motivation.
PR #417: Added batch script to automate setup in windows.
Closed without merging, this PR proposed a Windows batch script for easier setup but did not meet contribution guidelines.
PR #410: Update README_zh-CN.md (#404)
Merged 10 days ago, this PR corrected a FAQ URL in the Chinese README file.
PR #400: Update bug_report.yml
Merged 8 days ago, this PR updated the bug report template to improve issue tracking.
PR #396: fix(para_split_v2): index out of range issue of span_text first char
Merged 3 days ago, this PR fixed an index-out-of-range error when reading the first character of span_text during paragraph splitting.

Notable Trends

The recent trend shows a strong focus on enhancing Docker support and improving OCR capabilities. The contributions also reflect an emphasis on internationalization with updates to documentation in multiple languages.

Analysis of Pull Requests

The pull requests for the MinerU project show a community actively engaged in improving the software's functionality and usability. Recent activity indicates a shift towards making the tool more accessible through Docker support, which matters for users who prefer containerized applications over traditional installation methods. The docker-compose.yaml file proposed in PR #467, which is still open, is particularly significant because it would simplify deployment for users unfamiliar with Docker configurations.

Moreover, there is a clear focus on enhancing Optical Character Recognition (OCR) capabilities. Several recent PRs (like #463 and #458) specifically target improvements in language detection and table recognition. This aligns with MinerU's goal of providing high-quality data extraction from complex documents, especially scientific literature where tables and multi-language content are prevalent.

Another notable aspect is the community's responsiveness to issues raised by users. For instance, adjustments made in PRs like #447 demonstrate an ongoing commitment to refining OCR accuracy based on user feedback. However, there are also instances of closed PRs that did not meet contribution standards or lacked clarity in their proposals (e.g., PRs #418 and #417). This highlights an area where clearer guidelines could enhance contributor engagement and reduce friction during the submission process.

The presence of multiple contributors signing the Contributor License Agreement (CLA) indicates a healthy collaborative environment. However, there are still challenges in onboarding contributors and ensuring that they fully understand the project's contribution guidelines.

In summary, MinerU is progressing well, with active contributions focused on usability and functionality, particularly Docker integration and improved OCR capabilities. There remain opportunities to refine the contribution process and clarify documentation to foster even greater community involvement.

Report On: Fetch commits

Repo Commits Analysis

Development Team and Recent Activity

Team Members and Activities

Xiaomeng Zhao (myhloli)
- Recent Activity:
- Created requirements-docker.txt and download_models.py.
- Fixed bugs related to OCR processing and table recognition.
- Collaborated with multiple team members on various features including Docker deployment, bug fixes, and documentation updates.
- Active in updating FAQs and README files for clarity and accuracy.
- Total of 96 commits in the last 30 days, indicating high engagement.
Kaiwen Liu (papayalove)
- Recent Activity:
- Co-authored several commits addressing bugs in the table recognition feature.
- Involved in adding new functionalities like bounding box drawing for models.
- Contributed to fixing issues related to OCR content formatting.
- Total of 3 commits in the last 30 days.
Aoyang Fang (Lincyaw)
- Recent Activity:
- Contributed by adding a Dockerfile for improved deployment.
- Total of 1 commit in the last 30 days.
Focusshang (sfk)
- Recent Activity:
- Worked on updating documentation, particularly README files for clarity.
- Involved in fixing issues related to FAQs.
- Total of 22 commits in the last 30 days.
xuchao
- Recent Activity:
- Focused on documentation updates and improvements.
- Total of 18 commits in the last 30 days.
icecraft
- Recent Activity:
- Contributed significantly with a total of 4 commits, focusing on fixing bugs and enhancing features related to PDF extraction.
- Total of 11714 changes across various files, indicating major contributions.
liukaiwen
- Recent Activity:
- Involved in fixing table recognition bugs and improving OCR functionality.
- Total of 11 commits in the last 30 days.
zuanzuanshao
- Recent Activity:
- Minor contributions with a focus on signing the CLA.
- Total of 1 commit in the last 30 days.
nutshellfool
- Recent Activity:
- Contributed by updating model download instructions.
- Total of 2 commits in the last 30 days.
conghui
- Recent Activity:
- Minor contributions with a focus on documentation.
- Total of 1 commit in the last 30 days.
dt-yy
- Recent Activity:
- Minor contributions with a focus on documentation updates.
- Total of 1 commit in the last 30 days.
eltociear
- Recent Activity:
- Contributed by adding a Japanese README file.
- Total of 1 commit in the last 30 days.
徐超
- Recent Activity:
- Minor contributions focused on logo updates and documentation improvements.
- Total of 2 commits in the last 30 days.
quyuan
- Recent Activity:
- Minor contributions with a focus on documentation updates.
- Total of 1 commit in the last 30 days.
yzztin
- Recent Activity:
- Contributed by adding new files for MinerU usage.
- Total of 1 commit in the last 30 days.

Patterns and Themes

The majority of recent activity is centered around bug fixes, feature enhancements (especially related to OCR and table recognition), and extensive documentation updates to improve user experience and clarity.
Xiaomeng Zhao is the most active contributor by commit count (96 commits in the last 30 days), which suggests strong leadership or ownership of project development tasks.
Collaboration is evident among team members, particularly when addressing complex features or bugs, as seen through co-authorships on several commits.
Documentation has been a significant focus area, reflecting an understanding of its importance for user adoption and support, especially given the project's complexity and target audience (researchers).
The introduction of Docker support suggests an effort to streamline deployment processes for users, enhancing accessibility for non-technical users or those unfamiliar with Python environments.

Conclusions

The development team is actively engaged in improving MinerU through collaborative efforts focused on both functionality and user documentation. The high volume of commits from key contributors indicates a robust development cycle aimed at refining existing features while also expanding capabilities within the tool.