MinerU, an open-source tool for converting PDFs and other document formats into machine-readable data, has experienced active development over the past month, with significant attention on enhancing Optical Character Recognition (OCR) capabilities and Docker support.
Recent issues and pull requests (PRs) indicate a strong focus on improving OCR accuracy and table extraction features. Users have reported challenges with table outputs and performance issues during GPU acceleration, suggesting a need for further optimization. Notable issues include #475 related to table recognition bugs and #474 concerning errors with OCR acceleration.
The docker-compose.yaml file added in PR #467 simplifies deployment for users unfamiliar with Python environments.
The MinerU project is actively evolving, driven by a committed development team focused on refining its core functionality and widening access through deployment options such as Docker.
Timespan | Opened | Closed | Comments | Labeled | Milestones |
---|---|---|---|---|---|
7 Days | 31 | 18 | 73 | 5 | 1 |
30 Days | 236 | 148 | 830 | 11 | 1 |
90 Days | 294 | 184 | 1074 | 21 | 1 |
All Time | 295 | 184 | - | - | - |
Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.
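As a rough reading aid, the ratio of closed to opened issues can be computed per timespan from the table above. The sketch below is only illustrative: the function names are hypothetical, and it assumes the Closed column can be compared directly with the Opened column, which the table does not state explicitly.

```python
# Opened/closed issue counts copied from the activity table above.
ACTIVITY = {
    "7 Days": (31, 18),
    "30 Days": (236, 148),
    "90 Days": (294, 184),
    "All Time": (295, 184),
}


def close_rate(opened: int, closed: int) -> float:
    """Return closed issues as a fraction of opened issues (0.0 if none opened)."""
    return closed / opened if opened else 0.0


def close_rates(activity: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Map each timespan to its close rate, rounded to three decimals."""
    return {
        span: round(close_rate(opened, closed), 3)
        for span, (opened, closed) in activity.items()
    }


if __name__ == "__main__":
    for span, rate in close_rates(ACTIVITY).items():
        print(f"{span}: {rate:.1%}")
```

On these figures the ratio stays close to 60% in every window, which suggests that maintainers are keeping pace with most, though not all, of the incoming reports.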
Developer | Branches | PRs | Commits | Files | Changes |
---|---|---|---|---|---|
icecraft | 2 | 4/4/0 | 4 | 45 | 11714 |
xuchao | 1 | 0/0/0 | 18 | 11 | 3516 |
Xiaomeng Zhao | 2 | 10/10/0 | 96 | 34 | 1849 |
sfk | 1 | 2/1/1 | 22 | 9 | 873 |
Ikko Eltociear Ashimine | 1 | 1/1/0 | 1 | 3 | 310 |
liukaiwen | 1 | 0/0/0 | 11 | 14 | 277 |
Kaiwen Liu | 1 | 6/6/0 | 3 | 15 | 266 |
yzz | 1 | 1/1/0 | 1 | 1 | 136 |
Aoyang Fang (Lincyaw) | 1 | 0/1/0 | 1 | 1 | 45 |
github-actions[bot] | 1 | 0/0/0 | 5 | 1 | 44 |
quyuan | 1 | 0/0/0 | 1 | 2 | 33 |
Conghui He | 1 | 0/0/0 | 1 | 1 | 7 |
drunkpig | 2 | 2/2/0 | 3 | 2 | 7 |
yyy | 1 | 1/1/0 | 1 | 1 | 6 |
徐超 | 1 | 0/0/0 | 2 | 4 | 6 |
Richard Li | 1 | 2/2/0 | 2 | 2 | 6 |
ZuanZuan | 1 | 1/1/0 | 1 | 1 | 2 |
Matthijs Zondervan (Matthijz98) | 0 | 1/0/0 | 0 | 0 | 0 |
邱玉梓 (viket-vista) | 0 | 1/0/0 | 0 | 0 | 0 |
Arpit Pathak (Thepathakarpit) | 0 | 2/0/2 | 0 | 0 | 0 |
PRs: created by that dev and opened/merged/closed-unmerged during the period
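The three-part PRs cell can be split into named counts when the table is processed programmatically. The following is a minimal sketch; `PRCounts` and `parse_pr_counts` are hypothetical names introduced here for illustration and are not part of any tool mentioned in this report.

```python
from typing import NamedTuple


class PRCounts(NamedTuple):
    """Counts from one PRs cell, in the order given by the legend above."""
    opened: int
    merged: int
    closed_unmerged: int


def parse_pr_counts(cell: str) -> PRCounts:
    """Parse an 'opened/merged/closed-unmerged' cell such as '2/1/1'."""
    parts = cell.strip().split("/")
    if len(parts) != 3:
        raise ValueError(f"expected three '/'-separated counts, got {cell!r}")
    opened, merged, closed_unmerged = (int(part) for part in parts)
    return PRCounts(opened, merged, closed_unmerged)


if __name__ == "__main__":
    print(parse_pr_counts("2/1/1"))
```

The parser deliberately does not require that merged counts stay below opened counts: the row for Aoyang Fang reads 0/1/0, so a PR can be merged in the period without having been opened in it.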
The GitHub repository for the MinerU project has seen a significant amount of recent activity, with 111 open issues currently logged. Notably, several issues are related to bugs and enhancements, indicating ongoing development and user engagement. A recurring theme among the issues is the challenge of accurately extracting tables and text from PDFs, particularly with OCR functionality.
Several issues have been reported regarding the extraction of tables, with users frequently noting that tables are being output as images rather than structured data. Additionally, there are multiple reports of performance issues, including out-of-memory errors and slow processing times when using GPU acceleration. This suggests that while the tool is powerful, it may require further optimization for handling large documents or complex layouts.
Here are some of the most recently created and updated issues:
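Issue #476: Create a WeChat group (Zh) / a Discord community group (En) to make discussion easier
Issue #475: Bug appears after using table recognition; the report references magic_pdf.json.
Issue #474: Error after enabling OCR acceleration
Issue #473: Request to lift the python=3.10 version restriction, or to support higher versions
Issue #470: In the provided example, annotations are ignored automatically; in general it would be better to recognize annotations as well, marked with a symbol that sets them apart
Issue #469: Request for standalone deployment that can be called through an API
Issue #468: [Question] Is this solution best for creating knowledge base for AI/LLM memory in your opinion?
Issue #466: Deployment in an offline environment fails with: Failed to resolve 'paddleocr.bj.bcebos.com'
Issue #465: Landscape-orientation PDFs parse poorly; the layout of images and text is scrambled
Issue #464: After local deployment, running the command prints: Illegal instruction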
This analysis highlights both the strengths and challenges faced by the MinerU project as it continues to evolve in response to user needs and technological advancements.
The analysis of the pull requests (PRs) for the MinerU project reveals a mix of ongoing development efforts, feature enhancements, and bug fixes. There are currently two open PRs, while numerous others have been closed, indicating active engagement from contributors.
PR #467: Add Docker compose file and add docker to the readme.md
Created 2 days ago, this PR aims to facilitate Docker usage by adding a docker-compose.yaml file and updating the README with usage instructions. It addresses user needs for easier deployment alternatives compared to Anaconda.
PR #314: Add direct_ml mode
Created 19 days ago, this PR introduces a new device option for direct machine learning (direct_ml) in the PDF extraction process. It was edited recently but remains open.
PR #471: build(docker): update docker build step
Closed 1 day ago, this PR updated the Docker build process to switch from CPU to CUDA by default and improved model file downloading logic during builds.
PR #463: feat: add tablemaster_paddle
Merged 3 days ago, this PR added a new table recognition model to enhance the tool's capabilities in processing tables within documents.
PR #458: fix(ocr_mkcontent): improve language detection and content formatting
Merged 3 days ago, this PR optimized language detection logic for better formatting across multiple languages, particularly Asian languages.
PR #447: fix(pdf-extract): adjust box threshold for OCR detection to fix issue about OCR mode lost some line
Merged 3 days ago, it adjusted detection box thresholds to improve OCR accuracy.
PR #418: Update MinerU_CLA.md
Closed without merging, this PR aimed to update the Contributor License Agreement document but lacked sufficient detail in its motivation.
PR #417: Added batch script to automate setup in windows.
Closed without merging, this PR proposed a Windows batch script for easier setup but did not meet contribution guidelines.
PR #410: Update README_zh-CN.md (#404)
Merged 10 days ago, this PR corrected a FAQ URL in the Chinese README file.
PR #400: Update bug_report.yml
Merged 8 days ago, this PR updated the bug report template to improve issue tracking.
PR #396: fix(para_split_v2): index out of range issue of span_text first char
Merged 3 days ago, it fixed an index out-of-range error in text processing logic.
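To illustrate why a detection box threshold such as the one adjusted in PR #447 can cause lines to go missing, the sketch below filters detected text boxes by confidence score. It is not MinerU's code: the `DetectedBox` type, the `filter_boxes` function, the scores, and the threshold values are all invented for illustration, and the report does not state which values the PR actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectedBox:
    """A hypothetical text-detection result: recognized text and a confidence score."""
    text: str
    score: float


def filter_boxes(boxes: list[DetectedBox], box_thresh: float) -> list[DetectedBox]:
    """Keep only boxes whose score is at least box_thresh."""
    return [box for box in boxes if box.score >= box_thresh]


# Invented example data: one faint line scores well below the others.
boxes = [
    DetectedBox("Title line", 0.92),
    DetectedBox("Faint footnote line", 0.41),
    DetectedBox("Body line", 0.77),
]

strict = filter_boxes(boxes, box_thresh=0.6)   # the faint line is dropped
lenient = filter_boxes(boxes, box_thresh=0.3)  # all three lines are kept

if __name__ == "__main__":
    print([box.text for box in strict])
    print([box.text for box in lenient])
```

The trade-off runs both ways: a lower threshold keeps faint lines but may also admit spurious boxes, so any such value is typically tuned against real documents.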
The recent trend shows a strong focus on enhancing Docker support and improving OCR capabilities. The contributions also reflect an emphasis on internationalization with updates to documentation in multiple languages.
The pull requests for the MinerU project illustrate a vibrant community actively engaged in improving the software's functionality and usability. The recent activity indicates a shift towards making the tool more accessible through Docker support, which is crucial for users who prefer containerized applications over traditional installation methods. The addition of docker-compose.yaml in PR #467 is particularly significant as it simplifies deployment processes for users unfamiliar with Docker configurations.
Moreover, there is a clear focus on enhancing Optical Character Recognition (OCR) capabilities. Several recent PRs (like #463 and #458) specifically target improvements in language detection and table recognition. This aligns with MinerU's goal of providing high-quality data extraction from complex documents, especially scientific literature where tables and multi-language content are prevalent.
Another notable aspect is the community's responsiveness to issues raised by users. For instance, adjustments made in PRs like #447 demonstrate an ongoing commitment to refining OCR accuracy based on user feedback. However, there are also instances of closed PRs that did not meet contribution standards or lacked clarity in their proposals (e.g., PRs #418 and #417). This highlights an area where clearer guidelines could enhance contributor engagement and reduce friction during the submission process.
The presence of multiple contributors signing off on CLA agreements indicates a healthy collaborative environment. However, there are still challenges regarding contributor onboarding and ensuring that all contributors understand the project's contribution guidelines fully.
In summary, while MinerU is progressing well with active contributions focused on enhancing usability and functionality—particularly through Docker integration and improved OCR capabilities—there remain opportunities for refining contribution processes and documentation clarity to foster even greater community involvement.
Xiaomeng Zhao (myhloli): changes involving requirements-docker.txt and download_models.py
Kaiwen Liu (papayalove)
Aoyang Fang (Lincyaw)
Focusshang (sfk)
xuchao
icecraft
liukaiwen
zuanzuanshao
nutshellfool
conghui
dt-yy
eltociear
徐超
quyuan
yzztin
The development team is actively engaged in improving MinerU through collaborative efforts focused on both functionality and user documentation. The high volume of commits from key contributors indicates a robust development cycle aimed at refining existing features while also expanding capabilities within the tool.