Docling, developed by DS4SD, is a Python-based tool for converting various document formats into HTML, Markdown, and JSON. It excels at PDF parsing and integrates with AI frameworks like LlamaIndex and LangChain. The project is popular, with over 15,000 GitHub stars, and is actively maintained. Current work focuses on expanding functionality and improving OCR capabilities.
[…] `NavigableString`.
Recent activities show a strong focus on enhancing document parsing capabilities and addressing user-reported issues promptly.
A dependency resolution issue with `torchvision` on Python 3.13 (#596) could hinder users upgrading to newer Python versions.

Timespan | Opened | Closed | Comments | Labeled | Milestones |
---|---|---|---|---|---|
7 Days | 44 | 24 | 61 | 0 | 1 |
30 Days | 136 | 81 | 265 | 2 | 1 |
90 Days | 245 | 146 | 614 | 36 | 1 |
All Time | 260 | 157 | - | - | - |
Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.
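The net growth of the backlog can be read off the table above by subtracting closed issues from opened issues in each timespan. The short Python sketch below does that arithmetic; the counts are copied from the table, while the names `ISSUE_COUNTS` and `backlog_summary` are illustrative only.

```python
# Issue counts copied from the table above: (opened, closed) per timespan.
ISSUE_COUNTS = {
    "7 Days": (44, 24),
    "30 Days": (136, 81),
    "90 Days": (245, 146),
    "All Time": (260, 157),
}


def backlog_summary(counts):
    """Return {timespan: (net_change_in_open_issues, closures_per_opened_issue)}."""
    summary = {}
    for span, (opened, closed) in counts.items():
        net = opened - closed  # how much the open backlog grew in this window
        # Closures in the window divided by openings in the window. The closed
        # issues are not necessarily the same ones that were opened.
        ratio = round(closed / opened, 2)
        summary[span] = (net, ratio)
    return summary


if __name__ == "__main__":
    for span, (net, ratio) in backlog_summary(ISSUE_COUNTS).items():
        print(f"{span:>8}: backlog +{net}, closed/opened = {ratio}")
```

Under these numbers the backlog grew in every window, with roughly six issues closed for every ten opened.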
Developer | Branches | PRs | Commits | Files | Changes |
---|---|---|---|---|---|
(name missing) | 1 | 0/0/0 | 1 | 103 | 131446 |
Cesar Berrospi Ramis (ceberam) | 2 | 2/0/0 | 6 | 35 | 54515 |
Christoph Auer | 3 | 8/7/2 | 40 | 76 | 13736 |
Michele Dolfi | 2 | 6/7/0 | 7 | 7 | 1799 |
Panos Vagenas | 2 | 3/5/0 | 6 | 30 | 1388 |
Nikos Livathinos | 3 | 4/4/0 | 12 | 38 | 1040 |
Peter W. J. Staar | 1 | 1/1/0 | 1 | 7 | 639 |
github-actions[bot] | 2 | 0/0/0 | 9 | 2 | 128 |
Maxim Lysak | 1 | 3/3/0 | 3 | 1 | 70 |
Abhishek Kumar | 2 | 1/1/1 | 2 | 3 | 62 |
Gaspard Petit | 1 | 1/1/1 | 1 | 1 | 16 |
Aini | 2 | 2/1/1 | 2 | 2 | 4 |
guglie | 1 | 0/1/0 | 1 | 1 | 3 |
Sander Maijers | 1 | 1/1/0 | 1 | 1 | 1 |
Ben Rood (bash99) | 0 | 1/0/1 | 0 | 0 | 0 |
Simonas Jakubonis (simjak) | 0 | 1/0/1 | 0 | 0 | 0 |
Lucas Morin (lucas-morin) | 0 | 2/0/1 | 0 | 0 | 0 |
PRs: created by that dev and opened/merged/closed-unmerged during the period
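The PRs column packs three counts into one cell. As a small illustration of how to read it, the sketch below splits a cell such as `8/7/2` into named fields. `PrCounts` and `parse_pr_counts` are names made up for this example and are not part of any Docling tooling.

```python
from typing import NamedTuple


class PrCounts(NamedTuple):
    """Counts from one PRs cell, in the order given by the table legend."""
    opened: int
    merged: int
    closed_unmerged: int


def parse_pr_counts(cell: str) -> PrCounts:
    """Parse a PRs cell such as '8/7/2' into named integer counts."""
    opened, merged, closed_unmerged = (int(part) for part in cell.strip().split("/"))
    return PrCounts(opened, merged, closed_unmerged)


if __name__ == "__main__":
    counts = parse_pr_counts("8/7/2")
    print(counts.opened, counts.merged, counts.closed_unmerged)
```

Note that the merged count can exceed the opened count within a period (for example `6/7/0`), presumably because a PR opened earlier was merged during it.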
Risk | Level (1-5) | Rationale |
---|---|---|
Delivery | 4 | The project faces significant delivery risks due to a backlog of unresolved issues and delays in pull request reviews. The net increase in unresolved issues over the past 90 days indicates potential resource constraints or prioritization challenges. Additionally, several pull requests remain open for extended periods, suggesting bottlenecks in the review process. The lack of thorough documentation and testing updates further exacerbates these risks, as seen in PRs like #240 and #259, which are incomplete and lack necessary tests. |
Velocity | 4 | The project's velocity is at risk due to the slow rate of issue resolution compared to the rate of new issues being opened. The backlog of unresolved issues is growing, indicating that the team's capacity may be insufficient to keep up with demand. Additionally, draft pull requests remaining open for long periods suggest delays in development progress. The concentration of changes among a few developers also poses a risk if these key contributors become unavailable. |
Dependency | 3 | The project relies on a wide range of external libraries, as indicated by the poetry.lock file. While this provides flexibility, it also increases the risk of compatibility issues or failures if these dependencies are not regularly updated. Specific dependency issues, such as those with 'torchvision' on Python 3.13 (#596), highlight potential risks that could impact delivery timelines. |
Team | 3 | The team faces risks related to coordination and communication, as evidenced by unmet review requirements for several pull requests (#606, #557, #530). This suggests possible bottlenecks in the review process and potential challenges in team collaboration. The reliance on key individuals for significant code contributions also poses a risk if these contributors become unavailable. |
Code Quality | 4 | The code quality is at risk due to the high volume of changes being made by a few developers without sufficient peer review or documentation updates. Pull requests like #240 and #259 lack thorough documentation and tests, which could lead to maintainability issues and technical debt accumulation. The presence of parsing errors and unexpected behavior in critical functionalities further underscores these risks. |
Technical Debt | 4 | The project is accumulating technical debt due to incomplete documentation, insufficient testing, and unresolved checklist items across multiple pull requests. The backlog of unresolved issues also contributes to this risk by indicating potential prioritization challenges or resource constraints that prevent timely resolution. |
Test Coverage | 4 | Test coverage is insufficient to catch bugs and regressions effectively. Several pull requests, such as #259 and #495, lack comprehensive tests, which could let issues reach production environments undetected. The absence of detailed documentation further complicates efforts to ensure robust test coverage. |
Error Handling | 4 | Error handling is inadequate across several areas of the project. Issues such as parsing errors (#351, #435) and dependency resolution problems (#596) highlight gaps in error handling mechanisms. The lack of comprehensive error handling in examples like 'rag_haystack.ipynb' further underscores this risk. |
Recent GitHub issue activity for the Docling project has been robust, with a significant number of issues being opened and closed in the past few days. The issues range from bug reports and feature requests to questions about usage and enhancements. Notably, there have been several issues related to PDF parsing, OCR functionality, and integration with other tools like LlamaIndex and LangChain.
Several issues highlight anomalies or complications, such as:
A `PermissionError` when trying to convert documents from a directory (#607), indicating potential file access or permission issues.
A dependency resolution issue with `torchvision` on Python 3.13 (#596), suggesting compatibility challenges with newer Python versions.
Common themes include: […]
#607: PermissionError when converting documents from a directory.
#602: Enhancement request for EasyOCR to use the `recog_network` parameter.
#596: Dependency resolution issue with `torchvision` on Python 3.13.
The recent activity indicates active engagement from both users and maintainers in addressing issues and enhancing the tool's functionality. The focus on OCR improvements and integration capabilities suggests ongoing efforts to expand Docling's utility in diverse document processing scenarios.
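For the `PermissionError` report (#607), one defensive pattern for batch conversion is to skip anything that is not a regular file and to record per-file failures rather than aborting the whole run. The sketch below is a generic illustration under stated assumptions: `convert_one` is a hypothetical stand-in for whatever converter is used and is not a Docling API, and the actual cause of #607 is not known from this report.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def convert_directory(
    source_dir: Path,
    convert_one: Callable[[Path], str],
) -> Tuple[Dict[Path, str], List[Tuple[Path, str]]]:
    """Convert every regular file in source_dir, collecting failures.

    convert_one is a hypothetical callable supplied by the caller.
    Returns (converted, failures).
    """
    converted: Dict[Path, str] = {}
    failures: List[Tuple[Path, str]] = []
    for path in sorted(source_dir.iterdir()):
        # Opening a directory as if it were a file raises PermissionError on
        # Windows, so non-files are skipped up front.
        if not path.is_file():
            continue
        try:
            converted[path] = convert_one(path)
        except PermissionError as exc:
            # Record the failure and continue with the remaining files.
            failures.append((path, str(exc)))
    return converted, failures
```

A caller can then report the `failures` list at the end of the run, which keeps one unreadable file from hiding the results of all the others.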
#618: test: generate file from CLI in a temporary directory
#606: feat: create a backend to parse USPTO patents into DoclingDocument
[…] `InputFormat` instances with the same MIME type.
#557: feat: Create a backend to transform PubMed XML files to DoclingDocument
#530: feat: Updated Layout processing with forms and key-value areas
#495: fix: Skip NavigableString in HTML parsing
Skips `NavigableString` elements during HTML parsing to avoid errors.
#474: feat: Add PPTX notes slides
#451: docs: add Weaviate RAG recipe notebook
#259 & #240 (Drafts): feat & dev/update html parser with h1
#616 (Closed without merge): feat: New layout processing with nested forms and key-value areas
#615 & #613 (Merged): docs & feat related to Haystack RAG example and EasyOCR parameter addition
#608 (Merged): docs fix for accelerator example path
Other notable closed PRs include enhancements in AI runtime configuration (#514), handling of unsupported formats (#429), and various bug fixes (#496, #558).
The DS4SD/docling project is actively maintained with several ongoing developments aimed at enhancing functionality and fixing bugs. The project appears to be focusing on expanding its parsing capabilities (e.g., USPTO patents, PubMed XML) while also improving existing features like layout processing and OCR support. However, some process improvements could be made in terms of managing review requirements and addressing long-standing draft PRs.
docs/examples/rag_haystack.ipynb
Dependencies: `docling-haystack`, `haystack-ai`, and `sentence-transformers`. These are installed using `%pip install`, which is suitable for Jupyter environments.
[…] `EXPORT_TYPE` allows for flexibility in execution.
docling/datamodel/pipeline_options.py
[…] `AcceleratorDevice` and `TableFormerMode`, enhancing readability and maintainability.
docling/models/easyocr_model.py
The `EasyOcrModel` class inherits from `BaseOcrModel`, following good object-oriented design principles.
mkdocs.yml
CHANGELOG.md
poetry.lock
Overall, the source files demonstrate good coding practices with clear organization, effective use of modern Python features, and attention to detail in configuration management. Improvements could be made in error handling across some files to enhance robustness.
Panos Vagenas (vagenas)
Aini (itsainii)
[…] `recog_network`.
[…] `easyocr_model.py` and `pipeline_options.py`.
Nikos Livathinos (nikos-livathinos)
Christoph Auer (cau-git)
Abhishek Kumar (ab-shrek)
Michele Dolfi (dolfim-ibm)
Cesar Berrospi Ramis (ceberam)
Active Development: The team is actively developing new features, fixing bugs, and improving existing functionalities. This includes significant contributions to both the core functionality and documentation.
Collaboration: There is evidence of collaboration among team members, especially in the development of new features like GPU accelerator support, where multiple contributors are involved.
Focus on Performance: Recent commits indicate a focus on enhancing performance through GPU support and optimizing document parsing processes.
Documentation Updates: Continuous updates to documentation suggest an emphasis on maintaining clarity and usability for end-users.
Testing and Validation: Regular updates to test cases and ground-truth data reflect a commitment to ensuring code reliability and correctness.
Feature Expansion: The addition of new features like the USPTO backend parser indicates ongoing efforts to expand the capabilities of the software to handle more document types and use cases.
Overall, the development team is actively engaged in enhancing the Docling project through feature development, performance improvements, and robust testing practices.