The "Docling" project, under the DS4SD organization, is a Python-based tool for converting various document formats into Markdown and JSON. It features advanced PDF understanding, OCR support, and integration with LlamaIndex and LangChain. The project is actively developed, with a substantial community following.
Panos Vagenas (vagenas)
Michele Dolfi (dolfim-ibm)
Christoph Auer (cau-git)
Peter W. J. Staar (PeterStaar-IBM)
Maxim Lysak (maxmnemonic)
Bill Murdock (jwm4)
Mohamed Ali (moli-debugger)
Timespan | Opened | Closed | Comments | Labeled | Milestones |
---|---|---|---|---|---|
7 Days | 16 | 2 | 10 | 14 | 1 |
30 Days | 31 | 21 | 42 | 27 | 1 |
90 Days | 59 | 35 | 103 | 43 | 1 |
All Time | 60 | 35 | - | - | - |
Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.
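The net change in open issues for each window follows directly from the Opened and Closed columns above. The sketch below works through that arithmetic with the figures copied from the table. The dictionary and function names are illustrative only, and it assumes Opened and Closed count events within the same window.

```python
# Opened/closed issue counts copied from the issue-activity table above.
ISSUE_ACTIVITY = {
    "7 Days": (16, 2),
    "30 Days": (31, 21),
    "90 Days": (59, 35),
    "All Time": (60, 35),
}


def net_backlog_change(opened: int, closed: int) -> int:
    """Issues opened minus issues closed in a window (positive = backlog grew)."""
    return opened - closed


def close_ratio(opened: int, closed: int) -> float:
    """Fraction of opened issues matched by a close in the same window."""
    return closed / opened if opened else 0.0


if __name__ == "__main__":
    for window, (opened, closed) in ISSUE_ACTIVITY.items():
        print(
            f"{window}: net change {net_backlog_change(opened, closed):+d}, "
            f"close ratio {close_ratio(opened, closed):.2f}"
        )
```

Read this way, the 7-day window has the widest gap between opened and closed issues, which matches the Delivery rationale later in the report.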
Developer | Branches | PRs | Commits | Files | Changes |
---|---|---|---|---|---|
(name missing in source) | 1 | 0/0/0 | 1 | 72 | 37873 |
Peter W. J. Staar | 3 | 3/3/1 | 11 | 63 | 16483 |
Christoph Auer | 2 | 2/2/0 | 3 | 69 | 9110 |
Bill Murdock (jwm4) | 1 | 1/0/0 | 2 | 1 | 1182 |
Panos Vagenas | 4 | 6/5/0 | 7 | 13 | 811 |
Maxim Lysak | 1 | 3/3/1 | 3 | 7 | 626 |
Maksym Lysak | 1 | 0/0/0 | 5 | 3 | 441 |
Michele Dolfi | 2 | 7/6/0 | 8 | 12 | 412 |
Mohamed Ali | 1 | 0/1/0 | 1 | 1 | 58 |
github-actions[bot] | 1 | 0/0/0 | 4 | 2 | 51 |
Johnny Salazar (cepera-ang) | 0 | 1/0/0 | 0 | 0 | 0 |
PRs: created by that dev and opened/merged/closed-unmerged during the period
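Each PRs cell packs three counts into one string, for example `3/3/1` for three opened, three merged, and one closed without merging. A small sketch of how such a cell could be unpacked, assuming the slash-separated format described above (`PRCounts` and `parse_pr_cell` are hypothetical names, not part of any reporting tool):

```python
from typing import NamedTuple


class PRCounts(NamedTuple):
    """Counts from one PRs cell: opened/merged/closed-unmerged."""

    opened: int
    merged: int
    closed_unmerged: int


def parse_pr_cell(cell: str) -> PRCounts:
    """Parse a cell such as '3/3/1' into its three integer counts."""
    parts = cell.strip().split("/")
    if len(parts) != 3:
        raise ValueError(f"expected 'opened/merged/closed', got {cell!r}")
    opened, merged, closed_unmerged = (int(p) for p in parts)
    return PRCounts(opened, merged, closed_unmerged)


if __name__ == "__main__":
    print(parse_pr_cell("7/6/0"))
```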
Risk | Level (1-5) | Rationale |
---|---|---|
Delivery | 4 | The project faces significant delivery risks due to a backlog of unresolved issues and incomplete pull requests. The recent GitHub issues activity shows a high number of open issues compared to closed ones, indicating potential delays in achieving project goals. Notably, critical issues like #213 (import error) and #210 (incorrect table recognition) remain unresolved, which could severely impact delivery timelines. Additionally, several pull requests, such as PR #95 and PR #132, have been in draft status for extended periods without significant updates, suggesting bottlenecks in the review or development process. |
Velocity | 4 | The project's velocity is at risk due to the growing backlog of issues and the slow progress on key pull requests. While there is active development with contributions from multiple developers, the disparity in commit volumes and the presence of long-standing draft pull requests indicate potential coordination challenges. The lack of progress on drafts like PR #95 and PR #132 suggests that prioritization or completion of feature development is lagging, which could slow down overall project momentum. |
Dependency | 3 | The project exhibits moderate dependency risks due to its reliance on various external libraries and systems. The update to torch dependencies (#190) highlights potential compatibility issues if not managed carefully. Additionally, feature requests such as support for arxiv HTML papers (#209) introduce further dependency risks as they may rely on external systems that require careful integration. Ensuring that all dependencies are up-to-date and compatible is crucial to mitigate these risks. |
Team | 3 | Team-related risks are moderate, primarily due to potential communication challenges and uneven contribution levels among team members. The low number of comments on issues suggests limited discussion or collaboration on resolving them. Additionally, the disparity in commit volumes among developers indicates possible coordination challenges that could affect team dynamics and efficiency. |
Code Quality | 4 | The risk to code quality is significant due to the high volume of changes across multiple files without thorough testing or documentation. Pull requests often lack necessary examples and tests, which poses a risk of introducing bugs or incomplete features into the codebase. The substantial changes made by individual contributors, such as Peter W. J. Staar's 11 commits impacting 63 files, highlight the need for rigorous review processes to maintain high code quality. |
Technical Debt | 4 | Technical debt is accumulating due to unresolved issues related to core functionalities and incomplete pull requests. The backlog of unresolved issues, such as poor mathematical expression extraction (#212), indicates underlying flaws that need addressing to prevent future maintenance challenges. Moreover, the absence of documentation and tests in several pull requests suggests an accumulation of technical debt if these gaps are not addressed promptly. |
Test Coverage | 5 | Test coverage is a critical risk area as many pull requests lack necessary tests and documentation. This gap poses significant risks to delivery and code quality since untested features may introduce unforeseen issues into the codebase. The absence of comprehensive testing across multiple PRs highlights a systemic issue that needs urgent attention to ensure reliable software performance. |
Error Handling | 4 | Error handling presents a notable risk due to insufficient testing and documentation in new features. While some efforts have been made to address specific error handling issues, such as PR #214's UTF-8 encoding fix, the overall lack of comprehensive error handling mechanisms across the project could lead to undetected errors and reliability concerns. |
Recent GitHub issue activity for the DS4SD/docling project shows 16 new issues opened in the last 7 days, indicating active development and user engagement. The issues range from minor documentation errors to significant functionality requests and bug reports. Several concern document conversion accuracy, particularly for mathematical expressions and table recognition, which suggests ongoing challenges in these areas. There are also requests for new features, such as support for arXiv HTML parsing and exporting Markdown with image references.
#215: Typo in documentation (`docs/usage.md`).
#213: Import error with 'PipelineOptions'.
#212: Poor extraction of mathematical expressions.
#211: Request to export markdown with image references.
#210: Incorrect table recognition results.
#181: AttributeError in HierarchalChunker with LlamaIndex integration.
#176: Error in `export_to_document_tokens` function call.
#174: Markdown export issue with underscores needing escape.
#166: Unable to render as doc tags post-update.
#163: Input format discovery improvement suggestion.
This analysis highlights the project's ongoing efforts to refine its document conversion capabilities while also addressing user-reported bugs and feature requests to enhance overall functionality and user experience.
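Issue #174 in the list above concerns underscores that need escaping in Markdown export. As a rough illustration of what such escaping involves, here is a minimal sketch that backslash-escapes underscores not already escaped. It is not Docling's implementation: `escape_underscores` is a hypothetical helper, and it deliberately ignores code spans, URLs, and double-backslash edge cases that a real exporter would need to handle.

```python
import re

# Match an underscore that is not immediately preceded by a backslash.
_UNESCAPED_UNDERSCORE = re.compile(r"(?<!\\)_")


def escape_underscores(text: str) -> str:
    """Backslash-escape underscores so Markdown does not read them as emphasis."""
    return _UNESCAPED_UNDERSCORE.sub(r"\\_", text)


if __name__ == "__main__":
    print(escape_underscores("export_to_document_tokens"))
```

Underscores that already carry a backslash are left alone, so applying the function twice gives the same result as applying it once.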
PR #214:
PR #203:
PR #194:
PR #193:
PR #132:
PR #95:
PR #196:
PR #183:
Closed without Merge (e.g., PR #159, #157):
docling/document_converter.py
[...] `pydantic`), and project-specific modules.
[...] `FormatOption` class. This is a good use of inheritance to manage different document types.
[...] `pydantic` validators ensures that the data is validated before processing, which is a robust design choice.
docling/backend/html_backend.py
The `HTMLDocumentBackend` class handles HTML document conversion. It initializes with an input document and manages parsing through BeautifulSoup.
[...] `analyse_element` are quite lengthy and handle many cases, which can be simplified.
docling/backend/md_backend.py
[...] `marko` library for Markdown parsing, which is suitable for this backend's needs.
The `MarkdownDocumentBackend` class processes Markdown documents. It includes initialization, validation, unloading, and conversion methods.
pyproject.toml
poetry.lock
[...] `poetry update` to ensure dependencies are up-to-date with security patches.
CHANGELOG.md
[...] (`ci.yml`, `cd.yml`, `cd-docs.yml`, `ci-docs.yml`)
[...] (`ci.yml`) and deployment (`cd.yml`, `cd-docs.yml`) processes.
[...] (`ci-docs.yml`).
Panos Vagenas (vagenas)
Michele Dolfi (dolfim-ibm)
Christoph Auer (cau-git)
Peter W. J. Staar (PeterStaar-IBM)
Maxim Lysak (maxmnemonic)
Bill Murdock (jwm4)
Mohamed Ali (moli-debugger)
Collaborative Efforts: There is a strong collaborative effort among team members, with multiple co-authored commits and shared responsibilities across features and fixes.
Focus on Documentation and Usability: Several updates were made to documentation files, indicating an emphasis on improving user guidance and project usability.
Continuous Improvement: The team is actively involved in refining existing features, such as improving the CLI, enhancing document parsing capabilities, and fixing bugs related to document formatting.
Feature Expansion: New features like advanced chunking, profiling options, and support for additional document formats (AsciiDoc, Markdown) are being actively developed.
Active Branch Management: Multiple branches are being used for feature development, bug fixes, and documentation updates, showing organized development practices.
Overall, the development team is actively extending the functionality of the Docling project and keeping its documentation up to date, although the risk assessment above identifies test coverage as the area most in need of attention.