The olmOCR project, developed by the Allen Institute for Artificial Intelligence (AI2), is a Python toolkit designed to streamline the training of language models for processing PDF documents. It focuses on linearizing PDFs for large language model datasets and training. The project is actively maintained and exhibits strong community interest, as evidenced by its significant number of stars and forks. However, it faces challenges with dependency management and hardware compatibility issues.
Jake Poznanski (jakep-allenai)
Aman Rangapur (aman-17)
modeling_molmo.py
Kyle Lo (kyleclo)
kylel/elo branch

Timespan | Opened | Closed | Comments | Labeled | Milestones |
---|---|---|---|---|---|
7 Days | 30 | 5 | 43 | 19 | 1 |
14 Days | 30 | 5 | 43 | 19 | 1 |
All Time | 30 | 5 | - | - | - |
Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.
Developer | Branches | PRs | Commits | Files | Changes |
---|---|---|---|---|---|
Jake Poznanski | 1 | 0/0/0 | 54 | 55 | 7283 |
“aman-17” | 1 | 0/0/0 | 7 | 13 | 1645 |
Aman Rangapur | 1 | 0/0/0 | 2 | 23 | 300 |
Kyle Lo | 1 | 0/0/0 | 1 | 4 | 215 |
Ikko Eltociear Ashimine (eltociear) | 0 | 1/0/0 | 0 | 0 | 0 |
PRs: created by that dev and opened/merged/closed-unmerged during the period
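
The commit counts in the developer table can be turned into per-developer shares, which makes the concentration of work easier to see. The following is a small sketch using only the figures from the table; the variable names are illustrative, and “aman-17” and Aman Rangapur are kept as separate rows because the table lists them that way.

```python
# Commit counts copied from the developer table above.
commits = {
    "Jake Poznanski": 54,
    "aman-17": 7,
    "Aman Rangapur": 2,
    "Kyle Lo": 1,
    "Ikko Eltociear Ashimine": 0,
}

total = sum(commits.values())

# Fraction of all commits in the period attributed to each developer.
shares = {dev: count / total for dev, count in commits.items()}

# Print developers from largest to smallest share.
for dev, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{dev}: {commits[dev]} of {total} commits ({share:.1%})")
```

With these figures, one developer accounts for 54 of 64 commits in the period, which is the imbalance the Velocity and Team risk ratings refer to.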
Risk | Level (1-5) | Rationale |
---|---|---|
Delivery | 4 | The project faces significant delivery risks due to a backlog of unresolved issues and pull requests. The imbalance between issues opened and closed, with 30 issues opened and only 5 closed recently, indicates a growing backlog (#45689). Additionally, critical bugs like those affecting RTX A6000 GPUs (#55) and Jupyter on VSCode (#54) suggest severe technical challenges that could impede delivery. The prolonged open status of dependency update PRs (#26, #25, #22) further complicates timely delivery, as unresolved dependencies could lead to compatibility issues. |
Velocity | 4 | Velocity is at risk due to uneven workload distribution among developers and a backlog of unresolved PRs. Jake Poznanski's significant contributions (54 commits) contrast sharply with minimal activity from others, indicating potential bottlenecks if key individuals become unavailable (#45690). The lack of PR activity suggests integration challenges that could slow progress. Additionally, the accumulation of unresolved issues and feature requests (#45693) indicates potential scope creep, further affecting velocity. |
Dependency | 3 | While the project proactively manages dependencies using tools like Dependabot, the backlog of unresolved dependency update PRs (#26, #25, #22) poses risks. These prolonged open statuses suggest potential compatibility or testing issues that could impact stability. The reliance on specific hardware (e.g., RTX 4090) and software configurations also introduces dependency risks if these components become outdated or unsupported (#45688). |
Team | 3 | The team faces risks related to workload distribution and potential burnout. The heavy reliance on key individuals like Jake Poznanski for substantial contributions indicates possible bottlenecks (#45690). While active discussion on issues suggests good communication, the imbalance in contributions could lead to team dynamics issues or burnout if not addressed. |
Code Quality | 3 | Code quality is at moderate risk due to the lack of significant code changes or new features in recent PRs, which are mostly routine dependency updates (#45692). The absence of thorough review processes for these changes raises concerns about code quality assurance. Additionally, complex scripts like 'olmocr/bench/convert.py' and 'olmocr/bench/miners/automine.py' introduce risks if not adequately documented or tested (#45695). |
Technical Debt | 4 | Technical debt is a significant concern due to unresolved bugs related to CUDA memory management and PDF processing (#50, #49). The backlog of open PRs and issues suggests accumulating debt that could hinder future development. The reliance on key individuals for merging changes also poses risks if these tasks are delayed or overlooked. |
Test Coverage | 3 | Test coverage is at moderate risk due to the lack of explicit test details in complex scripts like 'olmocr/bench/convert.py' and 'olmocr/bench/miners/automine.py' (#45695). The absence of PR activity further suggests potential gaps in testing processes before integration. This could lead to undetected bugs affecting code quality and functionality. |
Error Handling | 3 | Error handling is moderately risky as current scripts show limited error management capabilities. For example, 'olmocr/bench/miners/automine.py' has limited error handling in API interactions, which could lead to unhandled exceptions (#45695). Enhancing error handling mechanisms across the codebase is necessary to improve reliability. |
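
The Error Handling row notes limited error handling around API interactions. One common mitigation is to retry transient failures with exponential backoff. The sketch below illustrates that pattern only; `call_with_retries` and `flaky_api_call` are hypothetical names, not functions from olmOCR, and the choice of exception types to retry is an assumption.

```python
import time


def call_with_retries(func, *args, retries=3, base_delay=0.5,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep, **kwargs):
    """Call func, retrying on transient errors with exponential backoff.

    The `sleep` parameter is injectable so the delay can be skipped in tests.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except retry_on as exc:
            last_error = exc
            # Wait 0.5s, 1s, 2s, ... between attempts, but not after the last one.
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"call failed after {retries} attempts") from last_error


# Hypothetical stand-in for an external API call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_api_call(text):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return text.strip()


# Delays are skipped here so the demonstration runs instantly.
result = call_with_retries(flaky_api_call, "  hello ", sleep=lambda seconds: None)
print(result, attempts["count"])
```

Errors outside `retry_on` propagate immediately, so programming mistakes are not masked by the retry loop.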
The olmOCR project has experienced a surge of activity with numerous issues created in the past few days, indicating active engagement from the community. The issues range from feature requests and bug reports to inquiries about deployment and usage.
Several issues exhibit notable anomalies or complications. For instance, #55 reports a persistent execution issue with RTX A6000 GPUs, with repeated connection failures and warnings about processor speed settings. This suggests a critical bug affecting users with high-end hardware. Issue #54 describes an import error in Jupyter on VSCode, highlighting potential environment-specific challenges. Additionally, #50 discusses CUDA out-of-memory errors when processing large PDFs, pointing to scalability limitations in the current pipeline.
Common themes among the issues include deployment challenges (e.g., Docker support in #59 and #46), feature requests for additional functionalities like text detection box coordinates (#43) and NER capabilities (#37), and compatibility concerns (e.g., macOS support in #33). There are also multiple reports of bugs related to PDF processing (#49, #47, #45).
#59: Docker support
#58: May you share evaluation results?
#57: HTTP demo suggest
#56: Describing diagrams and technical manuals
#55: Issue with RTX A6000 execution
#54: Cannot import olmocr modules
#53: Support of formattings (strikethroughs, etc.)
#51: sglang or vllm api interface
#50: CUDA-ooM with large PDFs
#49: SGlang does not meet expectations.
allenai/olmocr Repository
#44: chore: update preprocessing_molmo.py
Changes "seperator" to "separator" in the preprocessing_molmo.py file.
#26: Bump datasets from 3.0.0 to 3.2.0
Bumps the datasets dependency to version 3.2.0.
#25: Update aiohttp requirement from <3.11,>=3.10 to >=3.10,<3.12
Widens the aiohttp dependency to allow newer versions.
#22: Update mypy requirement from <1.5,>=1.0 to >=1.0,<1.14
Widens the mypy dependency to permit newer versions.
#20: Update sphinx requirement from <7.1.0,>=4.3.0 to >=4.3.0,<8.2.0
Updates the sphinx dependency.
#19 through #5 (Various Dependency Updates)
Updates involving sphinx-autodoc-typehints, sphinx-autobuild, and GitHub Actions workflows.
#28: Resolved Git checks and updated readme
#27: Molmo
Overall, while there is active development and maintenance within the repository, addressing the backlog of open PRs could enhance project stability and security moving forward.
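
To make the effect of a requirement change such as #25 concrete, the sketch below checks a few version numbers against the old and new aiohttp ranges. It is a deliberately simplified, standard-library-only checker written for illustration: it handles plain numeric versions and the five operators shown, and it does not implement the full Python version-specifier rules (pre-releases, wildcards, `~=`). The helper names are hypothetical; in practice a dedicated library such as `packaging` would normally be used.

```python
import operator

# Operators supported by this simplified checker; two-character symbols come first
# so that ">=" is not mistaken for ">".
_OPERATORS = [
    (">=", operator.ge),
    ("<=", operator.le),
    ("==", operator.eq),
    (">", operator.gt),
    ("<", operator.lt),
]


def _parse(version):
    """Turn '3.11.2' into (3, 11, 2); only plain numeric versions are handled."""
    return tuple(int(part) for part in version.strip().split("."))


def _pad(left, right):
    """Pad the shorter tuple with zeros so that 3.11 compares equal to 3.11.0."""
    width = max(len(left), len(right))
    return (left + (0,) * (width - len(left)),
            right + (0,) * (width - len(right)))


def satisfies(version, requirement):
    """Return True if version meets every comma-separated clause in requirement."""
    candidate = _parse(version)
    for clause in requirement.split(","):
        clause = clause.strip()
        for symbol, compare in _OPERATORS:
            if clause.startswith(symbol):
                left, right = _pad(candidate, _parse(clause[len(symbol):]))
                if not compare(left, right):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


old_range = "<3.11,>=3.10"   # requirement before #25
new_range = ">=3.10,<3.12"   # requirement proposed by #25

for version in ("3.10.5", "3.11.2", "3.12.0"):
    print(version, satisfies(version, old_range), satisfies(version, new_range))
```

Under these simplified rules, a 3.11.x release is rejected by the old range and accepted by the new one, which is the kind of compatibility gap that stays open while such PRs remain unmerged.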
olmocr/bench/convert.py
The use of importlib for dynamic imports is appropriate given the need to load methods dynamically.
parse_method_arg: Well-documented function that parses method configuration strings. It handles various data types and raises errors for incorrect formats, which is good practice.
process_pdfs: Asynchronous function that processes PDFs using specified methods. It handles both synchronous and asynchronous methods, which adds flexibility. The use of tqdm for progress indication is a nice touch for user feedback.
The script uses argparse to handle command-line arguments effectively. It dynamically builds a configuration dictionary for the specified methods, ensuring only available methods are used.
olmocr/bench/miners/automine.py
The use of syntok and google.genai suggests reliance on third-party libraries and services.
clean_base_sentence: Interacts with an external API to clean sentences. Error handling could be improved around the API call.
parse_sentences: Uses syntok to split text into sentences, preserving original formatting, which is crucial for OCR tasks.
compare_votes_for_file: Compares sentences from different sources, using a voting mechanism based on similarity scores. This function is well-structured but could benefit from more detailed comments explaining the logic.
The script uses argparse to manage input paths and execute the comparison process. It reads files from directories and processes them efficiently.
olmocr/eval/dolma_refine/aligners.py
HirschbergAligner and NeedlemanWunschAligner: Both classes extend a base aligner class and implement specific alignment algorithms. They are well-defined with customizable parameters.
olmocr/eval/dolma_refine/metrics.py
DocumentEditSimilarity and ParagraphEditSimilarity: These classes implement specific text metrics using alignment strategies. They are well-documented with clear method definitions.
Helper functions such as find_align_gaps are critical for metric calculations but could benefit from additional comments explaining their purpose.
olmocr/eval/dolma_refine/registry.py
olmocr/eval/dolma_refine/segmenters.py
pyproject.toml
README.md
.github/workflows/main.yml
scripts/infinigram_count.py
get_random_line_from_s3: Implements reservoir sampling efficiently but lacks error handling around S3 interactions.
query_infinigram: Sends requests to an external API; should improve exception handling around network calls.
process_document: Tokenizes documents using transformers; this function is complex but well-commented.

Overall, the source files demonstrate a high level of organization, adherence to best practices in Python programming, and thoughtful design patterns such as registries for extensibility. Some areas could benefit from improved error handling or additional comments for clarity.
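
For readers unfamiliar with the reservoir sampling mentioned for get_random_line_from_s3, the sketch below shows the single-item form of the technique: it picks one line uniformly at random from a stream of unknown length in one pass, holding only one line in memory. This is an illustration, not the repository's code; `reservoir_sample_line` is a hypothetical name, and the input is assumed to be any iterable of lines (a local file or a streamed object body) rather than a specific S3 client.

```python
import random


def reservoir_sample_line(lines, rng=None):
    """Return one line chosen uniformly at random from an iterable, or None if empty."""
    rng = rng or random.Random()
    chosen = None
    for index, line in enumerate(lines, start=1):
        # Keep the new line with probability 1/index. The first line is always
        # kept, and each later line replaces it with shrinking probability,
        # which leaves every line equally likely once the stream ends.
        if rng.randrange(index) == 0:
            chosen = line
    return chosen


sample = reservoir_sample_line(["alpha", "beta", "gamma"], rng=random.Random(0))
print(sample)
```

Because the stream is consumed lazily, a network failure partway through would surface inside the loop, so any error handling for remote reads belongs around the call or inside the iterator that supplies the lines.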
modeling_molmo.py
amanr/bench
kylel/elo branch

High Activity Level: The repository is under active development with frequent commits, especially by Jake Poznanski, indicating ongoing enhancements and bug fixes.
Focus Areas:
Independent Contributions with Integration: Team members appear to be working independently on their respective branches or tasks but integrate their work into the main branch frequently.
Documentation Updates: Regular updates to README files indicate an effort to keep documentation current with development progress.
Collaborative Problem Solving: There is evidence of collaborative efforts in resolving issues, particularly between Jake Poznanski and Aman Rangapur.
Overall, the development team is actively enhancing the olmOCR project with a focus on improving existing functionalities, adding new features, and maintaining high code quality through refactoring efforts.