GitHub Repo Analysis: Zipstack/unstract

Feb. 15, 2025, 3 p.m. UTC
This report was generated by Dispatch AI

Executive Summary

The "Unstract" project by Zipstack is an open-source, no-code platform for intelligent document processing (IDP 2.0), leveraging large language models (LLMs) to create APIs and ETL pipelines. It supports integration with major LLM providers and vector databases, offering tools like Prompt Studio and Workflow Studio for automating document processes. The project is actively developed with a significant community presence, as evidenced by its GitHub activity.

Cross-Platform Challenges: Users face setup issues on Windows and macOS (#1110, #239).
Version Upgrade Complications: Difficulties in upgrading while retaining data (#1093).
Integration Issues: Problems with document indexing and workflow execution (#1043, #595).
Active Development: High commit frequency and numerous branches indicate ongoing enhancements.

Recent Activity

Team Members and Activities

Hari John Kuriakose
- Authored PR #1134 to add issue template config; pending CLA status.
Gayathri
- Authored PR #1133 for SDK version update; lacks detailed explanations.
Chandrasekharan M
- Authored PRs #1132, #1131, #1129 focusing on model updates and execution enhancements.
Praveen Kumar
- Worked on version bumping for SDKs and dependency management.
Tahier Hussain
- Focused on frontend UI/UX improvements.

Recent Issues and PRs

#1110: Windows setup error; high priority.
#1093: Version upgrade complications; medium priority.
#1134: Open PR for issue template config; pending CLA.
#1133: Open PR for SDK version roll; quality checks passed.

Risks

Cross-Platform Compatibility: Persistent setup issues on Windows suggest documentation or support gaps.
Data Migration Complexity: Users struggle with version upgrades, indicating a need for better migration guides.
Integration Stability: Document indexing and workflow execution errors point to potential bugs in integrations with vector databases or LLMs.

Of Note

Pending CLA Status for PRs: Delays in merging due to unsigned Contributor License Agreements could hinder progress.
Dynamic Import Handling in Frontend: The use of dynamic imports suggests extensibility but may introduce performance considerations.
Extensive Branch Management: The high number of branches (97) reflects active development but requires careful management to avoid fragmentation.

Quantified Reports

Quantify issues

Recent GitHub Issues Activity

Timespan	Opened	Closed	Comments	Labeled	Milestones
7 Days	0	0	0	0	0
30 Days	3	1	8	1	1
90 Days	10	8	21	2	1
All Time	34	21	-	-	-

Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.
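
The Opened and Closed columns can be turned into a rough closed-to-opened ratio per timespan. The sketch below hard-codes the counts from the table above; the names `ISSUE_ACTIVITY` and `closure_rate` are illustrative, and the ratio is only indicative because issues closed in a timespan were not necessarily opened in it.

```python
from typing import Optional

# Opened/closed issue counts copied from the table above.
ISSUE_ACTIVITY = {
    "7 Days": (0, 0),
    "30 Days": (3, 1),
    "90 Days": (10, 8),
    "All Time": (34, 21),
}


def closure_rate(opened: int, closed: int) -> Optional[float]:
    """Return closed/opened as a percentage, or None when nothing was opened."""
    if opened == 0:
        return None
    return round(100 * closed / opened, 1)


rates = {span: closure_rate(*counts) for span, counts in ISSUE_ACTIVITY.items()}
for span, rate in rates.items():
    label = "n/a" if rate is None else f"{rate}%"
    print(f"{span}: {label}")
```

On these figures the ratio is about 33% over 30 days, 80% over 90 days, and about 62% all time.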

Rate pull requests

PR #1133 - SDK version roll to 0.57.0rc5 (open)

Rating: 2/5

Gayathri (gaya3-zipstack)
Created: 2025-02-14

The pull request is primarily a version bump for the SDK and related dependencies, which is a routine maintenance task. It lacks detailed documentation or testing notes, and does not address potential impacts on existing features. While necessary, it is not significant or complex enough to warrant a higher rating.

PR #1109 - Initial end to end test included in build workflow (open)

Rating: 3/5

Ritwik G (ritwik-g)
Created: 2025-01-31

This pull request introduces an initial end-to-end test using Selenium, which is a positive step towards automated testing and regression detection. However, the PR lacks detailed documentation and context in several sections, such as 'How', 'Can this PR break any existing features', and others, which are left empty. The changes are relatively minor, focusing on setting up a basic login test and integrating it into the build workflow. While it adds value by automating tests, the scope is limited to a single test case with no significant complexity or innovation. Therefore, it is an average contribution that could be improved with more comprehensive testing and documentation.

PR #1129 - feat: Updated execution time in WF and file execution models (open)

Rating: 3/5

Chandrasekharan M (chandrasekharan-zipstack)
Created: 2025-02-11

The pull request introduces a necessary feature by updating execution time calculations and adding data migrations, which are important for logging enhancements. However, it lacks differentiation between queued and actual execution times, as pointed out in the review comments. While the changes are functional and pass quality checks, they don't introduce significant improvements or innovations, making it an average PR with room for refinement.
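
The distinction raised in review, time spent queued versus time spent executing, can be sketched as follows. This is a minimal illustration, not the project's model: the class `ExecutionTimestamps` and its three timestamp fields are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ExecutionTimestamps:
    """Hypothetical timestamps for one workflow execution (not the project's model)."""

    created_at: datetime             # when the execution was queued
    started_at: Optional[datetime]   # when a worker picked it up
    finished_at: Optional[datetime]  # when processing ended

    def queued_time(self) -> Optional[timedelta]:
        """Time spent waiting before work began, if it has begun."""
        if self.started_at is None:
            return None
        return self.started_at - self.created_at

    def run_time(self) -> Optional[timedelta]:
        """Time spent actually executing, if the run has finished."""
        if self.started_at is None or self.finished_at is None:
            return None
        return self.finished_at - self.started_at


t0 = datetime(2025, 2, 11, 12, 0, 0)
ts = ExecutionTimestamps(t0, t0 + timedelta(seconds=30), t0 + timedelta(seconds=75))
print(ts.queued_time().total_seconds(), ts.run_time().total_seconds())
```

Reporting the two durations separately keeps a long queue from being mistaken for a slow execution.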

PR #1134 - Add issue template config (open)

Rating: 3/5

Hari John Kuriakose (hari-kuriakose)
Created: 2025-02-15

The pull request introduces an issue template configuration and updates the README for improved readability. While these changes are beneficial, they are relatively minor and do not significantly impact the codebase or functionality. The PR does not introduce any new features or critical fixes, nor does it pose any risk to existing features. It is a straightforward update with clear intentions but lacks substantial significance or complexity to warrant a higher rating.

PR #1089 - FEAT: Unified Notifications Feature Implementation (open)

Rating: 4/5

Tahier Hussain (tahierhussain)
Created: 2025-01-24

The pull request implements a significant feature by introducing a unified notifications system that enhances both backend and frontend functionalities. It addresses key issues with log and notification persistence, improves user experience through UI/UX enhancements, and ensures global availability of the logs component across all pages. The use of Redis for message storage is a robust choice, and the PR includes thoughtful refactoring to make the logs component more flexible and user-friendly. However, there are some concerns about potential breaking changes due to major refactoring, and the testing documentation is lacking. Overall, it is a well-executed and impactful change but not without some areas for improvement.

PR #1101 - [FEATURE] Remote storage flag removal for remote storage (open)

Rating: 4/5

Gayathri (gaya3-zipstack)
Created: 2025-01-29

The pull request effectively removes feature flag checks for remote storage, streamlining the codebase and preparing it for production. It includes comprehensive changes across multiple files, indicating a significant refactor. The PR also addresses unit tests and resolves merge conflicts, demonstrating thoroughness. However, the lack of detailed documentation or migration notes slightly detracts from its completeness.

PR #1126 - refactor: Dockerfile optimization to build faster (open)

Rating: 4/5

Chandrasekharan M (chandrasekharan-zipstack)
Created: 2025-02-10

The pull request offers a significant improvement by optimizing Dockerfile build times, achieving an 11.67% speed increase for the prompt-service. It effectively reorganizes Dockerfile instructions to leverage caching, which is a valuable enhancement for development efficiency. The changes are well-documented, and the potential for breaking existing features is minimal as it only involves reordering instructions. However, the PR lacks comprehensive testing across all services and does not provide measured improvements for all Dockerfiles, which slightly limits its impact. Overall, it's a well-executed refactor with room for additional validation.

PR #1125 - refactor: FE dockerfile optimized for better caching (open)

Rating: 4/5

Chandrasekharan M (chandrasekharan-zipstack)
Created: 2025-02-10

This pull request effectively optimizes the FE Dockerfile by utilizing a smaller base image and restructuring the build process to enhance caching, resulting in a notable 22.28% reduction in build time. The changes are well-documented and tested locally, ensuring no existing features are broken. While the improvements are significant and beneficial for build efficiency, the PR is relatively straightforward and lacks broader impact beyond the specific optimization. Thus, it merits a rating of 4 for being quite good but not exemplary.

PR #1132 - feat: Updated status enum to use TextChoices for 3 models (open)

Rating: 4/5

Chandrasekharan M (chandrasekharan-zipstack)
Created: 2025-02-13

The pull request effectively updates the status field for several models to use Django's TextChoices, improving code clarity and consistency. It includes necessary schema migrations and addresses previous misuse of enums, enhancing maintainability. The changes are well-documented, and testing has been conducted on existing APIs. However, the PR is dependent on other PRs (#1129 and #1131), which could complicate the merge process. Additionally, while it introduces significant improvements, it does not represent a groundbreaking change, hence a rating of 4 is appropriate.
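
To show the general shape of a string-valued status enum without requiring Django, the sketch below uses the standard library's `enum` as a stand-in for `TextChoices`. It is not the PR's code: the class name and member names are hypothetical, and `choices` is written here as a classmethod, whereas Django exposes it as a class-level property.

```python
from enum import Enum
from typing import List, Tuple


class ExecutionStatus(str, Enum):
    """Stdlib stand-in for a TextChoices-style status enum (hypothetical members)."""

    PENDING = "PENDING"
    EXECUTING = "EXECUTING"
    COMPLETED = "COMPLETED"
    ERROR = "ERROR"

    @classmethod
    def choices(cls) -> List[Tuple[str, str]]:
        # (value, label) pairs, similar in spirit to what a model field's
        # `choices` argument expects; the label is derived from the member name.
        return [(member.value, member.name.title()) for member in cls]


# Because of the str mixin, members compare equal to their plain string values.
print(ExecutionStatus.COMPLETED == "COMPLETED")
print(ExecutionStatus.choices())
```

Keeping the allowed values in one enum means a typo in a status string fails at lookup time (`ExecutionStatus("DONE")` raises `ValueError`) instead of being silently stored.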

PR #1131 - feat: Added support for total files and processing status in execution (open)

Rating: 4/5

Chandrasekharan M (chandrasekharan-zipstack)
Created: 2025-02-13

The pull request introduces a significant feature by adding support for tracking the total number of files and their processing status in workflow executions. This enhancement provides valuable insights into execution progress, which can be beneficial for monitoring and debugging. The implementation includes schema migrations, updates to serializers, and necessary changes in the workflow execution logic. The changes are well-documented and tested locally, ensuring that new records compute values as expected. However, the PR could benefit from more extensive testing details or additional unit tests to ensure robustness. Overall, it's a well-executed and meaningful improvement to the project.
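
A minimal sketch of the kind of bookkeeping described, counting total, successful, and failed files for one execution, might look like this. The class `FileProgress` and its field names are assumptions made for illustration and are not taken from the PR.

```python
from dataclasses import dataclass


@dataclass
class FileProgress:
    """Hypothetical per-execution counters (field names are illustrative)."""

    total_files: int
    successful_files: int = 0
    failed_files: int = 0

    @property
    def pending_files(self) -> int:
        """Files not yet counted as succeeded or failed."""
        return self.total_files - self.successful_files - self.failed_files

    def record(self, succeeded: bool) -> None:
        """Count one finished file, refusing to exceed total_files."""
        if self.pending_files == 0:
            raise ValueError("all files already accounted for")
        if succeeded:
            self.successful_files += 1
        else:
            self.failed_files += 1

    def summary(self) -> str:
        done = self.successful_files + self.failed_files
        return f"{done}/{self.total_files} processed, {self.failed_files} failed"


progress = FileProgress(total_files=3)
for ok in (True, False, True):
    progress.record(ok)
print(progress.summary())
```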

Quantify commits

Quantified Commit Activity Over 14 Days

Developer	Branches	PRs	Commits	Files	Changes
Hari John Kuriakose	2	0/0/0	2	4	4796
harini-venkataraman	3	2/2/1	6	23	3726
pre-commit-ci[bot]	4	0/0/0	6	11	3553
Praveen Kumar	3	2/1/1	8	19	1853
Chandrasekharan M	5	8/5/0	20	60	1522
Gayathri	4	3/2/0	16	48	660
Tahier Hussain	3	5/4/1	8	7	167
Athul	1	1/1/0	1	2	74
Hari John Kuriakose (hari-kuriakose)	1	1/0/0	1	1	39
vishnuszipstack	1	0/0/0	3	5	16
Ritwik G	2	0/0/0	3	5	13
jagadeeswaran-zipstack	1	1/1/0	1	1	2
Deepak K	0	0/0/0	0	0	0

PRs: created by that dev and opened/merged/closed-unmerged during the period

Quantify risks

Project Risk Ratings

Risk	Level (1-5)	Rationale
Delivery	3	The project faces moderate delivery risks due to several unresolved issues and dependencies. Issues such as #1110 and #239 highlight challenges in cross-platform compatibility, which could delay delivery if not addressed. The backlog of unresolved issues and the complexity of certain pull requests, like the unified notifications system (#1089), further contribute to this risk. Additionally, the dependency on external services like Redis and Azure introduces potential points of failure that could impact delivery timelines.
Velocity	3	The project's velocity is moderate, with a healthy level of commit activity but some bottlenecks in integration. Developers like Chandrasekharan M show high engagement, but the number of pull requests still open and unmerged suggests potential delays in code review or integration processes. The varied levels of contribution among developers and reliance on automated tools like pre-commit-ci[bot] also pose risks to maintaining consistent velocity.
Dependency	4	The project has significant dependency risks due to its reliance on multiple external services and libraries. The use of Redis for message storage, Azure for cloud services, and various SDK updates introduce potential points of failure. The frequent version bumps in dependencies without comprehensive testing notes exacerbate this risk, as any issues with these dependencies could lead to failures in the system.
Team	3	The team faces moderate risks related to communication and issue management. The low level of comments on issues and minimal labeling suggest potential communication challenges. However, the active maintenance and resolution of certain pull requests indicate effective collaboration in some areas. The presence of unresolved issues and bottlenecks in code review processes could lead to team burnout or conflict if not managed properly.
Code Quality	3	Code quality is at moderate risk due to the complexity of recent pull requests and the lack of detailed testing documentation. While there are efforts to improve code quality through refactoring and optimizations, the absence of comprehensive testing details increases the likelihood of undetected bugs. Issues like document indexing (#1043) and workflow execution errors (#595) further highlight underlying stability problems that need addressing.
Technical Debt	4	Technical debt is a significant risk due to recurring issues and complex migrations without automation. The backlog of unresolved issues and the need for schema migrations in several pull requests indicate accumulating technical debt. The reliance on manual processes for version upgrades and data migration (#1093) further contributes to this risk, potentially delaying delivery timelines if not addressed efficiently.
Test Coverage	4	Test coverage poses a significant risk as many pull requests lack detailed testing notes or documentation. This gap increases the likelihood of undetected bugs affecting delivery timelines and code stability. The reliance on automated tools for maintaining code quality without comprehensive test coverage exacerbates this risk, as seen with recent SDK updates lacking thorough testing.
Error Handling	3	Error handling is at moderate risk due to insufficient documentation and reliance on try-catch blocks for dynamic imports. While there are efforts to manage errors through hooks like useExceptionHandler, the lack of explicit error handling frameworks or libraries raises concerns about the effectiveness of error management strategies. Unresolved issues related to error handling (#1044) further highlight areas needing improvement.

Detailed Reports

Report On: Fetch issues

Recent Activity Analysis

Recent GitHub issue activity for the Zipstack/unstract project shows a mix of bug reports, feature requests, and user inquiries. Notably, there are several issues related to installation and setup challenges, particularly on Windows and macOS environments (#1110, #239). There are also recurring themes around version upgrades and data migration (#1093), as well as issues with specific functionalities like document indexing and workflow execution (#1043, #595). The project seems to be actively maintained, with responses from developers addressing user concerns and providing updates or workarounds.

Notable Issues

Windows Setup Issues (#1110): Users face challenges setting up the platform on Windows, indicating potential gaps in cross-platform support or documentation.
Version Upgrade Complications (#1093): Users encounter difficulties when upgrading versions while retaining existing data, highlighting the need for clearer migration guides.
Indexing Problems (#1043): Issues with document indexing suggest possible bugs in the integration with vector databases like Qdrant.
Workflow Execution Errors (#595): Errors during workflow execution point to potential stability issues in the API or ETL pipeline processes.

Themes and Commonalities

Cross-Platform Compatibility: Several issues indicate challenges with running the platform on different operating systems, suggesting a need for improved cross-platform support.
Data Migration and Versioning: Users frequently report issues related to upgrading versions and migrating data, indicating a need for streamlined processes or better documentation.
Integration Challenges: Problems with integrating external services like LLMs or vector databases are common, pointing to potential areas for improving compatibility or error handling.

Issue Details

Most Recently Created Issues

#1110: "fix: Error running in windows" - Priority: High; Status: Open; Created 15 days ago; Updated 2 days ago.
#1093: "How to do a version upgrade and keep the existing projects" - Priority: Medium; Status: Open; Created 19 days ago; Updated 10 days ago.

Most Recently Updated Issues

#1110: "fix: Error running in windows" - Updated 2 days ago.
#1044: "| filepath | function | $$\textcolor{#23d18b}{\tt{passed}}$$ | SUBTOTAL |" - Priority: Low; Status: Stale; Created 45 days ago; Updated 11 days ago.

The ongoing activity suggests a responsive development team actively engaging with user-reported issues. However, certain recurring themes such as setup difficulties and integration challenges indicate areas where further improvements could enhance user experience.

Report On: Fetch pull requests

Analysis of Pull Requests for Zipstack/unstract

Open Pull Requests

1. #1134: Add issue template config

State: Open
Created by: Hari John Kuriakose
Created: 0 days ago
Summary:
- Introduces an issue template configuration to disable blank issue creation and improve readability.
- The PR does not break existing features and has passed quality checks.
- Notable Issues: The Contributor License Agreement (CLA) status is pending, which might delay merging.

2. #1133: SDK version roll to 0.57.0rc5

State: Open
Created by: Gayathri
Created: 1 day ago
Summary:
- Updates the SDK version to 0.57.0rc5.
- The PR is straightforward but lacks detailed explanations in the "How" section.
- Quality checks have passed, indicating no immediate issues.

3. #1132: feat: Updated status enum to use TextChoices for 3 models

State: Open
Created by: Chandrasekharan M
Created: 2 days ago
Summary:
- Updates status fields for several models to use TextChoices, improving consistency.
- Involves schema migrations and depends on another PR (#1131).
- Quality checks have passed, but review comments suggest further code cleanup.

4. #1131: feat: Added support for total files and processing status in execution

State: Open
Created by: Chandrasekharan M
Created: 2 days ago
Summary:
- Adds fields for tracking total, successful, and failed files in execution models.
- Includes schema migrations and is based on another PR (#1129).
- Quality checks have passed, but there are discussions about potential improvements in code logic.

5. #1129: feat: Updated execution time in WF and file execution models

State: Open
Created by: Chandrasekharan M
Created: 4 days ago
Summary:
- Updates execution time calculations and adds data migrations for older records.
- Quality checks have passed, but newly detected issues remain open.

Recently Closed Pull Requests

Notable Closed PRs:

#1128: Version bump for unstract-sdk 0.57.0rc4
- Merged successfully after addressing review comments regarding version updates across multiple files.
#1127: [FIX] Pass the right env value to the tool container while spawning
- Addressed an issue with incorrect environment variable paths during tool container spawning.
- Successfully merged after ensuring correct file paths were used.
#1126: refactor: Dockerfile optimization to build faster
- Optimized Dockerfiles for faster builds, achieving an approximately 11.67% speed improvement.
- Successfully merged after resolving SonarCloud issues.
#1116: hotfix: UN-2107 MIME type validation for large files
- Fixed MIME type detection issues for large JSON files.
- Successfully merged as a hotfix after ensuring correct MIME type handling.

Observations and Recommendations

Pending CLA Status:
- For #1134, ensure that the Contributor License Agreement is signed to avoid delays in merging.
Code Cleanup:
- Several open PRs (#1132, #1131) have review comments suggesting code cleanup and refactoring opportunities. Addressing these could improve code quality and maintainability.
Version Management:
- Recent closed PRs involved version bumps (e.g., #1128). Ensure consistent version management across all components to prevent dependency conflicts.
Testing and Quality Checks:
- Most PRs have passed quality checks, indicating a robust CI/CD pipeline. Continue maintaining this standard to ensure high-quality code integration.
Documentation:
- Ensure that all changes, especially those involving migrations or significant feature additions (e.g., #1131), are well-documented both in the codebase and external documentation if applicable.

Overall, the project appears to be actively maintained with a focus on continuous improvement and optimization, as evidenced by recent PR activities.

Report On: Fetch Files For Assessment

Source Code Assessment

`backend/pdm.lock`

Purpose: This file is auto-generated by PDM and contains the exact versions of all dependencies used in the project. It ensures consistent environments across different setups.
Structure: The file is structured in TOML format, listing each package with its version, Python compatibility, summary, groups, dependencies, and associated files with hashes for integrity verification.
Quality: The file is comprehensive and well-organized, typical of a lock file. It provides detailed dependency management, ensuring reproducibility of the environment.
Observations:
- The presence of both default and dev groups indicates a separation between runtime and development dependencies.
- Multiple versions of some packages are listed with different extras or conditions, which may indicate complex dependency requirements.

`backend/pyproject.toml`

Purpose: Defines the project's metadata, dependencies, and configuration for PDM.
Structure: Contains sections for build-system requirements, project metadata (name, version, description), dependencies, Python version constraints, and development dependencies.
Quality: The file is concise and follows the standard TOML structure for Python projects. It clearly defines the project's setup requirements and additional tools needed for development and testing.
Observations:
- The use of local dependencies with relative paths suggests a modular project structure.
- The Python version constraint (>=3.9,<3.11.1) is specific, likely due to compatibility issues with newer versions.
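
The effect of the `>=3.9,<3.11.1` constraint can be checked with plain tuple comparison. This is a standard-library sketch; the helper name `is_supported` is illustrative and not part of the project.

```python
import sys
from typing import Tuple

LOWER = (3, 9)      # inclusive bound: >=3.9
UPPER = (3, 11, 1)  # exclusive bound: <3.11.1


def is_supported(version: Tuple[int, ...]) -> bool:
    """Return True if `version` satisfies >=3.9,<3.11.1."""
    return LOWER <= tuple(version[:3]) < UPPER


# Check the interpreter running this script.
print(is_supported(sys.version_info[:3]))
```

Under this constraint 3.11.0 is accepted while 3.11.1 and every later release are rejected.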

`backend/sample.env`

Purpose: Provides a template for environment variables required to configure the application.
Structure: Key-value pairs defining various configuration settings like database credentials, API keys, service URLs, etc.
Quality: The file is well-organized with comments explaining certain configurations. It uses placeholders for sensitive information.
Observations:
- Security-sensitive values like DJANGO_SECRET_KEY are included as placeholders; care should be taken to ensure these are not hard-coded in production environments.
- The file covers a wide range of configurations indicating a complex application setup.

`frontend/src/components/navigations/side-nav-bar/SideNavBar.jsx`

Purpose: Implements the side navigation bar component for the frontend using React.
Structure: Utilizes React hooks and Ant Design components to create a dynamic sidebar with menu items based on user session data.
Quality: The code is modular and leverages React's component-based architecture effectively. It includes error handling for optional plugins.
Observations:
- Dynamic imports for plugins suggest extensibility in the navigation system.
- The use of hooks like useMemo optimizes performance by avoiding unnecessary recalculations.

`frontend/src/components/navigations/top-nav-bar/TopNavBar.jsx`

Purpose: Implements the top navigation bar component for the frontend using React.
Structure: Similar to the side nav bar, it uses React hooks and Ant Design components to manage user interactions and display session information.
Quality: The code is cleanly written with logical separation of concerns. It handles various user roles and states effectively.
Observations:
- Conditional rendering based on user roles enhances UX by showing relevant options only.
- Error handling for plugin availability is consistent with best practices.

`prompt-service/src/unstract/prompt_service/helper.py`

Purpose: Contains helper functions related to prompt processing in the backend service.
Structure: Includes functions for plugin loading, context cleaning, prompt construction, and execution logic.
Quality: The code is well-documented with clear function definitions. It shows good use of Python typing hints for better readability and maintenance.
Observations:
- Use of global variables (plugins) could be reconsidered for thread safety if this service is multi-threaded or multi-process.
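
One common way to make a module-level plugin registry safe under multiple threads is to guard it with a lock. The sketch below is a generic illustration, not the service's code: `_plugins`, `_plugins_lock`, and `get_plugin` are hypothetical names.

```python
import threading
from typing import Callable, Dict, List

_plugins: Dict[str, object] = {}
_plugins_lock = threading.Lock()


def get_plugin(name: str, loader: Callable[[], object]) -> object:
    """Return the cached plugin, running `loader` at most once per name."""
    with _plugins_lock:
        if name not in _plugins:
            _plugins[name] = loader()
        return _plugins[name]


calls: List[int] = []


def load_demo() -> object:
    # Stand-in for an expensive plugin import or initialisation.
    calls.append(1)
    return object()


threads = [threading.Thread(target=get_plugin, args=("demo", load_demo)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls))
```

A lock only coordinates threads within one process. With multiple worker processes, each process holds its own copy of the registry, so the loader runs once per process.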

`prompt-service/src/unstract/prompt_service/main.py`

Purpose: Main application logic for handling prompt-related requests in a Flask app.
Structure: Defines routes and middleware for authentication, request logging, error handling, and prompt processing logic.
Quality: The code follows Flask conventions well. It includes robust error handling and logging mechanisms.
Observations:
- Decorators are used effectively to manage cross-cutting concerns like authentication and logging.
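
The decorator pattern mentioned here can be illustrated without Flask. The sketch below is framework-free and entirely hypothetical (`require_token`, `AuthError`, and `answer_prompt` are invented for the example); it only shows how authentication and logging can wrap a handler without touching its body.

```python
import functools
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("prompt_service_sketch")


class AuthError(Exception):
    """Raised when a request carries no valid token."""


def require_token(expected: str) -> Callable:
    """Build a decorator that rejects requests whose token does not match."""

    def decorator(handler: Callable[[Dict[str, Any]], Any]) -> Callable:
        @functools.wraps(handler)  # keep the handler's name and docstring
        def wrapper(request: Dict[str, Any]) -> Any:
            if request.get("token") != expected:
                raise AuthError("invalid or missing token")
            logger.info("handling %s", handler.__name__)
            return handler(request)

        return wrapper

    return decorator


@require_token("secret")
def answer_prompt(request: Dict[str, Any]) -> Dict[str, Any]:
    return {"echo": request["prompt"]}


print(answer_prompt({"token": "secret", "prompt": "hi"}))
```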

`tools/classifier/src/config/properties.json`

Purpose: Configuration file defining properties for a file classification tool.
Structure: JSON format specifying schema version, display name, function name, tool version, input/output descriptions, adapter settings, I/O compatibility, and restrictions.
Quality: Well-defined structure providing clear configuration details necessary for tool integration.
Observations:
- The configuration supports multiple adapters but currently enables only specific ones (e.g., text extractors).

`tools/text_extractor/src/config/properties.json`

Purpose: Configuration file defining properties for a text extraction tool.
Structure: Similar to the classifier configuration but tailored to text extraction specifics like input/output formats and adapter settings.
Quality: Consistent with other tool configurations in structure and detail level.
Observations:
- Restrictions on file size indicate consideration of performance or resource constraints.

`unstract/tool-registry/tool_registry_config/public_tools.json`

Purpose: Registry configuration listing public tools available in the platform with their properties and specifications.
Structure: JSON format detailing each tool's UID, properties (similar to individual tool configs), specifications, variables, icons, and Docker image details.
Quality: Comprehensive registry allowing easy extension or modification of available tools within the platform.
Observations:
- Use of SVG icons embedded as strings suggests flexibility in UI representation but could be optimized by referencing external assets.

Overall, the codebase demonstrates strong adherence to modern software engineering practices with clear separation of concerns across components. Dependency management through PDM ensures reproducibility while environment configurations are handled securely through .env files. Both frontend and backend components show thoughtful design considerations around extensibility and maintainability.

Report On: Fetch commits

Development Team and Recent Activity

Team Members and Their Recent Activities

Praveen Kumar (pk-zipstack)
- Worked on version bumping for SDKs and tools, including updates to pdm.lock files.
- Made changes across multiple branches, focusing on dependency management and integration.
- Active in 3 branches with 8 commits affecting 19 files.
Tahier Hussain (tahierhussain)
- Focused on frontend improvements, including UI/UX enhancements and dynamic content handling.
- Contributed to the unification of notifications and subscription plugin fixes.
- Active in 3 branches with 8 commits affecting 7 files.
Gayathri (gaya3-zipstack)
- Involved in fixing issues related to remote storage and SDK version rollouts.
- Addressed unit test failures and pre-commit issues.
- Active in 4 branches with 16 commits affecting 48 files.
Chandrasekharan M (chandrasekharan-zipstack)
- Worked on backend optimizations, including Dockerfile improvements and execution model updates.
- Enhanced workflow execution models with additional fields and enums.
- Active in 5 branches with 20 commits affecting 60 files.
Harini Venkataraman (harini-venkataraman)
- Focused on passing execution sources and dependency version bumps.
- Contributed to backend reload support and integration fixes.
- Active in 3 branches with 6 commits affecting 23 files.
Athul (athul-rs)
- Made a single commit related to subscription usage support.
- Active in 1 branch with changes affecting 2 files.
Jagadeeswaran (jagadeeswaran-zipstack)
- Fixed UI label issues in the frontend.
- Active in 1 branch with changes affecting 1 file.
Hari John Kuriakose (hari-kuriakose)
- Updated echo commands for preserving special characters in reports.
- Active in 1 branch with changes affecting 1 file.
Ritwik G (ritwik-g)
- Focused on initial E2E test setup, including Docker setup adjustments.
- Active in 2 branches with changes affecting 5 files.
Vishnu S (vishnuszipstack)
- Worked on prompt studio enhancements, including line number additions.
- Active in 1 branch with changes affecting 5 files.
pre-commit-ci[bot]
- Automated updates for pre-commit hooks and dependency management.
- Active in 4 branches with 6 commits affecting 11 files.

Patterns, Themes, and Conclusions

Version Management: There is a strong focus on updating SDK versions and managing dependencies across multiple services, indicating ongoing maintenance and improvement efforts.
Frontend Enhancements: Several team members are actively working on improving the user interface and experience, suggesting a priority on usability and user engagement.
Backend Optimization: Efforts are being made to optimize backend processes, including Dockerfile optimizations for better caching and execution model enhancements for improved performance tracking.
Collaboration: Many commits are co-authored or involve merging from other branches, highlighting collaborative efforts within the team to integrate changes effectively.
Testing and Integration: There is an emphasis on testing, as seen from the setup of E2E tests and fixing unit test failures, ensuring reliability of the platform.

Overall, the development team is actively engaged in maintaining the platform's robustness while enhancing its features for better user experience and performance.