The Dispatch

GitHub Repo Analysis: pathwaycom/pathway


Executive Summary

The Pathway project, managed by pathwaycom on GitHub, is a Python-based ETL framework designed for stream processing and real-time analytics. It combines Python's ease of use with Rust's performance, making it suitable for both development and production environments. The project is under active development, with a focus on enhancing performance and expanding integration capabilities.

Recent Activity

Team Members and Activities

  1. Kamil Piechowiak (KamilPiechowiak)

    • Recent work includes append-only UDFs and operator persistence improvements.
  2. Sergey Kulik (zxqfd555-pw)

    • Focused on test muting for investigation and performance improvements.
  3. Pathway-Dev

    • Regularly refreshes Pathway examples.
  4. Szymon Dudycz (szymondudycz)

    • Updated documentation and marked versions as dynamic in pyproject.toml.
  5. Berkecan Rizai (berkecanrizai)

    • Added Jenkins CI for RAG evaluations and enhanced BaseRAGQA functionalities.
  6. Pawel Podhajski (pw-ppodhajski)

    • Released Pathway version updates.

Patterns and Themes

Risks

Of Note

Quantified Reports

Quantify issues



Recent GitHub Issues Activity

Timespan | Opened | Closed | Comments | Labeled | Milestones
7 Days | 1 | 0 | 3 | 0 | 1
30 Days | 2 | 0 | 3 | 0 | 1
90 Days | 5 | 2 | 10 | 0 | 1
1 Year | 67 | 35 | 194 | 1 | 1
All Time | 69 | 36 | - | - | -

Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.
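
The closure rate cited later in the risk ratings can be reproduced from the table above. The sketch below assumes the rate is simply closed issues divided by opened issues within each timespan; the function name `closure_rate` is illustrative and not part of any reporting tool.

```python
# Issue counts (opened, closed) copied from the issues activity table above.
ISSUE_ACTIVITY = {
    "7 Days": (1, 0),
    "30 Days": (2, 0),
    "90 Days": (5, 2),
    "1 Year": (67, 35),
    "All Time": (69, 36),
}


def closure_rate(opened: int, closed: int) -> float:
    """Return closed issues as a percentage of opened issues (0.0 if none were opened)."""
    if opened == 0:
        return 0.0
    return 100.0 * closed / opened


for timespan, (opened, closed) in ISSUE_ACTIVITY.items():
    print(f"{timespan:>8}: {closure_rate(opened, closed):.1f}% closed")
```

For the one-year window this gives roughly 52%, matching the figure used in the Delivery risk rationale.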

Rate pull requests



2/5
The pull request primarily involves switching the runner for GitHub Actions workflows, which is a maintenance task rather than a feature or bug fix. It lacks detailed context or testing information, and the checklist is incomplete. The changes are minor, with more deletions than additions, indicating a simplification but not a significant improvement. The lack of related issues or thorough documentation further diminishes its impact.
4/5
This pull request introduces a new feature by integrating Ruff, a fast and comprehensive Python linter and code formatter, into the GitHub Actions workflow. The change is significant as it enhances the project's code quality assurance process by replacing multiple existing tools with a single, more efficient one. The implementation is straightforward and well-contained within a new YAML file, ensuring minimal disruption to existing workflows. However, the PR lacks detailed testing information and does not address documentation updates or changelog modifications, which are important for transparency and future maintenance.

Quantify commits



Quantified Commit Activity Over 14 Days

Developer | Branches | PRs | Commits | Files | Changes
berkecanrizai | 1 | 0/0/0 | 3 | 26 | 2343
Pathway-Dev | 1 | 0/0/0 | 14 | 38 | 1724
Kamil Piechowiak | 1 | 0/0/0 | 4 | 20 | 1361
Sergey Kulik | 1 | 0/0/0 | 4 | 13 | 271
Pawel Podhajski | 1 | 0/0/0 | 1 | 3 | 6
Szymon Dudycz | 1 | 0/0/0 | 2 | 2 | 3
Christian Clauss (cclauss) | 0 | 1/0/0 | 0 | 0 | 0
Sarim Ahmed (sarim2000) | 0 | 0/0/1 | 0 | 0 | 0
Vishwanath Martur (vishwamartur) | 0 | 0/0/1 | 0 | 0 | 0

PRs: created by that dev and opened/merged/closed-unmerged during the period
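
To see how concentrated the 14-day activity is, each developer's share of total changes can be derived from the commit table. This is a rough sketch: the "Changes" column is taken at face value, and `change_shares` is an illustrative helper, not part of the report's tooling.

```python
# (commits, files, changes) per developer, copied from the commit activity table above.
COMMIT_ACTIVITY = {
    "berkecanrizai": (3, 26, 2343),
    "Pathway-Dev": (14, 38, 1724),
    "Kamil Piechowiak": (4, 20, 1361),
    "Sergey Kulik": (4, 13, 271),
    "Pawel Podhajski": (1, 3, 6),
    "Szymon Dudycz": (2, 2, 3),
}


def change_shares(activity: dict) -> dict:
    """Return each developer's percentage share of all changes in the period."""
    total = sum(changes for _, _, changes in activity.values())
    return {dev: 100.0 * changes / total for dev, (_, _, changes) in activity.items()}


shares = change_shares(COMMIT_ACTIVITY)
# Combined share of the three most active accounts, by changes.
top_three = sum(sorted(shares.values(), reverse=True)[:3])

for dev, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{dev:<18} {share:5.1f}%")
print(f"Top three combined: {top_three:.1f}%")
```

By this measure the three most active accounts account for about 95% of changes, which is the concentration the Team and Dependency risk rationales refer to.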

Quantify risks



Project Risk Ratings

Risk | Level (1-5) | Rationale
Delivery | 4 | The project faces significant delivery risks due to a backlog of unresolved issues, as evidenced by the 52% closure rate of issues over the past year. The prolonged open status of PR #68, which aims to switch to GitHub runners, further exacerbates these risks by delaying CI/CD improvements essential for efficient workflows. Additionally, discrepancies in issue and pull request tracking indicate potential misallocation of resources based on non-existent problems, affecting delivery timelines.
Velocity | 4 | Velocity is at risk due to bottlenecks in the review process, as indicated by the lack of pull request activity despite active commit contributions. The prolonged open status of critical PRs like #68 suggests delays in integrating essential changes. Additionally, the low number of milestones and unresolved critical issues like data integrity crashes (#73) highlight potential impediments to maintaining a steady development pace.
Dependency | 3 | Dependency risks are moderate, with efforts to broaden data source compatibility through features like the Apache Iceberg input connector (#77). However, unresolved issues related to S3-compatible services (#71) and prolonged PR statuses suggest potential challenges in managing external dependencies effectively. The reliance on key individuals for significant contributions also poses a risk if these contributors become unavailable.
Team | 3 | The team faces moderate risks related to coordination and communication, as indicated by the low number of milestones and potential misallocation of resources due to discrepancies in issue and pull request tracking. The concentrated activity by a few developers suggests dependency on key individuals, which could lead to burnout or bottlenecks if not managed properly.
Code Quality | 3 | Code quality is moderately at risk due to incomplete documentation and testing information in significant PRs like #79. While the introduction of Ruff enhances code quality assurance, the lack of detailed context and testing raises concerns about maintainability. Additionally, discrepancies in reported pull requests suggest potential oversight in ensuring thorough integration and review processes.
Technical Debt | 3 | Technical debt is a moderate concern. Ongoing optimization and modularization efforts help manage complexity, but unresolved critical issues like data integrity crashes (#73) and the absence of detailed testing information in commits indicate areas where technical debt could accumulate if not addressed promptly.
Test Coverage | 4 | Test coverage is at risk due to the lack of detailed testing information in significant PRs like #79 and unresolved critical issues such as data integrity crashes (#73). The absence of comprehensive testing practices for new features and changes could lead to undetected bugs and regressions, impacting system reliability.
Error Handling | 4 | Error handling is at risk due to unresolved critical issues like data integrity crashes (#73) that highlight gaps in error management strategies. While there are mechanisms for error logging and reporting, the reliance on hardcoded expected error messages in tests suggests potential maintenance challenges if underlying logic changes. This could affect system reliability if not addressed effectively.
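
The concern about hardcoded expected error messages can be illustrated with a small sketch. Everything here is hypothetical: `DataIntegrityError` and `load_snapshot` are invented for illustration and are not Pathway APIs. The point is only that a test comparing the full message string breaks whenever the wording changes, while a test that checks structured attributes does not.

```python
class DataIntegrityError(Exception):
    """Hypothetical error type carrying structured fields alongside its message."""

    def __init__(self, key: str, reason: str) -> None:
        super().__init__(f"integrity check failed for key {key!r}: {reason}")
        self.key = key
        self.reason = reason


def load_snapshot(key: str, expected_checksum: int, actual_checksum: int) -> str:
    """Hypothetical loader that rejects a snapshot whose checksum does not match."""
    if expected_checksum != actual_checksum:
        raise DataIntegrityError(key, "checksum mismatch")
    return key


def check_brittle() -> bool:
    """Brittle style: passes only while the message wording stays exactly the same."""
    try:
        load_snapshot("orders", 1, 2)
    except DataIntegrityError as exc:
        return str(exc) == "integrity check failed for key 'orders': checksum mismatch"
    return False


def check_robust() -> bool:
    """More robust style: asserts on the exception type and its structured fields."""
    try:
        load_snapshot("orders", 1, 2)
    except DataIntegrityError as exc:
        return exc.key == "orders" and exc.reason == "checksum mismatch"
    return False
```

Both checks pass today; only the second would survive a rewording of the message template.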

Detailed Reports

Report On: Fetch issues



Recent Activity Analysis

Recent GitHub issue activity for the Pathway project includes a mix of enhancements, bug reports, and user questions. Notable issues include requests for new features like Apache Iceberg input connectors (#77) and improvements such as enabling Link-Time Optimization (LTO) for release builds (#75). There are also several bug reports, including data integrity issues with persistency (#73) and support for S3-compatible services other than AWS (#71).

A significant theme among the issues is the focus on enhancing performance and expanding connector support, indicating ongoing efforts to improve the framework's scalability and integration capabilities. Additionally, there are several user questions and requests for documentation improvements, highlighting areas where users seek more clarity or functionality.

Issue Details

  • #77: Add Apache Iceberg input connector

    • Priority: Enhancement
    • Status: Open
    • Created: 18 days ago
    • Notable for its potential to expand the framework's data source compatibility.
  • #75: Enable Link-Time Optimization (LTO) for Release builds

    • Priority: Enhancement
    • Status: Open
    • Created: 31 days ago, Edited: 27 days ago
    • Significant due to its implications for performance optimization.
  • #73: [Bug]: Persistency causes data integrity issues, leading to a crash

    • Priority: Bug
    • Status: Open
    • Created: 67 days ago
    • Critical as it affects data reliability in streaming applications.
  • #71: [Bug]: Support for S3-compatible services other than AWS

    • Priority: Bug
    • Status: Open
    • Created: 84 days ago, Edited: 83 days ago
    • Important for users relying on non-AWS S3 services.
  • #69: How to retrieve the whole document for a chunk?

    • Priority: Question
    • Status: Open
    • Created: 123 days ago, Edited: 117 days ago
    • Reflects user interest in optimizing data retrieval processes.

These issues reflect ongoing development efforts to enhance Pathway's functionality and address user concerns, particularly around performance and integration with external systems.

Report On: Fetch pull requests



Analysis of Pull Requests for Pathway Project

Open Pull Requests

PR #79: GitHub Actions: Lint Python code with Ruff

  • State: Open
  • Created: 2 days ago by Christian Clauss
  • Description: This PR introduces Ruff, a fast Python linter and code formatter written in Rust, to the project's GitHub Actions. Ruff can replace multiple tools like Flake8, Black, isort, etc., offering over 800 linting rules.
  • Notable Points:
    • The introduction of Ruff could significantly streamline the code linting process, making it faster and more efficient.
    • The checklist for this PR is incomplete; the author needs to ensure that the code follows the project's style and update the documentation if necessary.
    • This PR is relatively new and may require further review and testing before merging.

PR #68: chore: switch to Github runner

  • State: Open
  • Created: 160 days ago by Pawel Podhajski
  • Description: This PR aims to switch the project to use GitHub runners for CI/CD processes.
  • Notable Points:
    • The PR has been open for a significant amount of time (160 days), indicating potential issues or low priority.
    • There are no comments or updates on testing or related issues, which suggests a lack of progress or interest in this change.

Closed Pull Requests

PR #76: Allow setting query transformers in the BaseRAGQA

  • State: Closed (Not Merged)
  • Created by: Vishwanath Martur
  • Description: This PR proposed adding query transformation behavior to the BaseRAGQuestionAnswerer initialization. It was closed due to ongoing refactoring in that part of the app.
  • Notable Points:
    • The closure without merging is consistent with the ongoing refactoring: the proposed changes did not align with the direction that part of the project is taking.
    • The contributor was thanked for their effort and informed that similar functionality would be addressed differently in future updates.

PR #66: fix: metadata is none or not

  • State: Closed (Not Merged)
  • Created by: Sarim Ahmed
  • Description: This PR aimed to handle cases where metadata might be None, preventing potential runtime errors.
  • Notable Points:
    • Another approach using Pathway Table API was merged instead, suggesting a more robust solution was found.
    • The contributor was encouraged to review the alternative solution.
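
The class of bug PR #66 targeted can be sketched in plain Python. This is neither the code from the PR nor the Table API approach that was merged instead; `normalize_metadata` and `document_path` are hypothetical helpers showing one common way to guard against a `None` metadata value.

```python
from typing import Any, Optional


def normalize_metadata(metadata: Optional[dict]) -> dict:
    """Return a dict copy of the metadata, or an empty dict when it is None."""
    return dict(metadata) if metadata is not None else {}


def document_path(metadata: Optional[dict]) -> Any:
    """Look up the 'path' entry without failing when metadata is missing."""
    return normalize_metadata(metadata).get("path", "<unknown>")


print(document_path({"path": "docs/intro.md"}))
print(document_path(None))
```

Normalizing once at the boundary keeps downstream code free of repeated `is None` checks.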

PR #74: Add change to wording

  • State: Closed (Merged)
  • Created by: Jake Roggenbuck
  • Description: A minor wording fix in the README file was proposed and merged successfully.
  • Notable Points:
    • This indicates active maintenance of documentation, ensuring clarity and accuracy.

PR #72: Introducing Pathway Guru on Gurubase.io

  • State: Closed (Merged)
  • Created by: Kursat Aktas
  • Description: This PR introduced a badge for Pathway Guru on Gurubase.io, an AI assistant for answering user questions about Pathway.
  • Notable Points:
    • The integration with Gurubase.io highlights efforts to enhance user support and engagement through AI-driven tools.
    • There were discussions about maintaining and updating this feature, showing proactive collaboration between contributors and maintainers.

Summary

The open pull requests indicate ongoing efforts to improve CI/CD processes (#68) and introduce more efficient code linting tools (#79). However, the long-standing nature of PR #68 suggests it might require reassessment or closure if no longer relevant.

Closed pull requests reveal active maintenance, including documentation improvements (#74) and user-support additions such as the Gurubase badge (#72). Notably, some PRs were closed without merging due to alternative solutions being implemented (#66) or ongoing refactoring efforts (#76).

Overall, the project appears well-maintained with active contributions from various developers. However, attention should be given to stalled pull requests like #68 to ensure they align with current project priorities.

Report On: Fetch Files For Assessment



Source Code Assessment

1. CHANGELOG.md

  • Purpose: This file documents all notable changes made to the project, adhering to Semantic Versioning.
  • Structure: The changelog is well-organized with clear sections for each version, detailing added features, changes, fixes, and breaking changes.
  • Quality: The file is comprehensive and provides a detailed history of the project's evolution. It includes dates for each release, which helps in tracking the timeline of updates.
  • Observations: The changelog is up-to-date with entries for unreleased changes and versions up to 0.16.2. It effectively communicates the project's progress and any potential impacts on users due to breaking changes.

2. python/pathway/debug/__init__.py

  • Purpose: Contains debugging utilities for the Pathway framework.
  • Structure: The file is structured with functions and classes that facilitate debugging tasks like computing tables, converting tables to pandas DataFrames, and creating tables from various data formats.
  • Quality: The code is well-documented with docstrings explaining the purpose of functions. Type hints are used for function signatures, enhancing readability and maintainability.
  • Observations: The use of decorators like @check_arg_types indicates a focus on ensuring correct argument types, which is good practice for robustness. The file also includes deprecated functions with warnings, showing attentiveness to backward compatibility.
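
The two practices noted above, argument type checking via a decorator and deprecated functions that emit warnings, can be sketched with the standard library. This is not Pathway's `check_arg_types` implementation; `check_simple_arg_types`, `head`, and `first_rows` are hypothetical names, and the checker only handles plain class annotations.

```python
import functools
import inspect
import warnings
from typing import get_type_hints


def check_simple_arg_types(func):
    """Raise TypeError when an argument does not match a plain-class annotation."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only plain classes are checked; generic annotations are skipped.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@check_simple_arg_types
def head(rows: list, n: int = 5) -> list:
    """Return the first n rows."""
    return rows[:n]


def first_rows(rows: list, n: int = 5) -> list:
    """Deprecated alias kept for backward compatibility."""
    warnings.warn(
        "first_rows is deprecated; use head instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return head(rows, n)
```

Keeping the old name as a thin wrapper lets existing callers continue to work while the warning points them to the replacement.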

3. python/pathway/engine.pyi

  • Purpose: A type stub file providing type definitions and interfaces for the engine module.
  • Structure: Defines classes, enums, and functions used in the engine module, focusing on type safety and interface clarity.
  • Quality: The type annotations are comprehensive, covering various aspects of the engine's functionality. This is crucial for static type checking and improving code reliability.
  • Observations: The use of TypeVar and generics suggests a design that accommodates flexibility in data handling. The presence of detailed class and method definitions indicates thorough planning in API design.

4. python/pathway/internals/graph_runner/expression_evaluator.py

  • Purpose: Handles expression evaluation within the graph runner component of the framework.
  • Structure: Implements classes and methods for evaluating expressions in different contexts (e.g., row-wise evaluation).
  • Quality: The code is complex but well-organized, with clear separation of concerns across different evaluator classes. Use of abstract base classes (ABC) suggests a well-thought-out design pattern.
  • Observations: There is an emphasis on type safety and deterministic computation, as seen in methods like eval_expression. The file also integrates closely with other internal modules, indicating a modular architecture.
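
As a rough illustration of the abstract-base-class pattern described above, the sketch below evaluates a tiny expression tree row by row. The class names and the `evaluate` method are hypothetical and much simpler than the actual evaluator; they are not taken from expression_evaluator.py.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class Expression(ABC):
    """Abstract node: every concrete expression must know how to evaluate on a row."""

    @abstractmethod
    def evaluate(self, row: dict) -> object:
        ...


@dataclass(frozen=True)
class Column(Expression):
    name: str

    def evaluate(self, row: dict) -> object:
        return row[self.name]


@dataclass(frozen=True)
class Const(Expression):
    value: object

    def evaluate(self, row: dict) -> object:
        return self.value


@dataclass(frozen=True)
class Add(Expression):
    left: Expression
    right: Expression

    def evaluate(self, row: dict) -> object:
        return self.left.evaluate(row) + self.right.evaluate(row)


def eval_rowwise(expr: Expression, rows: list) -> list:
    """Apply the expression to each row independently, in order."""
    return [expr.evaluate(row) for row in rows]


print(eval_rowwise(Add(Column("a"), Const(1)), [{"a": 1}, {"a": 2}]))
```

Because each node depends only on the row it is given, evaluation is deterministic and new node types can be added without touching existing ones.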

5. python/pathway/internals/graph_runner/path_storage.py

  • Purpose: Manages path storage within the graph runner, facilitating data management and storage strategies.
  • Structure: Defines a Storage class with methods to manipulate column paths and storage configurations.
  • Quality: The use of dataclasses simplifies the definition of storage structures. Cached properties optimize repeated access patterns.
  • Observations: The class provides methods for merging storages and updating paths, which are essential for dynamic data flow management within the framework.
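
The combination of dataclasses and cached properties mentioned above can be sketched as follows. `ColumnStorage` is a hypothetical stand-in, not the `Storage` class from path_storage.py: it maps column names to tuple paths, caches a derived value, and merges two storages by prefixing their paths.

```python
from dataclasses import dataclass, field
from functools import cached_property


@dataclass(frozen=True)
class ColumnStorage:
    """Hypothetical storage layout: column name -> tuple path into a nested row."""

    paths: dict = field(default_factory=dict)

    @cached_property
    def max_depth(self) -> int:
        """Length of the longest path; computed once, then read from the instance dict."""
        return max((len(path) for path in self.paths.values()), default=0)

    def merge(self, other: "ColumnStorage") -> "ColumnStorage":
        """Combine two storages: left paths get prefix 0, right paths get prefix 1.

        On a name collision the right-hand column wins in this sketch.
        """
        merged = {name: (0, *path) for name, path in self.paths.items()}
        merged.update({name: (1, *path) for name, path in other.paths.items()})
        return ColumnStorage(merged)


left = ColumnStorage({"a": (0,)})
right = ColumnStorage({"b": (0,), "c": (1, 0)})
merged = left.merge(right)
print(merged.paths, merged.max_depth)
```

Returning a new frozen instance from `merge` means the cached value never goes stale, since the paths of an existing instance are never reassigned.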

6. python/pathway/tests/test_udf.py

  • Purpose: Contains tests for user-defined functions (UDFs), ensuring their correctness and functionality.
  • Structure: Utilizes pytest for structuring tests, with various test cases covering synchronous and asynchronous UDFs.
  • Quality: Tests are comprehensive, covering edge cases like caching strategies and deprecated features. Mocking is used effectively to simulate function calls without side effects.
  • Observations: The presence of parameterized tests indicates thorough coverage across different scenarios. Deprecated features are tested with appropriate warnings.
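
The testing style described above (parameterized cases, mocks instead of real side effects, and a check that caching avoids repeated calls) can be sketched with the standard library. The actual suite uses pytest; this sketch uses `unittest` so it runs without extra dependencies, and `cached_udf` is a hypothetical in-memory cache wrapper, not Pathway's UDF caching.

```python
import unittest
from unittest import mock


def cached_udf(func):
    """Hypothetical cache wrapper: call func at most once per distinct argument."""
    cache = {}

    def wrapper(arg):
        if arg not in cache:
            cache[arg] = func(arg)
        return cache[arg]

    return wrapper


class CachedUdfTest(unittest.TestCase):
    def test_repeated_calls_hit_cache(self):
        # Each (input, expected) pair runs as its own sub-test.
        for value, expected in [(1, 2), (5, 10), (-3, -6)]:
            with self.subTest(value=value):
                backend = mock.Mock(side_effect=lambda x: 2 * x)
                udf = cached_udf(backend)
                self.assertEqual(udf(value), expected)
                self.assertEqual(udf(value), expected)
                # The second call must be served from the cache.
                backend.assert_called_once_with(value)
```

The mock stands in for an expensive or side-effecting function, so the test can assert on call counts without performing real work.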

7. src/engine/dataflow.rs

  • Purpose: Core Rust file handling data flow logic within the Pathway framework's engine.
  • Structure: Implements modules for complex columns, operators, persistence, etc., using Rust's module system.
  • Quality: The code is robustly structured using Rust's strong typing system. Modules are logically separated to encapsulate specific functionalities like operators or persistence mechanisms.
  • Observations: Use of traits and generics suggests a flexible design that can accommodate various data types and operations efficiently.

8. src/engine/dataflow/operators.rs

  • Purpose: Defines operators used in data flow processing within the Rust engine.
  • Structure: Contains modules for specific operator implementations like external index handling or stateful reduction.
  • Quality: Operators are implemented using Rust's functional programming capabilities (e.g., closures), which enhances performance and readability.
  • Observations: The use of traits like ArrangeWithTypes indicates an extensible design allowing easy addition of new operators or modifications to existing ones.

9. src/engine/graph.rs

  • Purpose: Manages graph-related logic in the Rust engine, crucial for graph-based computations within Pathway.
  • Structure: Implements handles for various components (e.g., columns, tables) using macros to define structures efficiently.
  • Quality: The code leverages Rust's memory safety features extensively. Use of arenas for resource management suggests efficient handling of large datasets or complex graphs.
  • Observations: There is a strong emphasis on error handling and logging, as seen in components like ErrorLogger.

10. src/python_api.rs

  • Purpose: Integrates Python API with Rust backend, enabling seamless interaction between Python code and Rust engine components.
  • Structure: Implements Python bindings using PyO3 to expose Rust functionalities to Python scripts.
  • Quality: The integration appears well-designed with careful attention to type conversions between Python and Rust types.
  • Observations: Extensive use of PyO3 macros facilitates efficient binding generation while maintaining type safety across language boundaries.

Overall, the source code files exhibit high quality in terms of structure, documentation, and adherence to best practices in both Python and Rust development environments.

Report On: Fetch commits



Repo Commits Analysis

Development Team and Recent Activity

Team Members and Activities

  • Kamil Piechowiak (KamilPiechowiak)

    • Worked on append-only UDFs, fixing high CPU idle usage, removing annotations from assign_windows, and operator persistence.
    • Collaborated with Sergey Kulik on some commits.
    • Total of 4 commits with 1361 changes across 20 files.
  • Sergey Kulik (zxqfd555-pw)

    • Focused on muting tests for further investigation, upgrading OpenAI, updating Rust version, supporting simple deletions in DeltaLake connector, and other performance improvements.
    • Total of 4 commits with 271 changes across 13 files.
  • Pathway-Dev

    • Regularly refreshed Pathway examples.
    • Total of 14 commits with 1724 changes across 38 files.
  • Szymon Dudycz (szymondudycz)

    • Updated documentation and marked versions as dynamic in pyproject.toml.
    • Total of 2 commits with 3 changes across 2 files.
  • Berkecan Rizai (berkecanrizai)

    • Added Jenkins CI for RAG evaluations, made metadata column optional in vector store inputs, and added prompt template and context modifiers to BaseRAGQA.
    • Collaborated with Pawel Podhajski on some commits.
    • Total of 3 commits with 2343 changes across 26 files.
  • Pawel Podhajski (pw-ppodhajski)

    • Released Pathway version updates.
    • Total of 1 commit with 6 changes across 3 files.

Patterns and Themes

  • Frequent Updates: The team is actively maintaining the repository with frequent updates, especially in the form of daily example refreshes by Pathway-Dev.

  • Collaboration: There is evidence of collaboration among team members, particularly between Kamil Piechowiak and Sergey Kulik, as well as co-authored commits involving multiple contributors.

  • Focus Areas: Recent activities have focused on performance improvements, bug fixes, feature enhancements like append-only UDFs, and infrastructure updates such as Jenkins CI integration.

  • Testing and Documentation: There is ongoing work to improve testing stability (e.g., muting tests for investigation) and documentation updates to ensure clarity and up-to-date information.

Overall, the development team is actively engaged in both enhancing features and maintaining the stability of the project through regular updates and collaborative efforts.