The Dispatch

OSS Report: deepset-ai/haystack


Haystack Development Sees Increased Focus on Asynchronous Capabilities and Documentation Improvements

Haystack, an end-to-end framework for building applications powered by large language models, has recently seen a notable increase in development activity, particularly around enhancing asynchronous capabilities and improving documentation. The project is spearheaded by deepset-ai and is widely used for tasks such as retrieval-augmented generation and semantic search.

Recent efforts have concentrated on resolving compatibility issues, enhancing user experience through better documentation, and introducing asynchronous execution in components to improve performance. The development team has been actively addressing bugs, implementing new features, and refining existing functionalities to ensure the framework remains robust and user-friendly.

Recent Activity

Recent issues and pull requests highlight a concerted effort to address compatibility concerns and enhance functionality. Notable issues include #8284, in which a user working with ChromaDB reports conflicting farm-haystack and haystack-ai dependencies, and #8280, which reports a bug that lets component.set_input_types() accept inputs that do not exist. These issues indicate ongoing challenges with dependency management and component reliability.

The development team has been active in both feature development and maintenance tasks:

  1. David S. Batista - Updated sentence_window_retriever.py with linting fixes; contributed to DocumentBuilder branch.
  2. Souf G (gsouf) - Fixed Discord link in README.md.
  3. Stefano Fiorucci (anakin87) - Made metadata JSON serializable; refactored utility functions.
  4. Jon Strutz (jonstrutz11) - Implemented fix for DOCX page breaks.
  5. Sebastian Husch Lee (sjrl) - Added min_top_k feature; extensive refactoring.
  6. Daria Fokina (dfokina) - Cleaned up docstrings; updated documentation.
  7. Madeesh Kannan (shadeMe) - Added async support; engaged in refactoring.
  8. Agnieszka Marzec (agnieszka-m) - Improved documentation across components.
  9. Vladimir Blagojevic (vblagoje) - Updated dependencies; addressed compatibility issues.
  10. Silvano Cerza (silvanocerza) - Enhanced pipeline execution logic.
  11. Marie-Luise Klaus (faymarie) - Fixed serialization issues in ChatPromptBuilder.
  12. Corentin Meyer (lambda-science) - Addressed document processing bugs.
  13. Tim Wellbrock (twellck) - Added document cleaning features.
  14. Haystack Bot - Automated dependency updates.

Of Note

  1. Asynchronous Execution: PR #8279 (still open) proposes an async run method for components, aimed at improving performance and responsiveness.
  2. Zero-Shot Classification: PR #8193 (still open) would add zero-shot document classification, expanding Haystack's processing capabilities.
  3. Dependency Upgrades: Multiple PRs focus on updating dependencies like NLTK to ensure compatibility and security.
  4. Documentation Enhancements: Numerous PRs aim to improve documentation clarity, aiding user onboarding and community contributions.
  5. Community Engagement: Active discussions on best practices and backward compatibility demonstrate a collaborative development environment.

Overall, Haystack's recent activities reflect a dynamic development phase focused on performance improvements, feature expansion, and community-driven enhancements, positioning it well for continued growth and adaptation in AI-driven applications.

Quantified Reports

Quantify Issues



Recent GitHub Issues Activity

Timespan  Opened  Closed  Comments  Labeled  Milestones
7 Days    23      11      28        6        1
30 Days   103     87      93        47       3
90 Days   241     184     317       83       6
1 Year    326     200     482       89       7
All Time  3479    3345    -         -        -

Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.
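
As a quick check on the table above, the all-time row implies how many issues are currently open, and each timespan's closed-to-opened ratio gives a rough sense of whether the backlog is growing. The sketch below only redoes that arithmetic in Python. The names issue_counts and close_ratio are illustrative, and the ratio is only indicative, since the issues closed in a window are presumably not all the same issues that were opened in it.

```python
# Issue counts copied from the table above: timespan -> (opened, closed).
issue_counts = {
    "7 Days": (23, 11),
    "30 Days": (103, 87),
    "90 Days": (241, 184),
    "1 Year": (326, 200),
    "All Time": (3479, 3345),
}


def close_ratio(opened: int, closed: int) -> float:
    """Closed issues per opened issue; a value below 1.0 means the backlog grew."""
    return closed / opened


# Open backlog implied by the all-time row.
all_opened, all_closed = issue_counts["All Time"]
open_backlog = all_opened - all_closed

# Ratio per timespan, rounded to two decimals for display.
ratios = {span: round(close_ratio(o, c), 2) for span, (o, c) in issue_counts.items()}

if __name__ == "__main__":
    print(f"open backlog: {open_backlog}")
    for span, ratio in ratios.items():
        print(f"{span}: {ratio}")
```

Every ratio comes out below 1.0, meaning more issues were opened than closed in each window.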

Quantify Commits



Quantified Commit Activity Over 30 Days

Developer                               Branches  PRs       Commits  Files  Changes
David S. Batista                        3         4/2/1     28       18     1370
Silvano Cerza                           5         5/4/0     16       21     1350
Madeesh Kannan                          3         2/2/0     8        15     1086
Stefano Fiorucci                        3         9/8/0     10       29     856
Vladimir Blagojevic                     3         5/3/2     14       14     830
Daria Fokina                            1         17/17/0   17       20     812
Agnieszka Marzec                        1         17/18/0   18       20     694
Sebastian Husch Lee                     1         2/2/0     2        12     382
Amna Mubashar                           3         4/3/1     4        9      244
Nicola Procopio                         1         1/2/0     2        8      165
Mo Sriha (medsriha)                     1         1/0/0     4        3      131
Tim Wellbrock                           1         1/1/0     1        3      126
Corentin Meyer                          1         0/2/0     2        6      88
Jon Strutz                              1         1/1/0     1        4      57
Marie-Luise Klaus                       1         2/2/0     2        6      50
Tobias Wochinger                        1         1/1/0     1        2      6
dependabot[bot]                         1         2/2/0     2        3      6
Souf G                                  1         1/1/0     1        1      2
Haystack Bot                            1         1/1/0     1        1      2
Ulises M (lbux)                         0         1/0/0     0        0      0
jlonge4                                 0         1/0/1     0        0      0
jpatra72                                0         1/0/0     0        0      0
Ikko Eltociear Ashimine (eltociear)     0         1/0/0     0        0      0
Carlos Fernández (CarlosFerLo)          0         1/0/1     0        0      0
keval dekivadiya (kevaldekivadiya2415)  0         1/0/1     0        0      0

PRs: created by that dev and opened/merged/closed-unmerged during the period
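
For readers who want to work with the PRs column programmatically, the following minimal sketch splits an opened/merged/closed-unmerged cell into its three counts. PRCounts and parse_pr_counts are hypothetical names introduced here for illustration; they are not part of any tool used to produce this report.

```python
from typing import NamedTuple


class PRCounts(NamedTuple):
    """The three counts packed into one PRs cell of the commit table."""
    opened: int
    merged: int
    closed_unmerged: int


def parse_pr_counts(cell: str) -> PRCounts:
    """Parse an 'opened/merged/closed-unmerged' cell such as '4/2/1'."""
    parts = cell.split("/")
    if len(parts) != 3:
        raise ValueError(f"expected three '/'-separated counts, got {cell!r}")
    opened, merged, closed_unmerged = (int(p) for p in parts)
    return PRCounts(opened, merged, closed_unmerged)


if __name__ == "__main__":
    print(parse_pr_counts("4/2/1"))
```

Note that merged can exceed opened, as in the 17/18/0 row. Presumably this happens when a PR opened before the 30-day window is merged within it.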

Detailed Reports

Report On: Fetch issues



Recent Activity Analysis

The GitHub repository for the Haystack project has seen a notable uptick in activity, with 134 open issues currently being tracked. Recent discussions indicate a focus on improving compatibility and functionality, particularly regarding integration with various models and document stores. A significant number of issues revolve around bugs, documentation improvements, and feature requests, reflecting an active engagement from the community.

Several issues highlight critical concerns, such as compatibility problems between different versions of dependencies (e.g., sentence-transformers), which could lead to runtime errors. Additionally, there are multiple discussions about enhancing the documentation to clarify usage and implementation details for various components, indicating that users may be struggling with the current state of the documentation.

Issue Details

Recent Issues

  1. Issue #8284: ChromaDB stuck

    • Priority: Community Triage
    • Status: Open
    • Created: 2 days ago
    • Updated: 1 day ago
    • Summary: User reports conflicting dependencies related to farm-haystack and haystack-ai, leading to confusion about which version to use.
  2. Issue #8281: 🧪 Tools: experiment with different use cases

    • Priority: Not specified
    • Status: Open
    • Created: 2 days ago
    • Summary: Proposal to explore various use cases to ensure component compatibility and optimal user experience.
  3. Issue #8280: component.set_input_types() allows non-existing inputs

    • Priority: Bug
    • Status: Open
    • Created: 2 days ago
    • Summary: Describes a bug where invalid input types can be set in components, leading to unexpected behavior.
  4. Issue #8279: SentenceWindowRetriever: option to return docs instead of merged text

    • Priority: Not specified
    • Status: Open
    • Created: 3 days ago
    • Summary: Feature request to modify the retriever's output format for better usability.
  5. Issue #8276: Option to enable structured outputs with OpenAI Generators

    • Priority: P2
    • Status: Open
    • Created: 3 days ago
    • Summary: Suggestion to support structured outputs in OpenAI generators, enhancing flexibility in responses.
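
Issue #8280 above concerns declared inputs that a component cannot actually receive. One way such a check could look is sketched below in plain Python: it compares the declared input names against the parameters of a run method using inspect.signature. The helper validate_declared_inputs and the Greeter class are hypothetical names for illustration only; this is not Haystack's implementation, nor the fix adopted for the issue.

```python
import inspect
from typing import Any, Callable, Dict


def validate_declared_inputs(run_method: Callable[..., Any], declared: Dict[str, Any]) -> None:
    """Raise ValueError if `declared` names an input that `run_method` cannot accept."""
    params = inspect.signature(run_method).parameters
    # A **kwargs parameter means any keyword name is acceptable.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return
    unknown = sorted(name for name in declared if name not in params)
    if unknown:
        raise ValueError(f"declared inputs not accepted by run(): {unknown}")


class Greeter:
    """Hypothetical component used only for this illustration."""

    def run(self, name: str) -> Dict[str, str]:
        return {"greeting": f"Hello, {name}"}


if __name__ == "__main__":
    validate_declared_inputs(Greeter().run, {"name": str})  # passes silently
    try:
        validate_declared_inputs(Greeter().run, {"nme": str})  # misspelled input
    except ValueError as err:
        print(err)
```

Passing the bound method (Greeter().run) keeps self out of the inspected signature, so only real inputs are compared.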

Important Issues by Priority

  • High Priority Issues (P1):

    • Issue #8258: Add top_k parameter to ChatMessageRetriever.
    • Issue #8255: OpenAIGenerator uses chat_completions endpoint causing errors.
  • Medium Priority Issues (P2):

    • Issue #8276: Option for structured outputs in OpenAI Generators.
    • Issue #8280: Bug related to setting input types incorrectly.

  • Documentation Issues: Several issues (e.g., #8262, #8261) focus on improving documentation clarity regarding component usage and expected parameters.

Summary of Key Themes

  • There is a clear emphasis on resolving compatibility issues between different versions of libraries and ensuring that users have clear guidance on how to implement features correctly.
  • Many recent issues are centered around enhancing user experience through improved documentation and more robust error handling.
  • The community is actively engaged in proposing new features and improvements, indicating a healthy ecosystem around the Haystack project.

This analysis highlights both the challenges faced by users and the proactive steps being taken by contributors to enhance the framework's usability and functionality.

Report On: Fetch pull requests



Report on Pull Requests

Overview

The analysis covers the latest pull requests (PRs) from the deepset-ai/haystack repository, focusing on their state, proposed changes, and significance within the context of the project. The repository currently has 15 open PRs and a history of closed PRs that demonstrate ongoing development and community engagement.

Summary of Pull Requests

  1. PR #8285: chore: update pipeline.py

    • State: Open
    • Created: 1 day ago
    • Proposed Changes: Minor fix in pipeline.py.
    • Significance: Maintains code quality with small adjustments.
  2. PR #8283: initial import

    • State: Open
    • Created: 2 days ago
    • Proposed Changes: Initial import related to documentation updates.
    • Significance: Enhances documentation for new features.
  3. PR #8279: feat: Extend core component machinery to support an async run method

    • State: Open
    • Created: 2 days ago
    • Proposed Changes: Adds support for asynchronous execution in components.
    • Significance: Improves performance and responsiveness of components.
  4. PR #8256: fix: 1.x - nltk upgrade, use nltk.download('punkt_tab')

    • State: Open
    • Created: 5 days ago
    • Proposed Changes: Updates NLTK dependencies and addresses compatibility issues.
    • Significance: Ensures continued functionality with updated libraries.
  5. PR #8244: feat: Expose default_headers and add kwargs for Azure Client

    • State: Open
    • Created: 10 days ago
    • Proposed Changes: Adds optional parameters for Azure Client configuration.
    • Significance: Enhances flexibility for Azure integrations.
  6. PR #8233: feat: Add current date in UTC to PromptBuilder

    • State: Open
    • Created: 10 days ago
    • Proposed Changes: Introduces a function to get the current UTC date in templates.
    • Significance: Improves template functionality for dynamic content.
  7. PR #8193: feat: Adds support for zero-shot document classification (#7669)

    • State: Open
    • Created: 14 days ago
    • Proposed Changes: Implements zero-shot classification capabilities.
    • Significance: Expands the framework's capabilities in document processing.
  8. PR #8176: feat: Add unsafe init arg in ConditionalRouter and OutputAdapter

    • State: Open
    • Created: 17 days ago
    • Proposed Changes: Reintroduces unsafe behavior options for certain components.
    • Significance: Provides backward compatibility while allowing advanced configurations.
  9. PR #8079: feat: Added JSONToDocument component in converter components

    • State: Open
    • Created: 31 days ago
    • Proposed Changes: Introduces a new component for converting JSON data into Document objects.
    • Significance: Enhances data handling capabilities within the framework.
  10. Various PRs focused on cleaning up docstrings and improving documentation across different components (e.g., PRs #8229, #8219, etc.). These changes are crucial for maintaining clarity and usability of the codebase as it evolves.

Analysis of Pull Requests

The recent activity in the deepset-ai/haystack repository highlights several key themes:

  1. Enhancements to Asynchronous Capabilities: The introduction of asynchronous methods (as seen in PR #8279) is a significant step towards improving performance, especially as applications scale and require non-blocking operations.

  2. Dependency Management and Upgrades: Several PRs focus on upgrading dependencies. PR #8256, for example, upgrades NLTK on the 1.x line and switches to nltk.download('punkt_tab'). Keeping pace with such upstream changes is critical for maintaining security and compatibility with external libraries.

  3. Feature Additions: New features such as zero-shot classification (PR #8193) and enhancements to existing components (like the Azure Client) reflect a commitment to expanding the functionality of Haystack, making it more versatile for users across different domains.

  4. Documentation Improvements: A notable number of PRs are dedicated to cleaning up docstrings and enhancing documentation (e.g., PRs #8227, #8219). This is essential for fostering community contributions and ensuring that new users can effectively utilize the framework without extensive onboarding.

  5. Community Engagement: The repository shows active discussions among contributors regarding best practices, such as handling deprecated methods (e.g., PR #8146) and ensuring backward compatibility (e.g., PR #8176). This collaborative spirit is vital for sustaining an open-source project.

  6. Backwards Compatibility vs New Features: There is a balancing act between introducing new features and maintaining backward compatibility, as seen with the deprecation of certain methods while providing alternatives (e.g., PR #8206).

  7. Testing Focus: Many recent PRs include unit tests or mention testing as part of their changes, indicating a strong emphasis on quality assurance within the development process.
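
To make the first theme above more concrete, here is a minimal sketch of what pairing a blocking run method with an async counterpart can look like, and why it helps with non-blocking workloads. It uses only the standard asyncio module. The EchoComponent class, the run_async method name, and the run_all helper are assumptions made for illustration; the report does not describe the interface that PR #8279 actually defines.

```python
import asyncio
from typing import Dict, List


class EchoComponent:
    """Hypothetical component offering both a blocking and an async entry point."""

    def __init__(self, delay: float) -> None:
        self.delay = delay

    def run(self, text: str) -> Dict[str, str]:
        # Synchronous path for callers that are not inside an event loop.
        return asyncio.run(self.run_async(text))

    async def run_async(self, text: str) -> Dict[str, str]:
        # asyncio.sleep stands in for an I/O-bound call such as a model request.
        await asyncio.sleep(self.delay)
        return {"text": text.upper()}


async def run_all(components: List[EchoComponent], text: str) -> List[Dict[str, str]]:
    # gather awaits all coroutines concurrently, so the waits overlap
    # instead of adding up as they would with sequential run() calls.
    return await asyncio.gather(*(c.run_async(text) for c in components))


if __name__ == "__main__":
    outputs = asyncio.run(run_all([EchoComponent(0.05), EchoComponent(0.05)], "hello"))
    print(outputs)
```

With two components that each wait 0.05 seconds, the concurrent version takes roughly one delay rather than two. Any gain depends on the work being I/O-bound; CPU-bound components would not benefit in the same way.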

In summary, the pull requests reflect a robust development cycle characterized by feature enhancement, dependency management, community collaboration, and a strong focus on documentation and testing practices. This positions Haystack well for future growth and adaptation in an evolving landscape of AI applications.

Report On: Fetch commits



Repo Commits Analysis

Development Team and Recent Activity

Team Members and Recent Contributions:

  1. David S. Batista

    • Recent Activity:
    • Updated sentence_window_retriever.py with linting fixes and added release notes.
    • Contributed significantly to the DocumentBuilder branch with multiple commits focusing on tests, refactoring, and cleaning up code.
    • Active in merging branches and maintaining documentation.
  2. Souf G (gsouf)

    • Recent Activity:
    • Fixed the Discord link in the README.md.
  3. Stefano Fiorucci (anakin87)

    • Recent Activity:
    • Worked on making metadata produced by DOCXToDocument JSON serializable.
    • Refactored utility functions for document store deserialization.
    • Contributed to various bug fixes and enhancements across multiple components.
  4. Jon Strutz (jonstrutz11)

    • Recent Activity:
    • Implemented a fix for extracting page breaks from DOCX files.
  5. Sebastian Husch Lee (sjrl)

    • Recent Activity:
    • Added the min_top_k feature to the TopPSampler.
    • Engaged in extensive refactoring and testing of various components.
  6. Daria Fokina (dfokina)

    • Recent Activity:
    • Cleaned up docstrings across multiple components.
    • Contributed to documentation updates and enhancements.
  7. Madeesh Kannan (shadeMe)

    • Recent Activity:
    • Added async support to core components.
    • Engaged in various refactoring tasks and maintenance of existing features.
  8. Agnieszka Marzec (agnieszka-m)

    • Recent Activity:
    • Focused on cleaning up docstrings and improving documentation across multiple components.
  9. Vladimir Blagojevic (vblagoje)

    • Recent Activity:
    • Engaged in updating dependencies and addressing issues related to component compatibility.
  10. Silvano Cerza (silvanocerza)

    • Recent Activity:
    • Worked on enhancing pipeline execution logic and fixing bugs related to component execution order.
  11. Marie-Luise Klaus (faymarie)

    • Recent Activity:
    • Fixed issues related to serialization in ChatPromptBuilder.
  12. Corentin Meyer (lambda-science)

    • Recent Activity:
    • Addressed bugs in document processing components.
  13. Tim Wellbrock (twellck)

    • Recent Activity:
    • Added features related to document cleaning processes.
  14. Haystack Bot

    • Recent Activity:
    • Automated dependency updates through Dependabot.

Patterns, Themes, and Conclusions:

  • The team shows a strong focus on enhancing existing features, particularly around document processing capabilities such as DOCXToDocument and sentence_window_retriever.
  • There is a significant emphasis on documentation improvements, indicating a commitment to maintainability and usability for future developers.
  • Collaborative efforts are evident, with many co-authored commits suggesting a culture of peer review and teamwork.
  • The activity reflects ongoing efforts to modernize the codebase, including async support and refactoring for better performance.
  • Several members are actively involved in both feature development and bug fixing, showcasing versatility within the team.
  • The recent contributions indicate a proactive approach towards addressing technical debt while also implementing new features that enhance the framework's capabilities.

Overall, the development team is actively engaged in maintaining a high level of productivity, ensuring that Haystack remains robust and adaptable to user needs while fostering community contributions.