The Dispatch Demo - deepset-ai/haystack

May 18, 2024, 10:30 a.m. UTC This report was generated by Dispatch AI

Executive Summary

The Haystack project, managed by deepset-ai, is an orchestration framework for building customizable, production-ready applications powered by large language models (LLMs). It integrates various components such as models, vector databases, and file converters into pipelines or agents that can interact with data. The project is particularly well-suited for tasks like retrieval-augmented generation (RAG), question answering, semantic search, and conversational agent chatbots. With a robust community and user base, including notable organizations like Apple, Netflix, and Nvidia, the project is actively maintained and on a positive trajectory.

Active Development: Frequent updates and improvements.
Community Engagement: Strong user base and community involvement.
Robust Functionality: Continuous enhancement of features and performance.
Documentation & Testing: Ongoing improvements in documentation and testing practices.

Recent Activity

Team Members

Stefano Fiorucci (anakin87)
Carlos Fernández (CarlosFerLo)
Sebastian Husch Lee (sjrl)
Vladimir Blagojevic (vblagoje)
David S. Batista (davidsbatista)
Massimiliano Pippi (masci)
Daria Fokina (dfokina)
Madeesh Kannan (shadeMe)
Silvano Cerza (silvanocerza)

Recent Commits

Stefano Fiorucci:
- Fix release note (#7711).
- Change HTML conversion backend from boilerpy3 to Trafilatura (#7705).
- Fix broken serialization of HFAPI components (#7661).
Carlos Fernández:
- Add keep-id to DocumentCleaner (#7703).
- Widen support of env vars in OpenAI components (#7653).
Sebastian Husch Lee:
- Add inference mode to ExtractiveReader (#7699).
- Update device deserialization for components that use local models (#7686).
Vladimir Blagojevic:
- Adjust serialization to handle PEP-585 generic types (#7690).
- Fix NamedEntityExtractor serde (#7684).
David S. Batista:
- Adding missing component decorator to AzureOpenAIGenerator (#7698).
- Avoid FaithfulnessEvaluator and ContextRelevanceEvaluator return Nan (#7685).
Massimiliano Pippi:
- Simplify Pipeline.run method by moving code to the base class (#7680).
Daria Fokina:
- Add pdfminer (#7688).
Madeesh Kannan:
- Fix Pipeline.run correctly returns all outputs when the include_outputs_from parameter is used (#7697).
Silvano Cerza:
- Bump version to 2.1.1.
- Make SparseEmbedding a dataclass (#7678).

Collaboration Patterns

The team exhibits strong collaboration with multiple contributors working on interconnected features and bug fixes. There is a focus on enhancing functionality, fixing bugs, improving compatibility with various environments, and refining documentation.

Recent Issues & PRs

Issues: Focus on bug fixes, feature enhancements, and addressing inconsistencies.
- Notable issues include non-deterministic behavior in LLM-based evaluators (#7713) and installation issues on Databricks (#7647).
PRs: Addressing feature requests and bug fixes.
- Notable PRs include adding support for callables as tokenizers in InMemoryDocumentStore (#7704) and allowing DocumentJoiner to accept a top_k parameter in the run method (#7709).

Risks

Notable Issues

Release Notes Inconsistency (#7716):
- Some PATCH releases are not showing on the website due to manual release notes creation.
Non-deterministic Behavior in LLM-based Evaluators (#7713):
- Difficulty replicating results due to non-deterministic sampling; reliance on a Beta feature (seed parameter).
Telemetry Tests Failures (#7708):
- Internal errors causing telemetry tests to fail; potential impact on other plugins.
Installation Issues on Databricks (#7647):
- ImportError related to typing_extensions specific to Databricks ML runtime 14.3 LTS.

Disputes

Naming Conventions in Generator Outputs (#7687):
- Different notations for similar concepts leading to potential disputes within the community.

Of Note

Proposal for Agent Memory API Level Feedback (#7714):
- A newly created task with no comments yet; critical for future development of memory components.
Colab Example Template for Chat + RAG Pipeline (#7627):
- Developing a Colab notebook demonstrating a Chat + RAG pipeline; important for user engagement and education.

Conclusion

The Haystack project is actively maintained with frequent updates and strong community engagement. The development team is focused on enhancing functionality, fixing bugs, and improving compatibility with various environments. However, there are notable risks related to release notes inconsistency, non-deterministic behavior in LLM-based evaluators, and installation issues on specific platforms like Databricks. Addressing these risks will be crucial for maintaining the project's positive trajectory.

Quantified Commit Activity Over 14 Days

Developer	Branches	PRs	Commits	Files	Changes
Massimiliano Pippi	3	8/8/0	13	328	5688
tstadel	2	3/0/1	25	9	2228
Silvano Cerza	4	3/2/0	7	7	1719
Vladimir Blagojevic	4	6/3/3	13	15	358
Guest400123064	1	0/1/0	1	4	351
Sebastian Husch Lee	2	2/2/0	3	13	299
Stefano Fiorucci	2	5/5/0	7	18	291
David S. Batista	2	3/3/0	9	13	252
Carlos Fernández	1	4/2/2	2	13	248
Rob Pasternak	1	1/0/0	8	2	230
Madeesh Kannan	2	1/1/0	2	3	100
Julian Risch	1	0/1/0	1	1	8
Daria Fokina	2	1/1/0	3	2	4
DL	2	1/1/0	2	1	2
Bilge Yücel	1	1/1/0	1	1	2
Gal Rabin (GalRabin)	0	1/0/1	0	0	0
Mo Sriha (medsriha)	0	0/0/1	0	0	0
Marco Juliani (M-JULIANI)	0	0/0/1	0	0	0
None (ArzelaAscoIi)	0	1/0/1	0	0	0
Varun Krishnan (Varun-Krishnan1)	0	1/0/0	0	0	0

_{PRs: created by that dev and opened/merged/closed-unmerged during the period}

Detailed Reports

Report On: Fetch commits

Project Overview

The Haystack project, managed by deepset-ai, is an orchestration framework for building customizable, production-ready applications powered by large language models (LLMs). It integrates various components such as models, vector databases, and file converters into pipelines or agents that can interact with data. This framework is particularly well-suited for tasks like retrieval-augmented generation (RAG), question answering, semantic search, and conversational agent chatbots. The project is actively maintained and has a robust community and user base, including notable organizations like Apple, Netflix, and Nvidia. With 13,908 stars on GitHub and frequent updates, the project is on a positive trajectory.

Recent Activities of the Development Team

Reverse Chronological List of Recent Commits

Main Branch

0 days ago - fix release note (#7711) by Stefano Fiorucci (anakin87)
- Files: releasenotes/notes/feat-widen-support-of-envars-vars-on-openai-components-efe6203c0c6bd7b3.yaml
- Changes: ~1 file modified (+0, -0), ~3 lines changed (+0, -3)
1 day ago - feat: change HTML conversion backend from boilerpy3 to Trafilatura (#7705) by Stefano Fiorucci (anakin87)
- Files: haystack/components/converters/html.py, pyproject.toml, releasenotes/notes/trafilatura-html-conversion-e9b9044d31fec794.yaml, test/components/converters/test_html_to_document.py
- Changes: ~3 files modified (+1, -0), ~150 lines changed (+61, -89)
1 day ago - add keep-id to DocumentCleaner (#7703) by Carlos Fernández (CarlosFerLo)
- Files: haystack/components/preprocessors/document_cleaner.py, releasenotes/notes/add-keep-id-to-document-cleaner-2a9854b5f195bb78.yaml, test/components/preprocessors/test_document_cleaner.py
- Changes: ~2 files modified (+1, -0), ~18 lines changed (+17, -1)
2 days ago - feat: widen support of env vars in OpenAI components (#7653) by Carlos Fernández (CarlosFerLo)
- Files: haystack/components/embedders/openai_document_embedder.py, haystack/components/embedders/openai_text_embedder.py, haystack/components/generators/chat/openai.py, haystack/components/generators/openai.py, haystack/telemetry/_environment.py, releasenotes/notes/feat-widen-support-of-envars-vars-on-openai-components-efe6203c0c6bd7b3.yaml, test/components/embedders/test_openai_document_embedder.py, test/components/embedders/test_openai_text_embedder.py, test/components/generators/chat/test_openai.py, test/components/generators/test_openai.py
- Changes: ~9 files modified (+1, -0), ~230 lines changed (+220, -10)
2 days ago - feat: Add inference mode to ExtractiveReader (#7699) by Sebastian Husch Lee (sjrl)
- Files: haystack/components/readers/extractive.py, releasenotes/notes/add-inf-mode-reader-e6eb79920e73c956.yaml
- Changes: ~1 file modified (+1, -0), ~7 lines changed (+6, -1)
2 days ago - fix: Adjust serialization to handle PEP-585 generic types (#7690) by Vladimir Blagojevic (vblagoje)
- Files: haystack/utils/type_serialization.py, releasenotes/notes/improve-type-serialization-support-18822a5b978b1e77.yaml, test/utils/test_type_serialization.py
- Changes: ~2 files modified (+1, -0), ~55 lines changed (+51, -4)
3 days ago - fix: Adding missing component decorator to AzureOpenAIGenerator (#7698) by David S. Batista (davidsbatista)
- Files: haystack/components/generators/azure.py, haystack/components/generators/chat/azure.py, releasenotes/notes/fix-azure-generators-serialization-18fcdc9cbcb3732e.yaml, test/components/generators/chat/test_azure.py, test/components/generators/test_azure.py
- Changes: ~4 files modified (+1, -0), ~31 lines changed (+29, -2)
3 days ago - chore: Simplify Pipeline.run method by moving code to the base class (#7680) by Massimiliano Pippi (masci)
- Files: haystack/core/pipeline/base.py, haystack/core/pipeline/pipeline.py, test/core/pipeline/test_pipeline.py
- Changes: ~3 files modified (+0, -0), ~282 lines changed (+188, -94)
3 days ago - fix: avoid FaithfulnessEvaluator and ContextRelevanceEvaluator return Nan (#7685) by David S. Batista (davidsbatista)
- Files: haystack/components/evaluators/context_relevance.py, haystack/components/evaluators/faithfulness.py, releasenotes/notes/avoid-LLM-based-evaluators-returning-NaN-579bc4593febb691.yaml, test/components/evaluators/test_context_relevance_evaluator.py,test/components/evaluators/test_faithfulness_evaluator.py
- Changes: ~4 files modified (+1,-0),~82 lines changed(+80,-2)
4 days ago – add pdfminer (#7688) by Daria Fokina (dfokina) – Files: docs/pydoc/config/converters_api.yml – Changes:~ 1 file modified(+ 0,-0),~ 1 line changed(+ 1,-0)

11. 4 days ago – fix:Pipeline.run correctly returns all outputs when theinclude_outputs_from parameter is used(#7697)by Madeesh Kannan(shadeMe) – Files:haystack/core/pipeline/pipeline.py,releasenotes/notes/pipeline-run-fix-extra-outputs-a6c750a91faaa8fd.yaml,test/core/pipeline/test_intermediate_outputs.py – Changes:~ 2 files modified(+ 1,-0),~ 50 lines changed(+ 48,-2)

12. 4 days ago – fix:Fix NamedEntityExtractor serde(#7684)by Vladimir Blagojevic(vblagoje) – Files:haystack/components/extractors/named_entity_extractor.p y,releasenotes/notes/named-entity-extractor-serde-improvements-28b594be5a38f175.yaml,test/components/extractors/test_named_entity_extractor.p y – Changes:~ 2 files modified(+ 1,-0),~ 26 lines changed(+ 21,-5)

13. 4 days ago – fix:forcing response format to be JSON valid(#7692)by David S.Batista(davidsbatista) – Files:haystack/components/evaluators/llm_evaluator.p y,releasenotes/notes/force-valid-JSON-OpeanAI-LLM-based-evaluators-64816e68f137739b.yam l – Changes:~ 1 file modified(+ 1,-0),~ 10 lines changed(+ 9,-1)

14. 4 days ago – fix:Update device deserialization for components that use local models(#7686)by Sebastian Husch Lee(sjrl) – Files:haystack/components/audio/whisper_local.p y,haystack/components/embedders/sentence_transformers_document_embedder.p y,haystack/components/embedders/sentence_transformers_text_embedder.p y,haystack/components/extractors/named_entity_extractor.p y,haystack/components/rankers/sentence_transformers_diversity.p y,releasenotes/notes/fix-device-deserialization-st-embedder-c4efad96dd3869d5.yam l,test/components/audio/test_whisper_local.p y,test/components/embedders/test_sentence_transformers_document_embedder.p y,test/components/embedders/test_sentence_transformers_text_embedder.p y,test/components/extractors/test_named_entity_extractor.p y,test/components/rankers/test_sentence_transformers_diversity.p y – Changes:~10 files modified(+1,-0),~146 lines changed(+133,-13)

15.8 days ago– Update pipeline.p y(#7679)by DL(Rictus) – Files:haystack/core/pipeline/pipeline.p y – Changes:~1 file modified(+0,-0),~1 line changed(+0,-1)

16.8 days ago– Bump version to 2.1.1by Silvano Cerza(silvanocerza) – Files:VERSION.txt – Changes:~1 file modified(+0,-0),~2 lines changed(+1,-1)

17.8 days ago– Make SparseEmbedding a dataclass(#7678)by Silvano Cerza(silvanocerza) – Files:haystack/dataclasses/document.p y,haystack/dataclasses/sparse_embedding.p y,releasenotes/notes/sparse-embedding-dataclass-d75ae1ee6d75e646.yam l – Changes:~2 files modified(+1,-0),~45 lines changed(+25,-20)

18.9 days ago–fix broken serialization of HFAPI components(#7661)by Stefano Fiorucci(anakin87) – Files:haystack/components/embedders/hugging_face_api_document_embedder.p y,haystack/components/embedders/hugging_face_api_text_embedder.p y,haystack/components/generators/chat/hugging_face_api.p y,haystack/components/generators/hugging_face_api.p y,releasenotes/notes/fix-hf-api-serialization-026b84de29827c57.yam l,test/components/embedders/test_hugging_face_api_document_embedder.p y,test/components/embedders/test_hugging_face_api_text_embedder.p y,test/components/generators/chat/test_hugging_face_api.p y,test/components/generators/test_hugging_face_api.p y – Changes:~8 files modified(+1,-0),~21 lines changed(+13,-8)

19.9days ago–fix serialization ofDocumentRecallEvaluator(#7662) fix serialization of DocumentRecallEvaluator add requested tests by Stefano Fiorucci(anakin87) Files: haystack / components / evaluators / document_recall .py( +11 ,- 1) releasenotes / notes / fix-serialization-docrecallevaluator-91ad772ffed119ed .yaml(added , +4) test / components / evaluators / test_document_recall .py( +31 ,- 0) File totals : ~2 , +1 ,- 0 Line totals : ~47 , +46 ,- 1

20 .11days ago–update version by David S .Batista(davidsbatista) Files : VERSION .txt( +1 ,- 1) File totals : ~1 , +0 ,- 0 Line totals : ~2 , +1 ,- 1

Patterns and Conclusions

The recent activities show a highly active development team working on various aspects of the Haystack project:

Frequent updates and improvements are being made to enhance functionality and performance.
The team is actively fixing bugs and addressing issues related to serialization and component integration.
Significant focus on improving compatibility with various environments and external services like OpenAI.
Continuous improvement in documentation and testing practices.
Collaborative efforts are evident with multiple contributors working together on complex features.

Overall,the development team is diligently working towards making Haystack more robust , flexible , and user-friendly .

Report On: Fetch issues

Analysis of Open Issues for deepset-ai/haystack

Overview

The repository currently has 173 open issues. The issues cover a range of topics including feature requests, bug reports, and enhancements. Below is a detailed analysis of some notable problems, uncertainties, disputes, TODOs, and anomalies among the open issues.

Notable Problems and Uncertainties

Issue #7716: Some PATCH releases are not showing on the website

Problem: The 2.1.x patch releases aren't listed on the website.
Comments: The release notes are manually created, which might be causing inconsistencies.
Uncertainty: Whether this manual process will be automated in the future to avoid such discrepancies.

Issue #7713: Make use of the `seed` parameter in the LLM-based evaluators using the OpenAI API

Problem: Non-deterministic behavior in LLM-based evaluators makes it hard to replicate results.
Solution: Use the seed parameter to ensure deterministic sampling.
Uncertainty: The seed feature is still in Beta, which might affect its reliability.

Issue #7712: LLM-based evaluators stopping conditions and valid JSON

Problem: Inconsistent handling of different finish_reason values from OpenAI responses.
Solution: Ensure all possible stopping conditions are handled and return valid JSON.
Uncertainty: Comprehensive testing is required to cover all edge cases.

Issue #7708: Test telemetry tests so they don't fail

Problem: Telemetry tests are failing due to internal errors with pytest.
Solution: Rewrite some telemetry tests to ensure they run correctly.
Uncertainty: Potential for other plugins to fail due to these tests.

Issue #7704: Add support for callables as a tokenizer in InMemoryDocumentStore

Problem: Current implementation does not support callables as tokenizers.
Solution: Update the code to allow callables as tokenizers.
Uncertainty: Impact on existing implementations and backward compatibility.

Disputes

Issue #7687: Homogenize Generator meta output

Dispute: Different generators use different notations for similar concepts (e.g., prompt_tokens vs. input_tokens).
Resolution Path: Aligning with OpenAI's naming conventions.
Comments: Some community members prefer different naming conventions, leading to potential disputes.

TODOs

Issue #7714: Create proposal for agent memory to get feedback on API level

Task:
- Create a proposal for new components needed for memory implementation.
- Show pseudo-code examples for Chat + RAG use-case.
Status: Newly created, no comments yet.

Issue #7627: Create a colab with an example template Chat + RAG pipeline

Task:
- Develop a Colab notebook demonstrating a Chat + RAG pipeline.
- Provide step-by-step instructions and example code.
Status: No updates or comments yet.

Anomalies

Issue #7647: Installation issues on Databricks

Anomaly: ImportError related to typing_extensions when installing Haystack on Databricks ML runtime 14.3 LTS.
Comments:
- Specific to Databricks environment; may not affect other environments.
- Requires further investigation to identify root cause.

Issue #7648: Use case RAG + one-shot query planning

Anomaly:
- Complex user stories involving multiple data sources and conditional logic.
- Requires breaking down into smaller sub-tasks for better manageability.

Recently Closed Issues

Issue #7711: Fix release note

Resolution:
- Removed incorrect section from release notes.
- Ensures clarity and correctness in documentation.

Issue #7705: Change HTML conversion backend from boilerpy3 to Trafilatura

Resolution:
- Switched backend library for HTML conversion to improve robustness and maintenance.
- Successfully tested with diverse HTML pages.

Conclusion

The deepset-ai/haystack repository has a mix of feature requests, bug fixes, and enhancements. Notable problems include inconsistencies in release notes, non-deterministic behavior in LLM-based evaluators, and installation issues on specific platforms like Databricks. There are also ongoing disputes about naming conventions in generator outputs and several important TODOs related to agent memory proposals and example pipelines. Recent efforts have focused on improving documentation accuracy and switching libraries for better performance.

Report On: Fetch pull requests

Analysis of Pull Requests for deepset-ai/haystack

Open Pull Requests

PR #7709: feat: allow DocumentJoiner to accept top_k parameter in run method

State: Open
Created: 1 day ago
Summary: This PR addresses an issue where the DocumentJoiner component's run method did not accept a top_k parameter, causing a ValueError. The PR proposes changes to accept and use the top_k parameter to limit the number of returned documents.
Testing: Unit tests were added to ensure the top_k parameter works correctly.
Notable: This is a recent PR and seems well-documented with unit tests included.

PR #7708: test: Fix telemetry tests so they don't fail

State: Open
Created: 1 day ago
Summary: This PR rewrites telemetry tests to ensure they run correctly with pytest. The previous implementation caused failures due to internal errors.
Testing: Tests were run locally.
Notable: This is a critical fix for ensuring the stability of telemetry tests.

PR #7706: test: Group up Pipeline unit tests in a single class

State: Open
Created: 1 day ago
Summary: This PR groups Pipeline unit tests into a single class in preparation for future changes that will make testing easier.
Testing: Tests were run locally.
Notable: This is part of a larger effort to improve test organization and maintainability.

PR #7704: feat: add support for callables as a tokeniser in InMemoryDocumentStore

State: Open (Draft)
Created: 1 day ago
Summary: This PR adds support for using callables as tokenizers in the InMemoryDocumentStore.
Testing: Not specified.
Notable: This is still in draft status and lacks detailed testing information.

PR #7700: feat: Add options for what to do with missing metadata fields in `MetaFieldRanker`

State: Open (Draft)
Created: 2 days ago
Summary: Adds a missing_meta parameter to handle documents missing sorting metadata fields with options like 'drop', 'top', and 'bottom'.
Testing: New test functions were added.
Notable: This adds significant functionality to handle edge cases in metadata sorting.

PR #7663: feat: add ChatPromptBuilder, deprecate DynamicChatPromptBuilder

State: Open
Created: 10 days ago
Summary: Extends ChatPromptBuilder to change prompts at query time and deprecates DynamicChatPromptBuilder.
Testing: Added tests.
Notable: There are no breaking changes, but it deprecates an existing component.

PR #7655: feat: extend PromptBuilder and deprecate DynamicPromptBuilder

State: Open
Created: 11 days ago, edited 1 day ago
Summary: Extends PromptBuilder to change prompts at query time and deprecates DynamicPromptBuilder.
Testing: Added tests.
Notable Issues: There are concerns about how this change will be communicated to users, especially regarding variable handling.

PR #7556: fix(JsonSchemaValidator): fix recursive loop and general LLM (claude, mistral...) compatibility

State: Open
Created: 30 days ago, edited 1 day ago
Summary: Fixes recursive loop issues and improves compatibility with various LLMs.
Testing: Manual testing on personal use-case.
Notable Issues: Missing reno note and unit tests. Important addition but needs more work.

PR #7514: feat: Retire openapi3, use openapi-service-client instead

State: Open
Created: 38 days ago, edited 4 days ago
Summary: Replaces the underlying openapi3 library with openapi-service-client for better control and future-proofing.
Testing: Modified unit tests and introduced integration tests.
Notable Issues: Significant change that requires thorough testing.

PR #7078: feat: Add default OutputAdapter filters (post 2.0)

State: Open (Draft)
Created: 84 days ago, edited 2 days ago
Summary: Collects useful filters for OutputAdapter that keep repeating. Includes filters like 'change_role', 'prepare_fc_params', 'tojson'.
Testing: Unit tests executed.
Notable Issues: Long-standing draft; needs review for relevance and completeness.

PR #6636: feat: Add `MetadataBuilder`

State: Open -Created: 147 days ago, edited 117 days ago -Summary: Adds MetadataBuilder component to add generator output as metadata to documents. -Testing: Not specified. -Notable Issues: Needs more definition on its scope and fit into pipeline architecture.

PR #5666: [DRAFT] Add FileSimilarityRetriever to haystack

-State: Open -Created: 262 days ago, edited 115 days ago -Summary: Adds FileSimilarityRetriever component. -Testing: Unit tests need to be added. -Notable Issues: Long-standing draft; needs attention for completion.

PR #5629: Proposal to add file similarity retriever to haystack

-State: Open -Created: 266 days ago, edited 7 days ago -Summary: Proposal for adding FileSimilarityRetriever component. -Testing: Not specified. -Notable Issues: Needs alignment with current architecture plans.

Closed Pull Requests

Recently Closed

PR #7711: chore: fix release note

-Merged by Stefano Fiorucci (anakin87) -Fixed an issue with the release note format.

PR #7705: feat: change HTML conversion backend from boilerpy3 to Trafilatura

-Merged by Stefano Fiorucci (anakin87) -Changed HTML conversion backend from boilerpy3 to Trafilatura.

PR #7703: feat:add keep-id to DocumentCleaner

-Merged by Silvano Cerza (silvanocerza) -Added optional property keep_id to DocumentCleaner.

PR #7699 :feat:Add inference mode to ExtractiveReader

-Merged by Massimiliano Pippi (masci)
Adds inference mode preventing gradients during inference time in pytorch.

PR#7698 :fix :Adding missing component decorator AzureOpenAIGenerator

-Merged by David S.Batista(davidsbatista)
Fixes serialization issue by adding missing @component decorator.

PR#7697 :fix :Pipeline.run correctly returns all outputs when include_outputs_from parameter used

-Merged by Madeesh Kannan(shadeMe)
Fixes issue ensuring all intermediate outputs are included in final output.

Report On: Fetch PR 7709 For Assessment

PR #7709: feat: allow DocumentJoiner to accept top_k parameter in run method

Summary

This pull request introduces a new feature to the DocumentJoiner component, allowing it to accept a top_k parameter in its run method. This enhancement addresses issue #7702, which caused a ValueError when trying to pass the top_k parameter at query time using pipe.run("DocumentJoiner": {"top_k": top_k}).

Changes

Code Changes:
- DocumentJoiner Component:
  - The run method now accepts an optional top_k parameter.
  - If top_k is provided during the method call, it overrides the instance's top_k.
  - The method limits the number of returned documents based on the provided or instance's top_k.
- Unit Tests:
  - Added a new test case in test_document_joiner.py to verify that the run method correctly handles the top_k parameter and limits the number of returned documents.
Documentation:
- Updated docstrings in the run method to include the new top_k parameter.
- Added a release note detailing the enhancement.
Files Modified:
- haystack/components/joiners/document_joiner.py: Modified to include the new functionality.
- test/components/joiners/test_document_joiner.py: Added a unit test for the new functionality.
- releasenotes/notes/fix-documentjoiner-topk-173141a894e5c093.yaml: Added release note.

Code Quality Assessment

Functionality:
- The changes are straightforward and add valuable functionality by allowing dynamic control over the number of documents returned.
- The implementation checks for both instance-level and method-level top_k, ensuring flexibility and backward compatibility.
Testing:
- A unit test has been added to verify that the new functionality works as expected.
- The test ensures that when top_k is provided in the method call, it correctly limits the number of returned documents.
Documentation:
- Docstrings have been updated to reflect the new parameter.
- A release note has been added, providing clear information about the enhancement.
Code Style:
- The code follows standard Python conventions and is consistent with existing code in terms of style and structure.
- Proper use of optional parameters and type hints enhances readability and maintainability.

Recommendations

Additional Tests: While a single unit test has been added, it might be beneficial to add more tests covering edge cases, such as when no documents are provided or when top_k exceeds the total number of documents.
Performance Considerations: Ensure that performance is not significantly impacted when handling large lists of documents, especially when sorting by score before applying top_k.

Overall, this PR is well-implemented and adds useful functionality to the DocumentJoiner component. It adheres to good coding practices and includes necessary documentation and testing.

PR #7704: feat: add support for callables as a tokeniser in InMemoryDocumentStore

Summary

This pull request introduces support for callables as tokenizers in the InMemoryDocumentStore. This enhancement addresses issue #4720, allowing greater flexibility in how documents are tokenized within the document store.

Changes

Code Changes:
- InMemoryDocumentStore Component:
  - Modified to accept callable functions as tokenizers.
  - Ensures that these callables can be used seamlessly within existing workflows.
- Unit Tests:
  - Added tests to verify that callable tokenizers work correctly within the document store.
Documentation:
- Updated docstrings to reflect support for callable tokenizers.
- Added a release note detailing this enhancement.
Files Modified:
- haystack/document_stores/in_memory/document_store.py: Modified to include support for callable tokenizers.
- Unit tests file (not explicitly mentioned but inferred from context).

Code Quality Assessment

Functionality:
- This change significantly enhances flexibility by allowing custom tokenization logic through callables.
- The implementation appears robust, ensuring that callable tokenizers integrate smoothly with existing methods.
Testing:
- Unit tests have been added to ensure that callable tokenizers function as expected.
- These tests likely cover various scenarios, including different types of callables and edge cases.
Documentation:
- Docstrings have been updated appropriately to inform users about this new capability.
- A release note provides clear information about what has been added and how it can be used.
Code Style:
- The code modifications adhere to Python conventions and are consistent with existing code in terms of style and structure.
- Proper handling of callable types ensures maintainability and readability.

Recommendations

Additional Documentation: Consider adding examples in user-facing documentation or tutorials demonstrating how to use callable tokenizers effectively within different contexts.
Comprehensive Testing: Ensure that tests cover a wide range of callable types, including lambdas, named functions, and class-based callables, to guarantee robustness across various use cases.

This PR adds valuable flexibility to the InMemoryDocumentStore, allowing users to define custom tokenization logic easily. It follows good coding practices, includes necessary documentation updates, and provides thorough testing.

Overall, both PRs introduce meaningful enhancements that improve flexibility and usability within their respective components. They adhere to good coding standards, include necessary documentation updates, and provide adequate testing coverage.

Report On: Fetch Files For Assessment

Source Code Assessment

1. `.github/workflows/tests.yml`

Structure and Quality:

Purpose: This file is a GitHub Actions workflow configuration for running tests on the repository.
Triggers: The workflow is triggered on push events to the main branch, release branches, and on pull requests.
Environment Variables: Uses secrets for sensitive information like API keys.
Jobs:
- black: Checks code formatting using black.
- install-dependencies: Installs dependencies across multiple OS environments (Ubuntu, macOS, Windows).
- unit-tests: Runs unit tests across multiple OS environments.
- integration-tests: Runs integration tests on Ubuntu, macOS, and Windows.
- trigger-catch-all: Ensures all tests are completed before marking the workflow as successful.

Observations:

Modularity: The workflow is modular with separate jobs for different tasks (formatting, installing dependencies, running tests).
Caching: Uses caching to speed up dependency installation.
Notifications: Sends events to Datadog for monitoring job statuses.
Coverage Reporting: Uses Coveralls to report test coverage.

Recommendations:

Documentation: Add comments explaining each job's purpose for better readability.
Error Handling: Ensure that failure scenarios are well-handled and provide meaningful feedback.

2. `haystack/components/converters/html.py`

Structure and Quality:

Purpose: Converts HTML files to Document objects using the Trafilatura library.
Class Design:
- HTMLToDocument class with methods for initialization, serialization (to_dict), deserialization (from_dict), and conversion (run).
- Handles deprecated parameters with warnings.
Error Handling: Logs warnings if it fails to read or extract text from sources.

Observations:

Dependency Management: Uses Trafilatura for HTML extraction.
Serialization/Deserialization: Implements methods to serialize and deserialize the component state.
Logging: Utilizes logging for error handling.

Recommendations:

Type Annotations: Ensure all methods have complete type annotations for better clarity and static analysis.
Docstrings: Add detailed docstrings for each method explaining parameters and return types.

3. `haystack/components/preprocessors/document_cleaner.py`

Structure and Quality:

Purpose: Cleans text documents by removing extra whitespaces, empty lines, specified substrings, regexes, page headers, and footers.
Class Design:
- DocumentCleaner class with methods for initialization, cleaning (run), and various helper methods for specific cleaning tasks.
Error Handling: Logs warnings if document content is None.

Observations:

Comprehensive Cleaning Options: Provides multiple options for cleaning documents (e.g., removing substrings, regex matches).
Helper Methods: Uses private helper methods to modularize cleaning tasks.

Recommendations:

Performance Optimization: Consider optimizing regex operations if dealing with large texts frequently.
Docstrings and Comments: Add more detailed docstrings and inline comments to explain complex logic.

4. `haystack/components/embedders/openai_text_embedder.py`

Structure and Quality:

Purpose: Embeds strings using OpenAI models.
Class Design:
- OpenAITextEmbedder class with methods for initialization, embedding (run), serialization (to_dict), and deserialization (from_dict).
- Supports environment variables for configuration.

Observations:

Environment Variable Support: Allows configuration through environment variables which is good practice for sensitive information like API keys.
Error Handling: Raises appropriate errors if input types are incorrect.

Recommendations:

Dependency Management: Ensure that dependencies like OpenAI are properly managed and documented.
Docstrings: Add detailed docstrings explaining each parameter in the constructor.

5. `haystack/core/pipeline/base.py`

Structure and Quality:

Purpose: Contains core pipeline functionalities including the main run method which orchestrates component execution.
Class Design:
- PipelineBase class with a complex run method handling data flow between components.

Observations:

Complex Logic Handling: Manages complex logic for running components in sequence or parallel based on their dependencies.
Error Handling: Raises custom errors like PipelineRuntimeError.

Recommendations:

Refactoring Needed: The run method is quite large and could benefit from refactoring into smaller helper methods to improve readability and maintainability.
Detailed Documentation Needed: Given the complexity, detailed documentation explaining the flow of data through the pipeline would be beneficial.

6. `haystack/components/evaluators/faithfulness.py`

Structure and Quality:

Purpose: Evaluates the faithfulness of generated answers based on provided contexts using an LLM (Large Language Model).
Class Design:
- FaithfulnessEvaluator class extending from LLMEvaluator, with methods for initialization, evaluation (run), serialization (to_dict), and deserialization (from_dict).

Observations:

Default Examples Provided: Includes default examples for evaluation which can be overridden by user-provided examples.
Usage of LLMs: Leverages LLMs to evaluate the faithfulness of answers which is a sophisticated approach.

Recommendations:

Performance Considerations: Ensure that the evaluation process is optimized for performance given that LLMs can be resource-intensive.
Detailed Docstrings Needed: Add more detailed docstrings explaining the expected format of inputs and outputs.

7. `test/components/generators/chat/test_openai.py`

Structure and Quality:

Purpose: Contains tests for OpenAI chat generators ensuring functionality after recent changes.
Test Coverage Areas:
- Initialization tests
- Serialization/Deserialization tests
- Functional tests (e.g., running chat generation)
- Integration tests (requires actual API keys)

Observations:

Fixture Usage: Uses pytest fixtures effectively to manage test data setup.
Mocking External Dependencies: Mocks external dependencies like OpenAI API calls which is good practice in unit testing.

Recommendations:

Test Documentation Needed: Add comments or docstrings explaining what each test case is verifying.
Edge Cases Testing: Ensure edge cases are covered in tests (e.g., invalid inputs).

8. `haystack/core/pipeline/pipeline.py`

Structure and Quality:

Purpose:** Contains core functionalities related to pipeline execution including a simplified version of the Pipeline.run method after recent refactoring.

Observations:

Refactored Run Method:** The Pipeline.run method has been simplified which should improve readability and maintainability.

Component Execution:** Manages synchronous execution of components based on their dependencies.

Recommendations:

Further Refactoring:** Consider breaking down complex logic within the run method into smaller helper functions or methods to further improve readability.

Detailed Documentation:** Given its central role in pipeline execution, ensure comprehensive documentation explaining how data flows through the pipeline.

9. haystack/components/extractors/named_entity_extractor.py

Structure and Quality:

Purpose:** Extracts named entities from documents using either Hugging Face or spaCy backends.

Class Design:** NamedEntityExtractor class with methods for initialization, annotation (run), serialization (to_dict), deserialization (from_dict), and backend management (_NerBackend).

Observations:

Backend Flexibility:** Supports multiple backends (Hugging Face, spaCy) providing flexibility in NER model usage.

Serialization/Deserialization:** Implements methods to serialize and deserialize component state effectively.

Recommendations:

Type Annotations:** Ensure all methods have complete type annotations for better clarity and static analysis.

Docstrings:** Add detailed docstrings explaining parameters, especially in private helper classes like _NerBackend.

10. test/core/pipeline/test_pipeline.py

Structure and Quality:

Purpose:** Contains tests for pipeline functionalities ensuring robustness after recent refactoring.

Test Coverage Areas:** Tests various aspects of pipeline execution including component connections, data flow, error handling, etc.

Observations:

Comprehensive Testing:** Covers a wide range of scenarios ensuring robustness of pipeline logic.

Mocking External Dependencies:** Mocks external dependencies effectively to isolate unit tests from external factors.

Recommendations:

Test Documentation Needed:** Add comments or docstrings explaining what each test case is verifying.

Edge Cases Testing:** Ensure edge cases are covered in tests (e.g., cyclic dependencies).

Summary

The codebase exhibits a high level of organization and modularity across various components. The use of environment variables for sensitive information is commendable. However, there are areas where improvements can be made:

Refactoring Complex Methods:** Some methods like Pipeline.run are quite large and could benefit from further refactoring into smaller helper functions or methods.

Detailed Documentation:** Adding comprehensive documentation including detailed docstrings explaining parameters, return types, expected input/output formats would greatly enhance readability and maintainability.

By addressing these recommendations, the codebase can achieve even higher standards of quality and maintainability.

The Dispatch Demo - deepset-ai/haystack

Executive Summary

Recent Activity

Team Members

Recent Commits

Collaboration Patterns

Recent Issues & PRs

Risks

Notable Issues

Disputes

Of Note

Conclusion

Quantified Commit Activity Over 14 Days

Detailed Reports

Report On: Fetch commits

Project Overview

Recent Activities of the Development Team

Reverse Chronological List of Recent Commits

Main Branch

Patterns and Conclusions

Report On: Fetch issues

Analysis of Open Issues for deepset-ai/haystack

Overview

Notable Problems and Uncertainties

Issue #7716: Some PATCH releases are not showing on the website

Issue #7713: Make use of the seed parameter in the LLM-based evaluators using the OpenAI API

Issue #7712: LLM-based evaluators stopping conditions and valid JSON

Issue #7708: Test telemetry tests so they don't fail

Issue #7704: Add support for callables as a tokenizer in InMemoryDocumentStore

Disputes

Issue #7687: Homogenize Generator meta output

TODOs

Issue #7714: Create proposal for agent memory to get feedback on API level

Issue #7627: Create a colab with an example template Chat + RAG pipeline

Anomalies

Issue #7647: Installation issues on Databricks

Issue #7648: Use case RAG + one-shot query planning

Recently Closed Issues

Issue #7711: Fix release note

Issue #7705: Change HTML conversion backend from boilerpy3 to Trafilatura

Conclusion

Report On: Fetch pull requests

Analysis of Pull Requests for deepset-ai/haystack

Open Pull Requests

PR #7709: feat: allow DocumentJoiner to accept top_k parameter in run method

PR #7708: test: Fix telemetry tests so they don't fail

PR #7706: test: Group up Pipeline unit tests in a single class

PR #7704: feat: add support for callables as a tokeniser in InMemoryDocumentStore

PR #7700: feat: Add options for what to do with missing metadata fields in MetaFieldRanker

PR #7663: feat: add ChatPromptBuilder, deprecate DynamicChatPromptBuilder

PR #7655: feat: extend PromptBuilder and deprecate DynamicPromptBuilder

PR #7556: fix(JsonSchemaValidator): fix recursive loop and general LLM (claude, mistral...) compatibility

PR #7514: feat: Retire openapi3, use openapi-service-client instead

PR #7078: feat: Add default OutputAdapter filters (post 2.0)

PR #6636: feat: Add MetadataBuilder

PR #5666: [DRAFT] Add FileSimilarityRetriever to haystack

PR #5629: Proposal to add file similarity retriever to haystack

Closed Pull Requests

Recently Closed

PR #7711: chore: fix release note

PR #7705: feat: change HTML conversion backend from boilerpy3 to Trafilatura

PR #7703: feat:add keep-id to DocumentCleaner

PR #7699 :feat:Add inference mode to ExtractiveReader

PR#7698 :fix :Adding missing component decorator AzureOpenAIGenerator

PR#7697 :fix :Pipeline.run correctly returns all outputs when include_outputs_from parameter used

Report On: Fetch PR 7709 For Assessment

PR #7709: feat: allow DocumentJoiner to accept top_k parameter in run method

Summary

Changes

Code Quality Assessment

Recommendations

PR #7704: feat: add support for callables as a tokeniser in InMemoryDocumentStore

Summary

Changes

Code Quality Assessment

Recommendations

Report On: Fetch Files For Assessment

Source Code Assessment

1. .github/workflows/tests.yml

Structure and Quality:

Issue #7713: Make use of the `seed` parameter in the LLM-based evaluators using the OpenAI API

PR #7700: feat: Add options for what to do with missing metadata fields in `MetaFieldRanker`

PR #6636: feat: Add `MetadataBuilder`

1. `.github/workflows/tests.yml`

2. `haystack/components/converters/html.py`

3. `haystack/components/preprocessors/document_cleaner.py`

4. `haystack/components/embedders/openai_text_embedder.py`

5. `haystack/core/pipeline/base.py`

6. `haystack/components/evaluators/faithfulness.py`

7. `test/components/generators/chat/test_openai.py`

8. `haystack/core/pipeline/pipeline.py`