The Haystack project, managed by deepset-ai, is an orchestration framework for building customizable, production-ready applications powered by large language models (LLMs). It integrates various components such as models, vector databases, and file converters into pipelines or agents that can interact with data. The project is particularly well-suited for tasks like retrieval-augmented generation (RAG), question answering, semantic search, and conversational agent chatbots. With a robust community and user base, including notable organizations like Apple, Netflix, and Nvidia, the project is actively maintained and on a positive trajectory.
Stefano Fiorucci:
Carlos Fernández:
Sebastian Husch Lee:
Vladimir Blagojevic:
David S. Batista:
Massimiliano Pippi:
Pipeline.run
method by moving code to the base class (#7680).Daria Fokina:
Madeesh Kannan:
Pipeline.run
correctly returns all outputs when the include_outputs_from
parameter is used (#7697).Silvano Cerza:
The team exhibits strong collaboration with multiple contributors working on interconnected features and bug fixes. There is a focus on enhancing functionality, fixing bugs, improving compatibility with various environments, and refining documentation.
Issues: Focus on bug fixes, feature enhancements, and addressing inconsistencies.
PRs: Addressing feature requests and bug fixes.
Release Notes Inconsistency (#7716):
Non-deterministic Behavior in LLM-based Evaluators (#7713):
seed
parameter).Telemetry Tests Failures (#7708):
Installation Issues on Databricks (#7647):
typing_extensions
specific to Databricks ML runtime 14.3 LTS.Proposal for Agent Memory API Level Feedback (#7714):
Colab Example Template for Chat + RAG Pipeline (#7627):
The Haystack project is actively maintained with frequent updates and strong community engagement. The development team is focused on enhancing functionality, fixing bugs, and improving compatibility with various environments. However, there are notable risks related to release notes inconsistency, non-deterministic behavior in LLM-based evaluators, and installation issues on specific platforms like Databricks. Addressing these risks will be crucial for maintaining the project's positive trajectory.
Developer | Avatar | Branches | PRs | Commits | Files | Changes |
---|---|---|---|---|---|---|
Massimiliano Pippi | 3 | 8/8/0 | 13 | 328 | 5688 | |
tstadel | 2 | 3/0/1 | 25 | 9 | 2228 | |
Silvano Cerza | 4 | 3/2/0 | 7 | 7 | 1719 | |
Vladimir Blagojevic | 4 | 6/3/3 | 13 | 15 | 358 | |
Guest400123064 | 1 | 0/1/0 | 1 | 4 | 351 | |
Sebastian Husch Lee | 2 | 2/2/0 | 3 | 13 | 299 | |
Stefano Fiorucci | 2 | 5/5/0 | 7 | 18 | 291 | |
David S. Batista | 2 | 3/3/0 | 9 | 13 | 252 | |
Carlos Fernández | 1 | 4/2/2 | 2 | 13 | 248 | |
Rob Pasternak | 1 | 1/0/0 | 8 | 2 | 230 | |
Madeesh Kannan | 2 | 1/1/0 | 2 | 3 | 100 | |
Julian Risch | 1 | 0/1/0 | 1 | 1 | 8 | |
Daria Fokina | 2 | 1/1/0 | 3 | 2 | 4 | |
DL | 2 | 1/1/0 | 2 | 1 | 2 | |
Bilge Yücel | 1 | 1/1/0 | 1 | 1 | 2 | |
Gal Rabin (GalRabin) | 0 | 1/0/1 | 0 | 0 | 0 | |
Mo Sriha (medsriha) | 0 | 0/0/1 | 0 | 0 | 0 | |
Marco Juliani (M-JULIANI) | 0 | 0/0/1 | 0 | 0 | 0 | |
None (ArzelaAscoIi) | 0 | 1/0/1 | 0 | 0 | 0 | |
Varun Krishnan (Varun-Krishnan1) | 0 | 1/0/0 | 0 | 0 | 0 |
PRs: created by that dev and opened/merged/closed-unmerged during the period
The Haystack project, managed by deepset-ai, is an orchestration framework for building customizable, production-ready applications powered by large language models (LLMs). It integrates various components such as models, vector databases, and file converters into pipelines or agents that can interact with data. This framework is particularly well-suited for tasks like retrieval-augmented generation (RAG), question answering, semantic search, and conversational agent chatbots. The project is actively maintained and has a robust community and user base, including notable organizations like Apple, Netflix, and Nvidia. With 13,908 stars on GitHub and frequent updates, the project is on a positive trajectory.
0 days ago - fix release note (#7711) by Stefano Fiorucci (anakin87)
1 day ago - feat: change HTML conversion backend from boilerpy3 to Trafilatura (#7705) by Stefano Fiorucci (anakin87)
1 day ago - add keep-id to DocumentCleaner (#7703) by Carlos Fernández (CarlosFerLo)
2 days ago - feat: widen support of env vars in OpenAI components (#7653) by Carlos Fernández (CarlosFerLo)
2 days ago - feat: Add inference mode to ExtractiveReader (#7699) by Sebastian Husch Lee (sjrl)
2 days ago - fix: Adjust serialization to handle PEP-585 generic types (#7690) by Vladimir Blagojevic (vblagoje)
3 days ago - fix: Adding missing component
decorator to AzureOpenAIGenerator (#7698) by David S. Batista (davidsbatista)
3 days ago - chore: Simplify Pipeline.run
method by moving code to the base class (#7680) by Massimiliano Pippi (masci)
3 days ago - fix: avoid FaithfulnessEvaluator and ContextRelevanceEvaluator return Nan
(#7685) by David S. Batista (davidsbatista)
4 days ago – add pdfminer (#7688) by Daria Fokina (dfokina) – Files: docs/pydoc/config/converters_api.yml – Changes:~ 1 file modified(+ 0,-0),~ 1 line changed(+ 1,-0)
11. 4 days ago – fix:Pipeline.run
correctly returns all outputs when theinclude_outputs_from
parameter is used(#7697)by Madeesh Kannan(shadeMe)
– Files:haystack/core/pipeline/pipeline.py,releasenotes/notes/pipeline-run-fix-extra-outputs-a6c750a91faaa8fd.yaml,test/core/pipeline/test_intermediate_outputs.py
– Changes:~ 2 files modified(+ 1,-0),~ 50 lines changed(+ 48,-2)
12. 4 days ago – fix:Fix NamedEntityExtractor serde(#7684)by Vladimir Blagojevic(vblagoje) – Files:haystack/components/extractors/named_entity_extractor.p y,releasenotes/notes/named-entity-extractor-serde-improvements-28b594be5a38f175.yaml,test/components/extractors/test_named_entity_extractor.p y – Changes:~ 2 files modified(+ 1,-0),~ 26 lines changed(+ 21,-5)
13. 4 days ago – fix:forcing response format to be JSON valid(#7692)by David S.Batista(davidsbatista) – Files:haystack/components/evaluators/llm_evaluator.p y,releasenotes/notes/force-valid-JSON-OpeanAI-LLM-based-evaluators-64816e68f137739b.yam l – Changes:~ 1 file modified(+ 1,-0),~ 10 lines changed(+ 9,-1)
14. 4 days ago – fix:Update device deserialization for components that use local models(#7686)by Sebastian Husch Lee(sjrl) – Files:haystack/components/audio/whisper_local.p y,haystack/components/embedders/sentence_transformers_document_embedder.p y,haystack/components/embedders/sentence_transformers_text_embedder.p y,haystack/components/extractors/named_entity_extractor.p y,haystack/components/rankers/sentence_transformers_diversity.p y,releasenotes/notes/fix-device-deserialization-st-embedder-c4efad96dd3869d5.yam l,test/components/audio/test_whisper_local.p y,test/components/embedders/test_sentence_transformers_document_embedder.p y,test/components/embedders/test_sentence_transformers_text_embedder.p y,test/components/extractors/test_named_entity_extractor.p y,test/components/rankers/test_sentence_transformers_diversity.p y – Changes:~10 files modified(+1,-0),~146 lines changed(+133,-13)
15.8 days ago– Update pipeline.p y(#7679)by DL(Rictus) – Files:haystack/core/pipeline/pipeline.p y – Changes:~1 file modified(+0,-0),~1 line changed(+0,-1)
16.8 days ago– Bump version to 2.1.1by Silvano Cerza(silvanocerza) – Files:VERSION.txt – Changes:~1 file modified(+0,-0),~2 lines changed(+1,-1)
17.8 days ago– Make SparseEmbedding a dataclass(#7678)by Silvano Cerza(silvanocerza) – Files:haystack/dataclasses/document.p y,haystack/dataclasses/sparse_embedding.p y,releasenotes/notes/sparse-embedding-dataclass-d75ae1ee6d75e646.yam l – Changes:~2 files modified(+1,-0),~45 lines changed(+25,-20)
18.9 days ago–fix broken serialization of HFAPI components(#7661)by Stefano Fiorucci(anakin87) – Files:haystack/components/embedders/hugging_face_api_document_embedder.p y,haystack/components/embedders/hugging_face_api_text_embedder.p y,haystack/components/generators/chat/hugging_face_api.p y,haystack/components/generators/hugging_face_api.p y,releasenotes/notes/fix-hf-api-serialization-026b84de29827c57.yam l,test/components/embedders/test_hugging_face_api_document_embedder.p y,test/components/embedders/test_hugging_face_api_text_embedder.p y,test/components/generators/chat/test_hugging_face_api.p y,test/components/generators/test_hugging_face_api.p y – Changes:~8 files modified(+1,-0),~21 lines changed(+13,-8)
19.9days ago–fix serialization ofDocumentRecallEvaluator
(#7662)
fix serialization of DocumentRecallEvaluator
add requested tests
by Stefano Fiorucci(anakin87)
Files:
haystack / components / evaluators / document_recall .py( +11 ,- 1)
releasenotes / notes / fix-serialization-docrecallevaluator-91ad772ffed119ed .yaml(added , +4)
test / components / evaluators / test_document_recall .py( +31 ,- 0)
File totals : ~2 , +1 ,- 0
Line totals : ~47 , +46 ,- 1
20 .11days ago–update version by David S .Batista(davidsbatista) Files : VERSION .txt( +1 ,- 1) File totals : ~1 , +0 ,- 0 Line totals : ~2 , +1 ,- 1
The recent activities show a highly active development team working on various aspects of the Haystack project:
Overall,the development team is diligently working towards making Haystack more robust , flexible , and user-friendly .
The repository currently has 173 open issues. The issues cover a range of topics including feature requests, bug reports, and enhancements. Below is a detailed analysis of some notable problems, uncertainties, disputes, TODOs, and anomalies among the open issues.
2.1.x
patch releases aren't listed on the website.seed
parameter in the LLM-based evaluators using the OpenAI APIseed
parameter to ensure deterministic sampling.seed
feature is still in Beta, which might affect its reliability.finish_reason
values from OpenAI responses.prompt_tokens
vs. input_tokens
).typing_extensions
when installing Haystack on Databricks ML runtime 14.3 LTS.The deepset-ai/haystack repository has a mix of feature requests, bug fixes, and enhancements. Notable problems include inconsistencies in release notes, non-deterministic behavior in LLM-based evaluators, and installation issues on specific platforms like Databricks. There are also ongoing disputes about naming conventions in generator outputs and several important TODOs related to agent memory proposals and example pipelines. Recent efforts have focused on improving documentation accuracy and switching libraries for better performance.
DocumentJoiner
component's run
method did not accept a top_k
parameter, causing a ValueError
. The PR proposes changes to accept and use the top_k
parameter to limit the number of returned documents.top_k
parameter works correctly.pytest
. The previous implementation caused failures due to internal errors.InMemoryDocumentStore
.MetaFieldRanker
missing_meta
parameter to handle documents missing sorting metadata fields with options like 'drop', 'top', and 'bottom'.ChatPromptBuilder
to change prompts at query time and deprecates DynamicChatPromptBuilder
.PromptBuilder
to change prompts at query time and deprecates DynamicPromptBuilder
.MetadataBuilder
-State: Open -Created: 262 days ago, edited 115 days ago -Summary: Adds FileSimilarityRetriever component. -Testing: Unit tests need to be added. -Notable Issues: Long-standing draft; needs attention for completion.
-State: Open -Created: 266 days ago, edited 7 days ago -Summary: Proposal for adding FileSimilarityRetriever component. -Testing: Not specified. -Notable Issues: Needs alignment with current architecture plans.
-Merged by Stefano Fiorucci (anakin87) -Fixed an issue with the release note format.
-Merged by Stefano Fiorucci (anakin87) -Changed HTML conversion backend from boilerpy3 to Trafilatura.
-Merged by Silvano Cerza (silvanocerza) -Added optional property keep_id to DocumentCleaner.
-Merged by Massimiliano Pippi (masci)
Adds inference mode preventing gradients during inference time in pytorch.
-Merged by David S.Batista(davidsbatista)
Fixes serialization issue by adding missing @component decorator.
-Merged by Madeesh Kannan(shadeMe)
Fixes issue ensuring all intermediate outputs are included in final output.
This pull request introduces a new feature to the DocumentJoiner
component, allowing it to accept a top_k
parameter in its run
method. This enhancement addresses issue #7702, which caused a ValueError
when trying to pass the top_k
parameter at query time using pipe.run("DocumentJoiner": {"top_k": top_k})
.
Code Changes:
run
method now accepts an optional top_k
parameter.top_k
is provided during the method call, it overrides the instance's top_k
.top_k
.test_document_joiner.py
to verify that the run
method correctly handles the top_k
parameter and limits the number of returned documents.Documentation:
run
method to include the new top_k
parameter.Files Modified:
haystack/components/joiners/document_joiner.py
: Modified to include the new functionality.test/components/joiners/test_document_joiner.py
: Added a unit test for the new functionality.releasenotes/notes/fix-documentjoiner-topk-173141a894e5c093.yaml
: Added release note.Functionality:
top_k
, ensuring flexibility and backward compatibility.Testing:
top_k
is provided in the method call, it correctly limits the number of returned documents.Documentation:
Code Style:
Additional Tests:
While a single unit test has been added, it might be beneficial to add more tests covering edge cases, such as when no documents are provided or when top_k
exceeds the total number of documents.
Performance Considerations:
Ensure that performance is not significantly impacted when handling large lists of documents, especially when sorting by score before applying top_k
.
Overall, this PR is well-implemented and adds useful functionality to the DocumentJoiner
component. It adheres to good coding practices and includes necessary documentation and testing.
This pull request introduces support for callables as tokenizers in the InMemoryDocumentStore
. This enhancement addresses issue #4720, allowing greater flexibility in how documents are tokenized within the document store.
Code Changes:
Documentation:
Files Modified:
haystack/document_stores/in_memory/document_store.py
: Modified to include support for callable tokenizers.Functionality:
Testing:
Documentation:
Code Style:
Additional Documentation: Consider adding examples in user-facing documentation or tutorials demonstrating how to use callable tokenizers effectively within different contexts.
Comprehensive Testing: Ensure that tests cover a wide range of callable types, including lambdas, named functions, and class-based callables, to guarantee robustness across various use cases.
This PR adds valuable flexibility to the InMemoryDocumentStore
, allowing users to define custom tokenization logic easily. It follows good coding practices, includes necessary documentation updates, and provides thorough testing.
Overall, both PRs introduce meaningful enhancements that improve flexibility and usability within their respective components. They adhere to good coding standards, include necessary documentation updates, and provide adequate testing coverage.
.github/workflows/tests.yml
push
events to the main
branch, release branches, and on pull requests.black
.haystack/components/converters/html.py
Document
objects using the Trafilatura library.HTMLToDocument
class with methods for initialization, serialization (to_dict
), deserialization (from_dict
), and conversion (run
).haystack/components/preprocessors/document_cleaner.py
DocumentCleaner
class with methods for initialization, cleaning (run
), and various helper methods for specific cleaning tasks.None
.haystack/components/embedders/openai_text_embedder.py
OpenAITextEmbedder
class with methods for initialization, embedding (run
), serialization (to_dict
), and deserialization (from_dict
).haystack/core/pipeline/base.py
run
method which orchestrates component execution.PipelineBase
class with a complex run
method handling data flow between components.PipelineRuntimeError
.run
method is quite large and could benefit from refactoring into smaller helper methods to improve readability and maintainability.haystack/components/evaluators/faithfulness.py
FaithfulnessEvaluator
class extending from LLMEvaluator
, with methods for initialization, evaluation (run
), serialization (to_dict
), and deserialization (from_dict
).test/components/generators/chat/test_openai.py
haystack/core/pipeline/pipeline.py
Observations:
Refactored Run Method:** The Pipeline.run method has been simplified which should improve readability and maintainability.
Component Execution:** Manages synchronous execution of components based on their dependencies.
Recommendations:
Further Refactoring:** Consider breaking down complex logic within the run method into smaller helper functions or methods to further improve readability.
Detailed Documentation:** Given its central role in pipeline execution, ensure comprehensive documentation explaining how data flows through the pipeline.
Structure and Quality:
Purpose:** Extracts named entities from documents using either Hugging Face or spaCy backends.
Class Design:** NamedEntityExtractor class with methods for initialization, annotation (run), serialization (to_dict), deserialization (from_dict), and backend management (_NerBackend).
Observations:
Backend Flexibility:** Supports multiple backends (Hugging Face, spaCy) providing flexibility in NER model usage.
Serialization/Deserialization:** Implements methods to serialize and deserialize component state effectively.
Recommendations:
Type Annotations:** Ensure all methods have complete type annotations for better clarity and static analysis.
Docstrings:** Add detailed docstrings explaining parameters, especially in private helper classes like _NerBackend.
Structure and Quality:
Purpose:** Contains tests for pipeline functionalities ensuring robustness after recent refactoring.
Test Coverage Areas:** Tests various aspects of pipeline execution including component connections, data flow, error handling, etc.
Observations:
Comprehensive Testing:** Covers a wide range of scenarios ensuring robustness of pipeline logic.
Mocking External Dependencies:** Mocks external dependencies effectively to isolate unit tests from external factors.
Recommendations:
Test Documentation Needed:** Add comments or docstrings explaining what each test case is verifying.
Edge Cases Testing:** Ensure edge cases are covered in tests (e.g., cyclic dependencies).
The codebase exhibits a high level of organization and modularity across various components. The use of environment variables for sensitive information is commendable. However, there are areas where improvements can be made:
Refactoring Complex Methods:** Some methods like Pipeline.run are quite large and could benefit from further refactoring into smaller helper functions or methods.
Detailed Documentation:** Adding comprehensive documentation including detailed docstrings explaining parameters, return types, expected input/output formats would greatly enhance readability and maintainability.
By addressing these recommendations, the codebase can achieve even higher standards of quality and maintainability.