OSS Report: Future-House/paper-qa

Sept. 14, 2024, 1:30 p.m. UTC. This report was generated by Dispatch AI.

PaperQA2 Development Faces Challenges with API Key Access and Performance Regressions

PaperQA2, a Python-based tool designed for answering questions from scientific documents, is experiencing issues with API key access and performance regressions, impacting user experience and functionality.

Recent Activity

Recent issues and pull requests (PRs) focus on critical bugs and documentation. Issue #412 reports that users can no longer obtain a Crossref API key, a significant obstacle for anyone who relies on Crossref metadata. Issue #397 reports a 404 error tied to a model missing from the OpenAI API, which affects core functionality. Issue #408 raises concerns about document ingestion speed.

The development team has been actively working on these challenges. Key contributors include:

James Braza: Leading with 40 commits, focusing on bug fixes, new features, and CI workflow improvements.
Andrew White: Contributed 25 commits, enhancing citation handling and metadata fetching.
Michael Skarlinski: With 29 commits, focused on code refactoring and agentic workflows.
Geemi Wellawatte: With 19 commits, made significant updates to Crossref client functionalities.

Recent Commits (Reverse Chronological)

James Braza: Created LitQAv2TaskDataset, fixed BaseModel defaults.
Andrew White: Merged main branch changes into feature branches.
Michael Skarlinski: Added high-quality settings for agentic workflows.
Geemi Wellawatte: Updated Crossref client functionalities.
Siddharth Narayanan: Improved search functionalities.

Of Note

The unavailability of the Crossref API key (#412) is a critical issue that requires immediate attention to maintain metadata access.
The transition to new versions has led to performance regressions (#408), indicating potential scalability concerns.
The introduction of .docx file support (#403) addresses a previously missing feature, enhancing document compatibility.
The project's claim of "superhuman" performance metrics in scientific tasks remains a bold assertion that distinguishes it from similar tools.
Collaboration among team members is evident, but challenges remain in aligning older PRs with the evolving codebase.

Quantified Reports

Quantify Issues

Recent GitHub Issues Activity

Timespan	Opened	Closed	Comments	Labeled	Milestones
7 Days	20	52	43	2	1
30 Days	24	53	46	6	1
90 Days	32	57	57	13	1
1 Year	64	65	126	38	1
All Time	165	128	-	-	-

Note: Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.
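The table can be cross-checked with a little arithmetic. The sketch below is illustrative only: the helper names (`net_change`, `summarize`) are not part of any tool mentioned in this report. It computes opened minus closed for each window. For the All Time row, that difference is the number of issues still open, assuming every closed issue is counted once.

```python
# Rows copied from the "Recent GitHub Issues Activity" table:
# (timespan, issues opened, issues closed).
ISSUE_ROWS = [
    ("7 Days", 20, 52),
    ("30 Days", 24, 53),
    ("90 Days", 32, 57),
    ("1 Year", 64, 65),
    ("All Time", 165, 128),
]


def net_change(opened: int, closed: int) -> int:
    """Opened minus closed; a negative value means the backlog shrank."""
    return opened - closed


def summarize(rows):
    """Map each timespan label to its net change in open issues."""
    return {span: net_change(opened, closed) for span, opened, closed in rows}


if __name__ == "__main__":
    for span, delta in summarize(ISSUE_ROWS).items():
        print(f"{span:>8}: {delta:+d}")
```

The All Time row comes out at +37, which matches the open-issue count cited in the issues report. Every shorter window is negative, so more issues were closed than opened in each recent period, and 52 of the 128 all-time closures fall within the last 7 days.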

Quantify commits

Quantified Commit Activity Over 30 Days

Developer	Branches	PRs	Commits	Files	Changes
mskarlin	6	15/12/3	29	76	43458
Andrew White	2	12/13/0	25	61	20954
James Braza	3	46/40/4	40	46	6763
Geemi Wellawatte	2	3/3/0	19	9	626
Siddharth Narayanan	2	2/1/1	7	8	145
Tyler Nadolski (nadolskit)	1	1/0/0	4	3	99
Tabish Mir	1	2/1/0	1	1	30
Yusuf (Yusufibin)	0	0/0/1	0	0	0
Krrish Dholakia (krrishdholakia)	0	0/0/1	0	0	0

Note: PRs are those created by that dev and opened/merged/closed-unmerged during the period.
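The PRs column packs three counts into one cell (opened/merged/closed-unmerged), so it helps to split it before comparing contributors. The following is a minimal sketch, assuming the table is tab-separated as shown. `DevRow`, `parse_row`, and `commit_share` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class DevRow(NamedTuple):
    name: str
    prs_opened: int
    prs_merged: int
    prs_closed_unmerged: int
    commits: int
    changes: int


# Rows copied from the 30-day commit table (tab-separated). The two
# contributors with zero commits are left out; they do not affect the totals.
TABLE = (
    "mskarlin\t6\t15/12/3\t29\t76\t43458\n"
    "Andrew White\t2\t12/13/0\t25\t61\t20954\n"
    "James Braza\t3\t46/40/4\t40\t46\t6763\n"
    "Geemi Wellawatte\t2\t3/3/0\t19\t9\t626\n"
    "Siddharth Narayanan\t2\t2/1/1\t7\t8\t145\n"
    "Tyler Nadolski (nadolskit)\t1\t1/0/0\t4\t3\t99\n"
    "Tabish Mir\t1\t2/1/0\t1\t1\t30\n"
)


def parse_row(line: str) -> DevRow:
    """Parse one row: Developer, Branches, PRs, Commits, Files, Changes."""
    name, _branches, prs, commits, _files, changes = line.split("\t")
    opened, merged, closed_unmerged = (int(part) for part in prs.split("/"))
    return DevRow(name, opened, merged, closed_unmerged, int(commits), int(changes))


ROWS = [parse_row(line) for line in TABLE.splitlines()]
TOTAL_COMMITS = sum(row.commits for row in ROWS)


def commit_share(row: DevRow) -> float:
    """Fraction of all commits in the period made by this contributor."""
    return row.commits / TOTAL_COMMITS


if __name__ == "__main__":
    for row in sorted(ROWS, key=lambda r: r.commits, reverse=True):
        print(f"{row.name:<28} {row.commits:>3} commits ({commit_share(row):.0%})")
```

The two rankings differ. James Braza has the most commits (40 of 125), while mskarlin has the largest value in the Changes column, so "most active" depends on which column is used.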

Detailed Reports

Report On: Fetch issues

Recent Activity Analysis

The GitHub repository for the Future-House/paper-qa project currently has 37 open issues, with recent activity indicating a mix of questions, documentation requests, and bugs. Notably, there are several urgent inquiries regarding API key issues and performance regressions after recent updates. A common theme among the issues is the transition to new versions and the associated challenges users face, particularly with embedding models and API integrations.

Several issues stand out due to their implications for user experience and functionality. For instance, Issue #397 highlights a critical bug related to a missing model in the OpenAI API, which could affect users relying on specific functionalities. Additionally, Issue #381 discusses rate limits encountered during document indexing, suggesting potential scalability concerns as users attempt to process larger datasets. The presence of multiple questions about documentation and usage (e.g., #409, #402) indicates that users may struggle with understanding how to effectively utilize the tool's capabilities.

Issue Details

Most Recently Created Issues:

Issue #412: Not possible to get Crossref API Key - This tool is no longer available
- Priority: High
- Status: Open
- Created: 0 days ago
- Comments: Users are seeking alternatives for Crossref API access.
Issue #409: Documentation in CONTRIBUTING.md for pytest-recording and VCR cassettes
- Priority: Medium
- Status: Open
- Created: 0 days ago
- Comments: Request for documentation improvements.
Issue #408: Bypassing Some Pre-processing and Validation Steps in Pipeline for Faster Document Ingestion
- Priority: Medium
- Status: Open
- Created: 1 day ago
- Comments: User reports performance regression after an update.
Issue #402: CLI functionality in Python module
- Priority: Medium
- Status: Open
- Created: 1 day ago
- Comments: User seeks clarification on using CLI features within Python.
Issue #399: Dependency Dashboard
- Priority: Low
- Status: Open
- Created: 1 day ago
- Comments: Automated report on dependency updates.

Most Recently Updated Issues:

Issue #397: Error Code 404
- Priority: High
- Status: Open
- Last Updated: 1 day ago
- Comments: Critical error related to missing model access.
Issue #393: Azure Open AI API KEY
- Priority: Medium
- Status: Open
- Last Updated: 2 days ago
- Comments: Inquiry about using Azure's API instead of OpenAI's.
Issue #392: Installation problem
- Priority: Medium
- Status: Open
- Last Updated: 2 days ago
- Comments: User faces installation issues on macOS.
Issue #391: How to increase the number of citations in the answers?
- Priority: Medium
- Status: Open
- Last Updated: 2 days ago
- Comments: User requests guidance on citation parameters.
Issue #390: How to use Version 5 with LiteLLM and Ollama?
- Priority: Medium
- Status: Open
- Last Updated: 2 days ago
- Comments: User seeks examples for integrating new features.

Summary

The recent activity in the Future-House/paper-qa GitHub repository reflects a dynamic environment where users are actively engaging with the tool while encountering various challenges related to updates, documentation clarity, and integration with APIs. The issues raised indicate a need for better guidance on using new features and addressing critical bugs that could hinder user experience.

Report On: Fetch pull requests

Overview

The analysis of the pull requests (PRs) from the Future-House/paper-qa repository reveals a mix of ongoing enhancements, bug fixes, and documentation updates. The repository currently has 9 open PRs and 221 closed PRs, indicating active development with a focus on improving functionality and usability.

Summary of Pull Requests

Open Pull Requests

PR #411: Broken title search ut
Created by Tyler Nadolski. This PR addresses a failing unit test related to citation counts from different sources. It highlights discrepancies in citation counts based on the order of sources queried. Review comments suggest improvements in regex usage and code style.
PR #410: pytest-recording docs in CONTRIBUTING.md
Created by James Braza. This PR adds documentation for using the pytest-recording plugin, enhancing the contribution guidelines for developers.
PR #407: Promoting agent factories to Settings
Created by James Braza. This enhancement improves encapsulation by moving agent factories into a settings module, streamlining imports.
PR #403: Add support to read docx files
Created by Tabish Mir. This PR introduces functionality to parse .docx files, addressing a previously missing feature while also fixing case sensitivity issues in file handling.
PR #205: Add docx reader
Created by Nish (NISH1001). This older PR proposes adding a reader for .docx files but has not been updated since significant changes were made to the repository.
PR #177: Batch summarisation
Created by Zac Pullar-Strecker. This PR aims to enhance performance for batch summarization tasks but also requires rebasing due to recent changes in the main branch.
PR #131: Implement adversarial prompting
Created by David Brodrick. This PR introduces a method for adversarial prompting, which enhances answer quality but also needs updates following major refactoring in the repository.
PR #82: copy from Zotero storage
Created by Gabriel Simmons. This PR aims to improve file handling by copying PDFs from Zotero storage rather than downloading them, but it has not been updated since v5 was released.
PR #81: Better BibTeX citekeys
Created by Gabriel Simmons. This PR adds support for Better BibTeX citekeys but also requires rebasing due to recent changes.
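PR #403 combines two ideas that are easy to illustrate: extracting text from a .docx file and choosing a parser without regard to the case of the file extension. The sketch below is not the code from that PR and does not use paper-qa's API. It is a standard-library illustration, and the names `read_docx_text`, `read_plain_text`, `READERS`, and `read_document` are hypothetical. It relies on the fact that a .docx file is a ZIP archive whose main text is stored in `word/document.xml`.

```python
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_docx_text(path) -> str:
    """Return the text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more <w:t> run elements.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def read_plain_text(path) -> str:
    """Return the contents of a UTF-8 text file."""
    return Path(path).read_text(encoding="utf-8")


# Hypothetical dispatch table keyed by lowercase extension.
READERS = {".docx": read_docx_text, ".txt": read_plain_text}


def read_document(path) -> str:
    """Pick a reader by extension, ignoring case (".DOCX" matches ".docx")."""
    suffix = Path(path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None
    return reader(path)
```

Lowercasing the suffix once, at the dispatch point, keeps the table small and means a file named `Notes.DOCX` takes the same path as `notes.docx`. A production reader would also need to handle tables, headers, and footnotes, which this sketch ignores.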

Closed Pull Requests

A notable trend in closed PRs includes numerous enhancements aimed at improving CI/CD processes, fixing bugs related to testing frameworks, and refining documentation.
Many closed PRs indicate a proactive approach to code quality, with several focused on passing type checks (mypy) and enhancing testing reliability.
The project has seen significant contributions towards improving its overall architecture, such as moving components into more logical structures (e.g., agent factories).

Analysis of Pull Requests

The current landscape of pull requests in the Future-House/paper-qa repository reflects an active development cycle characterized by both ongoing enhancements and necessary bug fixes. The open pull requests indicate that contributors are focusing on critical areas such as improving test reliability (e.g., PR #411), enhancing documentation (e.g., PR #410), and adding new features like support for .docx files (e.g., PR #403).

One notable aspect is the presence of older pull requests that have not been merged or updated since significant changes were made to the repository (e.g., PR #205 and PR #177). This suggests potential challenges in maintaining alignment with the evolving codebase, which can lead to contributor frustration and hinder project momentum if not addressed promptly.

Moreover, there is an evident emphasis on improving code quality through type checking and linting processes, as seen in multiple closed pull requests that focus on passing mypy checks and integrating tools like pylint. This commitment to maintaining high code quality standards is commendable and essential for long-term project sustainability.

In terms of collaboration dynamics, review comments on several open pull requests show constructive feedback aimed at improving code practices and ensuring adherence to project standards (e.g., regex improvements suggested in PR #411). However, some contributors express difficulty keeping up with recent changes, indicating a need for better communication regarding major updates or changes in direction within the project.

The project's evolution from PaperQA1 to PaperQA2 signifies substantial architectural shifts aimed at enhancing performance and usability. The introduction of features like agentic workflows and a user-friendly CLI demonstrates an intent to cater to researchers' needs more effectively.

In conclusion, while the project exhibits robust activity levels with numerous contributions aimed at enhancing functionality and usability, it faces challenges related to managing older pull requests and ensuring contributors remain aligned with ongoing developments. Addressing these issues will be crucial for maintaining momentum and fostering a collaborative development environment moving forward.

Report On: Fetch commits

Repo Commits Analysis

Development Team and Recent Activity

Team Members and Their Recent Activities

James Braza (jamesbraza)
- Recent Commits: 40 commits in the last 30 days.
- Key Contributions:
  - Created LitQAv2TaskDataset for agent training/evaluation.
  - Fixed various bugs including mutable BaseModel defaults and crashes in chunk_text.
  - Added new features like Renovate config and improved CI workflows.
  - Collaborated with others on README updates and documentation improvements.
- In Progress: Ongoing work in multiple branches, including factories-in-settings.
Andrew White (whitead)
- Recent Commits: 25 commits in the last 30 days.
- Key Contributions:
  - Merged changes from the main branch into feature branches.
  - Updated README and added explanations for various components.
  - Worked on improving citation handling and metadata fetching.
- In Progress: Active in the issue-366 branch.
Michael Skarlinski (mskarlin)
- Recent Commits: 29 commits in the last 30 days.
- Key Contributions:
  - Added high-quality settings and improved agentic workflows.
  - Refactored code to enhance readability and maintainability.
  - Collaborated on multiple features including new configurations for agents.
- In Progress: Work ongoing in several branches, including test-speed.
Geemi Wellawatte (geemi725)
- Recent Commits: 19 commits in the last 30 days.
- Key Contributions:
  - Made significant updates to Crossref client functionalities.
  - Enhanced error handling and modularized code for better maintainability.
- In Progress: Active in the issue-366 branch.
Siddharth Narayanan (sidnarayanan)
- Recent Commits: 7 commits in the last 30 days.
- Key Contributions:
  - Focused on improving search functionalities and handling of journal data.
  - Participated in refactoring efforts to clean up code structure.
Tyler Nadolski (nadolskit)
- Recent Commits: 4 commits in the last 30 days.
- Key Contributions:
  - Updated unit tests and made minor adjustments to improve code quality.
Tabish Mir (taabishm2)
- Recent Commits: 1 commit in the last 30 days.
- Key Contributions:
  - Contributed to README improvements.
Yusufibin & Krrish Dholakia
- No recent activity reported.

Patterns, Themes, and Conclusions

James Braza leads recent activity by commit count (40 of the 125 commits in the last 30 days) and by PRs opened (46), making him a key driver of feature development and bug fixes. Michael Skarlinski accounts for the largest volume of changes (43,458 in the Changes column).
Collaboration is evident among team members, especially with co-authored commits on documentation and feature enhancements, which suggests a cohesive team dynamic focused on improving usability and performance of PaperQA2.
There is a strong emphasis on enhancing testing frameworks and documentation, reflecting a commitment to maintaining code quality and user guidance as the project evolves.
The presence of multiple active branches indicates ongoing development across various features, with some members focusing on specific areas like error handling, metadata management, and user interface improvements.
The team's collective efforts are aimed at refining existing functionalities while introducing new features that align with the project's goals of providing high accuracy in scientific document retrieval and question answering.