ScrapeGraphAI, a Python library for web scraping using large language models and graph logic, has seen no significant development activity in the past 30 days despite maintaining strong community interest with over 13,000 stars and 1,000 forks.
Recent issues and pull requests (PRs) indicate a focus on expanding model support and enhancing scraping capabilities. Notable open PRs include #559, which adds configuration for using requests
instead of Langchain for OpenAI interactions, and #558, which integrates a screenshot scraper. These efforts suggest a trajectory towards increasing flexibility and feature set. However, closed PRs like #557 and #556, both related to tokenization improvements, were not merged, highlighting potential internal disagreements or prioritization challenges.
Developer | Avatar | Branches | PRs | Commits | Files | Changes |
---|---|---|---|---|---|---|
Marco Vinciguerra | 5 | 25/20/4 | 95 | 278 | 9131 | |
Federico Aguzzi | 5 | 7/7/0 | 36 | 135 | 4782 | |
Marco Perini | 1 | 1/1/0 | 9 | 52 | 2410 | |
Semantic Release Bot | 3 | 0/0/0 | 47 | 2 | 696 | |
Matteo Vedovati | 1 | 3/3/0 | 4 | 8 | 334 | |
amosdinh | 1 | 2/1/0 | 1 | 1 | 10 | |
amazeqiu | 1 | 0/0/0 | 1 | 1 | 9 | |
Nafay Rizwani | 1 | 1/1/0 | 1 | 1 | 8 | |
DragonelRoland | 1 | 2/2/0 | 3 | 3 | 8 | |
Evan Lin | 1 | 1/1/0 | 1 | 1 | 3 | |
Aziz Ullah Khan | 1 | 1/1/0 | 1 | 1 | 2 | |
None (sandeepchittilla) | 1 | 2/1/1 | 1 | 1 | 2 | |
Federico Minutoli | 0 | 0/0/0 | 0 | 0 | 0 | |
Lorenzo Padoan | 0 | 0/0/0 | 0 | 0 | 0 | |
None (AmazeQiu) | 0 | 1/1/0 | 0 | 0 | 0 |
PRs: created by that dev and opened/merged/closed-unmerged during the period
Timespan | Opened | Closed | Comments | Labeled | Milestones |
---|---|---|---|---|---|
7 Days | 10 | 10 | 19 | 9 | 1 |
30 Days | 37 | 44 | 79 | 29 | 2 |
90 Days | 113 | 106 | 339 | 83 | 2 |
All Time | 218 | 198 | - | - | - |
Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.
The ScrapeGraphAI project has seen a notable uptick in recent activity, with 20 open issues currently being tracked. Among these, several issues highlight significant user challenges and feature requests that could impact the project's usability and functionality. Common themes include requests for enhanced support for various language models, integration with additional APIs, and improvements to existing scraping capabilities.
Several issues reflect user frustration with the current limitations of the library, particularly regarding model support and error handling. For instance, users have reported frequent errors related to unsupported models and missing features, indicating a need for more robust documentation and clearer guidance on usage.
Issue #560: Whether it can support llm models compatible with the open AI api protocol?
Issue #554: Improve tokenization function
Issue #545: embedder_model
AttributeError in /examples/openai/deep_scraper_openai.py
Issue #544: Scrapegraph returns relative path URLs instead of absolute path
Issue #543: Context length exceeded
Issue #547: Different prompt_tokens size between Azure OpenAI and OpenAI
Issue #540: The Dockerfile doesn't work well
Issue #493: Support for firecrawl
The recent issues indicate a strong demand for:
There is also a recurring theme of users encountering technical barriers due to unsupported features or bugs that hinder their ability to effectively utilize the library. Addressing these concerns could significantly improve user satisfaction and engagement with the project.
The analysis of the pull requests (PRs) for the ScrapeGraphAI project reveals a dynamic and active development environment. Currently, there are three open PRs and a significant number of closed PRs, indicating ongoing enhancements and bug fixes within the library.
PR #559: Add configuration for using requests instead of Langchain OpenAI
Created by Marco Vinciguerra, this PR introduces a configuration option to utilize the requests
library for OpenAI interactions, potentially improving flexibility and reducing dependencies on Langchain. It modifies several files, including generate_answer_node.py
, which sees a substantial increase in lines added (+119).
PR #558: Screenshot scraper integration
This PR by Marco Vinciguerra adds a new screenshot scraping feature. Review comments highlight concerns about duplicate data structures and potential issues with infinite-scrolling web pages. The PR includes multiple new files and significant code refactoring.
PR #495: Add support for Firecrawl
Introduced by Amos Dinh, this PR adapts the Search Node to utilize Firecrawl instead of Playwright for web searches. Comments from reviewers express skepticism about adding more scraping backends without a unified factory approach, indicating potential architectural concerns.
PR #557: Improve tokenization function
Closed without merging, this PR aimed to enhance tokenization but was not accepted, possibly due to overlapping changes or lack of consensus on implementation.
PR #556: Improve tokenization function
Similar to #557, this PR was also closed without merging, suggesting that the proposed changes were not deemed necessary or were superseded by other efforts.
PR #555: Add OpenAI markdown optimization to Azure
Merged successfully, this PR improves the FetchNode functionality by optimizing how markdown is handled when using OpenAI models on Azure.
PR #552: Add support for gpt-4o variant models with different pricing
Merged successfully, this PR introduces support for new model variants that offer structured responses and different pricing options.
The recent activity in the ScrapeGraphAI repository reflects a robust development cycle focused on enhancing functionality and addressing user needs. The open pull requests indicate an ongoing effort to integrate new features such as screenshot scraping and alternative libraries like Firecrawl. These additions suggest a strategic move towards increasing the library's versatility in handling various scraping scenarios.
Notably, the discussions surrounding PR #558 reveal a critical examination of code structure and performance implications. The reviewer’s comments about duplicate data structures and potential issues with dynamic content handling underscore a commitment to maintaining code quality and performance efficiency. This level of scrutiny is essential in collaborative projects where multiple contributors may introduce varying styles and practices.
The closed pull requests provide insight into the decision-making process within the team. The rejection of two tokenization improvement proposals (#557 and #556) suggests that either the changes were not aligned with current project goals or that alternative solutions were preferred. This highlights an adaptive approach to development where feedback is actively considered before proceeding with changes.
Furthermore, the successful merges of PRs like #555 and #552 demonstrate a proactive response to evolving user requirements, particularly in integrating new AI models and optimizing existing functionalities. The addition of support for gpt-4o variants indicates an awareness of market trends in AI capabilities, ensuring that ScrapeGraphAI remains competitive in its offerings.
In conclusion, the pull request activity within ScrapeGraphAI illustrates a vibrant community dedicated to continuous improvement. The balance between introducing innovative features while maintaining code integrity through rigorous review processes is commendable. As the project evolves, it will be crucial to sustain this momentum while addressing architectural concerns raised during reviews to ensure long-term maintainability and scalability.
Marco Vinciguerra (VinciGit00)
Aziz Ullah Khan (aziz-ullah-khan)
models_tokens.py
.Nafay Rizwani (Nafay-0)
llm.rst
.Federico Aguzzi (f-aguzzi)
Matteo Vedovati (vedovati-matteo)
DiTo97 (Federico Minutoli)
Lorenzo Padoan (lurenss)
AmosDinh
Feature Development:
Bug Fixes:
Documentation Updates:
Collaboration Patterns:
In Progress Work:
The development team is functioning effectively with clear roles, collaborative efforts, and a focus on enhancing the ScrapeGraphAI library's capabilities. The recent activities indicate a proactive approach to both feature enhancement and bug resolution, contributing to the project's overall growth and stability.