The "apify/crawlee" project, a Node.js library for web scraping and browser automation, continues to grapple with memory management issues, notably session pool growth and memory leaks, while maintaining a robust development pace.
The Crawlee library is designed to facilitate efficient web crawling and data extraction, supporting both headful and headless modes. It integrates with popular tools like Puppeteer and Playwright and offers features like proxy rotation and automatic scaling.
Recent issues highlight ongoing challenges with memory management, such as #2074 concerning indefinite session pool growth and #1845 addressing memory leaks. These issues suggest a need for improved resource handling within the library. Concurrently, feature requests like support for additional HTTP status codes (#1710) and enhancements in proxy management (#2065) indicate user demand for expanded functionality.
The development team remains active, with notable contributions including work on RequestQueueV2 and compatibility improvements. This activity reflects a focus on maintenance, bug resolution, and documentation enhancement.
Timespan | Opened | Closed | Comments | Labeled | Milestones |
---|---|---|---|---|---|
7 Days | 1 | 2 | 0 | 0 | 1 |
30 Days | 10 | 7 | 10 | 0 | 1 |
90 Days | 29 | 26 | 29 | 1 | 2 |
1 Year | 159 | 126 | 281 | 2 | 10 |
All Time | 866 | 755 | - | - | - |
Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.
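As a rough illustration of how the table above can be read, the following TypeScript sketch computes the net change in the issue backlog and the closed-to-opened ratio for each timespan. The figures are copied from the table; the `IssueActivity` type and the `summarize` helper are hypothetical names introduced here and are not part of Crawlee or of the report tooling.

```typescript
// Issue activity figures copied from the table above.
interface IssueActivity {
  timespan: string;
  opened: number;
  closed: number;
}

const activity: IssueActivity[] = [
  { timespan: "7 Days", opened: 1, closed: 2 },
  { timespan: "30 Days", opened: 10, closed: 7 },
  { timespan: "90 Days", opened: 29, closed: 26 },
  { timespan: "1 Year", opened: 159, closed: 126 },
  { timespan: "All Time", opened: 866, closed: 755 },
];

// Hypothetical helper: net backlog change (opened minus closed)
// and the ratio of closed to opened issues in the same timespan.
function summarize(row: IssueActivity) {
  return {
    timespan: row.timespan,
    netOpen: row.opened - row.closed,
    closeRatio: row.closed / row.opened,
  };
}

for (const row of activity.map(summarize)) {
  console.log(
    `${row.timespan}: net ${row.netOpen}, closed/opened ${row.closeRatio.toFixed(2)}`,
  );
}
```

Issues closed in a timespan were not necessarily opened in it, so the ratio is only a coarse indicator of whether the backlog is growing or shrinking, not a resolution rate.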
Developer | Branches | PRs | Commits | Files | Changes |
---|---|---|---|---|---|
renovate[bot] | 7 | 7/3/4 | 23 | 14 | 4639 |
Saurav Jain | 4 | 6/3/1 | 10 | 15 | 1509 |
Martin Adámek | 2 | 4/3/0 | 8 | 16 | 1315 |
Apify Release Bot | 1 | 0/0/0 | 14 | 36 | 1293 |
Jan Buchar (janbuchar) | 1 | 1/0/0 | 6 | 6 | 303 |
Jindřich Bär | 1 | 1/1/0 | 1 | 2 | 140 |
Vlada Dusek | 1 | 1/1/0 | 1 | 2 | 64 |
Vlad Frangu | 2 | 3/2/0 | 3 | 3 | 29 |
Daniel Wébr | 1 | 1/1/0 | 1 | 1 | 14 |
Joe Leonard | 1 | 1/1/0 | 1 | 2 | 4 |
Ikko Eltociear Ashimine | 1 | 1/1/0 | 1 | 1 | 2 |
None (Pokrt) | 0 | 1/0/0 | 0 | 0 | 0 |
PRs: created by that dev and opened/merged/closed-unmerged during the period
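To make the PRs column format concrete, here is a minimal sketch that splits a cell such as `7/3/4` into its opened, merged, and closed-unmerged counts. The `PrCounts` type and `parsePrCounts` function are hypothetical names used only for illustration; the parsing rule assumes every cell has exactly three non-negative integers separated by slashes, as in the table above.

```typescript
// Counts encoded in one "PRs" cell: opened/merged/closed-unmerged.
interface PrCounts {
  opened: number;
  merged: number;
  closedUnmerged: number;
}

// Hypothetical parser for cells like "7/3/4".
function parsePrCounts(cell: string): PrCounts {
  const match = /^(\d+)\/(\d+)\/(\d+)$/.exec(cell.trim());
  if (match === null) {
    throw new Error(`Unexpected PR cell format: ${cell}`);
  }
  return {
    opened: Number(match[1]),
    merged: Number(match[2]),
    closedUnmerged: Number(match[3]),
  };
}

// Example: the renovate[bot] row from the table above.
console.log(parsePrCounts("7/3/4"));
```

Read this way, the renovate[bot] row shows seven PRs opened, three merged, and four closed without merging during the period.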
The recent activity in the "apify/crawlee" repository shows a diverse range of issues being reported and addressed, with a focus on enhancing functionality, fixing bugs, and improving documentation. The issues span various components of the library, including PlaywrightCrawler, CheerioCrawler, and memory storage.
Notable anomalies include recurring problems with memory management, such as issues with session pools growing indefinitely (#2074) and memory leaks (#1845). There are also several reports of unexpected behavior when running multiple crawlers or using specific configurations, indicating potential areas for improvement in concurrency handling and configuration management.
Common themes among the issues include requests for new features like support for additional HTTP status codes (#1710), improvements to existing functionalities like proxy management (#2065), and enhancements to documentation and user guidance (#1715). Several issues also highlight the need for better error handling and more informative logging to aid debugging.
#2669: forefront option doesn't work when persistStorage is false
#2659: HTTP client switching
#2654: remove all enums
#2653: Failed to prolong lock for cached request.
#2606: Node crash on Crawlee running fs.stat on a request_queue lock file
These issues reflect ongoing efforts to address critical bugs affecting performance and stability, as well as requests for feature enhancements that could improve the flexibility and usability of the Crawlee library.
The dataset provides a list of open and closed pull requests (PRs) for the "apify/crawlee" repository, which is a web scraping and browser automation library for Node.js. The PRs cover various updates, fixes, and enhancements to the project.
Update inquirer to v11. Created by renovate[bot], this PR updates the inquirer package to the latest version.
Update tough-cookie to v5 by renovate[bot].
RequestQueueV2 by Vlad Frangu.
Update puppeteer to v23 by renovate[bot].
Update vite-tsconfig-paths to v5 by renovate[bot].
Update minimatch to v10 by renovate[bot].
Update @types/inquirer to v9 by renovate[bot].
utils.parseOpenGraph() for better parsing capabilities by David Ball.
Request class by Matt Stephens.
RequestQueueV2 if pending too long by Vlad Frangu.
Update inquirer to v10 (autoclosed).
FACEBOOK_REGEX for older style URLs by Joe Leonard.
The pull requests for the "apify/crawlee" repository reveal several key themes and activities within the project:
Dependency Management: A significant portion of the PRs, such as #2670, #2663, #2605, and #2607, focus on updating dependencies like inquirer, tough-cookie, puppeteer, and others to their latest versions using automated tools like Renovate Bot. This indicates a proactive approach to maintaining up-to-date software components, which is crucial for security and performance.
Documentation Enhancements: Several PRs (#2665, #2630, #2477) are dedicated to improving documentation, reflecting an emphasis on user experience and ease of understanding for developers using Crawlee.
Refactoring and Code Improvements: PRs like #2661 demonstrate efforts to refactor code for better modularity and maintainability, such as decoupling the HTTP client from other components.
Feature Additions: New features are being introduced, such as request-specific timeouts (#1560) and improved Open Graph parsing (#2521), which enhance the functionality and flexibility of Crawlee.
Bug Fixes and Performance Improvements: Some PRs address specific bugs or performance issues, such as fixing regex patterns (#2650) or optimizing request queue handling (#2656).
Community Contributions: The presence of contributions from multiple authors, including external contributors like Pokrt and David Ball, highlights active community involvement in the project.
Overall, the pull requests reflect a healthy balance between maintenance tasks (such as dependency updates), feature development, bug fixes, and documentation improvements, indicating active development and community engagement in the Crawlee project. However, there are some older PRs like #1560 that remain open for extended periods, suggesting potential areas where prioritization or resource allocation could be improved to accelerate progress on long-standing issues or features.
Apify Release Bot
Vlada Dusek (vdusek)
Vlad Frangu (vladfrangu): RequestQueueV2 and resolved incompatibility between Turbo and Yarn with Node 16.
Daniel Wébr (webrdaniel)
Renovate Bot (renovate[bot])
Joe Leonard (gijoehosaphat): FACEBOOK_REGEX to match older style page URLs.
Martin Adámek (B4nan)
Saurav Jain (souravjain540)
Jindřich Bär (barjin): globs & regexps for SitemapRequestList.
Ikko Eltociear Ashimine (eltociear)
Jan Buchar (janbuchar)