The Dispatch

OSS Report: apify/crawlee


Crawlee Project Faces Persistent Memory Management Challenges Amidst Active Development

The "apify/crawlee" project, a Node.js library for web scraping and browser automation, continues to grapple with memory management issues, notably session pool growth and memory leaks, while maintaining a robust development pace.

The Crawlee library is designed to facilitate efficient web crawling and data extraction, supporting both headful and headless modes. It integrates with popular tools like Puppeteer and Playwright and offers features like proxy rotation and automatic scaling.

Recent Activity

Recent issues highlight ongoing challenges with memory management, such as #2074 concerning indefinite session pool growth and #1845 addressing memory leaks. These issues suggest a need for improved resource handling within the library. Concurrently, feature requests like support for additional HTTP status codes (#1710) and enhancements in proxy management (#2065) indicate user demand for expanded functionality.

The development team remains active, and its recent work reflects a focus on maintenance, bug resolution, and documentation enhancement.

Of Note

  1. Memory Management Issues: Persistent problems with session pools and memory leaks are critical areas needing attention.
  2. Dependency Updates: Regular updates via Renovate Bot ensure the project remains current with external libraries.
  3. Documentation Focus: Multiple contributors are enhancing documentation, indicating an emphasis on user support.
  4. Community Engagement: Contributions from various authors highlight active community involvement.
  5. Feature Development: New features like request-specific timeouts (#1560) are being developed to enhance library capabilities.

Quantified Reports

Quantify Issues



Recent GitHub Issues Activity

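| Timespan | Opened | Closed | Comments | Labeled | Milestones |
|----------|--------|--------|----------|---------|------------|
| 7 Days   | 1      | 2      | 0        | 0       | 1          |
| 30 Days  | 10     | 7      | 10       | 0       | 1          |
| 90 Days  | 29     | 26     | 29       | 1       | 2          |
| 1 Year   | 159    | 126    | 281      | 2       | 10         |
| All Time | 866    | 755    | -        | -       | -          |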

Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.
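As a rough reading aid, the opened and closed counts in the table can be turned into a net backlog change and a close ratio per timespan. The sketch below only restates the table's figures; the names `IssueWindow`, `netChange`, and `closeRatio` are illustrative and not part of any report tooling, and the ratio is an approximation because issues closed in a window were not necessarily opened in it.

```typescript
// Figures copied from the "Recent GitHub Issues Activity" table.
interface IssueWindow {
  timespan: string;
  opened: number;
  closed: number;
}

const windows: IssueWindow[] = [
  { timespan: "7 Days", opened: 1, closed: 2 },
  { timespan: "30 Days", opened: 10, closed: 7 },
  { timespan: "90 Days", opened: 29, closed: 26 },
  { timespan: "1 Year", opened: 159, closed: 126 },
  { timespan: "All Time", opened: 866, closed: 755 },
];

// Net change in open issues: a positive value means the backlog grew.
function netChange(w: IssueWindow): number {
  return w.opened - w.closed;
}

// Closed issues per opened issue, rounded to two decimals.
function closeRatio(w: IssueWindow): number {
  return Math.round((w.closed / w.opened) * 100) / 100;
}

for (const w of windows) {
  console.log(`${w.timespan}: net ${netChange(w)}, close ratio ${closeRatio(w)}`);
}
```

On these numbers, the all-time difference between opened and closed issues is 111, and over the past year roughly 0.79 issues were closed for each one opened.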

Quantify Commits



Quantified Commit Activity Over 30 Days

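| Developer               | Branches | PRs   | Commits | Files | Changes |
|-------------------------|----------|-------|---------|-------|---------|
| renovate[bot]           | 7        | 7/3/4 | 23      | 14    | 4639    |
| Saurav Jain             | 4        | 6/3/1 | 10      | 15    | 1509    |
| Martin Adámek           | 2        | 4/3/0 | 8       | 16    | 1315    |
| Apify Release Bot       | 1        | 0/0/0 | 14      | 36    | 1293    |
| Jan Buchar (janbuchar)  | 1        | 1/0/0 | 6       | 6     | 303     |
| Jindřich Bär            | 1        | 1/1/0 | 1       | 2     | 140     |
| Vlada Dusek             | 1        | 1/1/0 | 1       | 2     | 64      |
| Vlad Frangu             | 2        | 3/2/0 | 3       | 3     | 29      |
| Daniel Wébr             | 1        | 1/1/0 | 1       | 1     | 14      |
| Joe Leonard             | 1        | 1/1/0 | 1       | 2     | 4       |
| Ikko Eltociear Ashimine | 1        | 1/1/0 | 1       | 1     | 2       |
| None (Pokrt)            | 0        | 1/0/0 | 0       | 0     | 0       |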

PRs: created by that dev and opened/merged/closed-unmerged during the period
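The PRs column packs three counts into one cell. A minimal sketch of how such a cell could be split into named fields follows; `PrCounts` and `parsePrCounts` are hypothetical helper names introduced only for illustration.

```typescript
// The three counts encoded in a PRs cell such as "7/3/4".
interface PrCounts {
  opened: number;
  merged: number;
  closedUnmerged: number;
}

// Split an "opened/merged/closed-unmerged" cell into numeric fields.
function parsePrCounts(cell: string): PrCounts {
  const parts = cell.split("/").map((p) => Number.parseInt(p, 10));
  if (parts.length !== 3 || parts.some((n) => Number.isNaN(n))) {
    throw new Error(`Expected "opened/merged/closed-unmerged", got "${cell}"`);
  }
  const [opened, merged, closedUnmerged] = parts;
  return { opened, merged, closedUnmerged };
}

// The renovate[bot] row from the table.
const renovate = parsePrCounts("7/3/4");
console.log(renovate.opened, renovate.merged, renovate.closedUnmerged);
```

For the renovate[bot] row this reads as 7 PRs opened, 3 merged, and 4 closed without merging during the period.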

Detailed Reports

Report On: Fetch issues



GitHub Issues Analysis

Recent Activity Analysis

The recent activity in the "apify/crawlee" repository shows a diverse range of issues being reported and addressed, with a focus on enhancing functionality, fixing bugs, and improving documentation. The issues span various components of the library, including PlaywrightCrawler, CheerioCrawler, and memory storage.

Notable anomalies include recurring problems with memory management, such as issues with session pools growing indefinitely (#2074) and memory leaks (#1845). There are also several reports of unexpected behavior when running multiple crawlers or using specific configurations, indicating potential areas for improvement in concurrency handling and configuration management.
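One common way to keep a pool of this kind from growing without limit is to cap its size and evict the oldest entry when the cap is reached. The sketch below is illustrative only: `BoundedSessionPool`, `Session`, and `maxSize` are hypothetical names, and this is neither Crawlee's `SessionPool` implementation nor the fix discussed in #2074.

```typescript
// Hypothetical session record; not Crawlee's Session class.
interface Session {
  id: string;
  createdAt: number;
}

// A pool that never holds more than maxSize sessions.
class BoundedSessionPool {
  private readonly sessions = new Map<string, Session>();
  private readonly maxSize: number;

  constructor(maxSize: number) {
    this.maxSize = maxSize;
  }

  add(id: string, now: number = Date.now()): Session {
    // A Map iterates in insertion order, so its first key is the oldest session.
    if (!this.sessions.has(id) && this.sessions.size >= this.maxSize) {
      const oldest = this.sessions.keys().next().value;
      if (oldest !== undefined) {
        this.sessions.delete(oldest);
      }
    }
    const session: Session = { id, createdAt: now };
    this.sessions.set(id, session);
    return session;
  }

  has(id: string): boolean {
    return this.sessions.has(id);
  }

  get size(): number {
    return this.sessions.size;
  }
}

const pool = new BoundedSessionPool(2);
pool.add("a");
pool.add("b");
pool.add("c"); // evicts "a", the oldest entry
```

Evicting by insertion order is the simplest policy; a real pool would more likely retire sessions by age, error score, or usage count.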

Common themes among the issues include requests for new features like support for additional HTTP status codes (#1710), improvements to existing functionalities like proxy management (#2065), and enhancements to documentation and user guidance (#1715). Several issues also highlight the need for better error handling and more informative logging to aid debugging.

Issue Details

Most Recently Created Issues

  1. #2669: forefront option doesn't work when persistStorage is false

    • Priority: High
    • Status: Open
    • Created: 3 days ago
    • Updated: Today
  2. #2659: HTTP client switching

    • Priority: Medium
    • Status: Open
    • Created: 12 days ago
  3. #2654: remove all enums

    • Priority: Low
    • Status: Open
    • Created: 14 days ago

Most Recently Updated Issues

  1. #2669: forefront option doesn't work when persistStorage is false

    • Priority: High
    • Status: Open
    • Created: 3 days ago
    • Updated: Today
  2. #2653: Failed to prolong lock for cached request.

    • Priority: Medium
    • Status: Open
    • Created: 14 days ago
    • Updated: Today
  3. #2606: Node crash on Crawlee running fs.stat on a request_queue lock file

    • Priority: High
    • Status: Open
    • Created: 40 days ago
    • Updated: 13 days ago

These issues reflect ongoing efforts to address critical bugs affecting performance and stability, as well as requests for feature enhancements that could improve the flexibility and usability of the Crawlee library.

Report On: Fetch pull requests



Overview

The dataset provides a list of open and closed pull requests (PRs) for the "apify/crawlee" repository, which is a web scraping and browser automation library for Node.js. The PRs cover various updates, fixes, and enhancements to the project.

Summary of Pull Requests

Open Pull Requests

  1. #2670: Update dependency inquirer to v11. Created by renovate[bot], this PR updates the inquirer package to the latest version.
  2. #2665: Documentation update for new blog and minor changes by Saurav Jain. It adds a new blog and references to a Crawlee Python tutorial.
  3. #2664: Update blog sidebar to view all posts by Saurav Jain.
  4. #2663: Update dependency tough-cookie to v5 by renovate[bot].
  5. #2661: Refactor to decouple HTTP client by Jan Buchar.
  6. #2660: Update patch/minor dependencies by renovate[bot].
  7. #2658: Lock file maintenance by renovate[bot].
  8. #2652: Implement an escape hatch for deadlock state in RequestQueueV2 by Vlad Frangu.
  9. #2630: Documentation update for environment variables by Pokrt.
  10. #2605: Update dependency puppeteer to v23 by renovate[bot].
  11. #2645: Update Puppeteer to v23 by Martin Adámek.
  12. #2607: Update dependency vite-tsconfig-paths to v5 by renovate[bot].
  13. #2574: Update dependency minimatch to v10 by renovate[bot].
  14. #2569: Update dependency @types/inquirer to v9 by renovate[bot].
  15. #2521: Rewrote utils.parseOpenGraph() for better parsing capabilities by David Ball.
  16. #2477: Clarify AWS Lambda storage documentation by Connor Adams.
  17. #1560: Add request-specific timeout options in Request class by Matt Stephens.

Closed Pull Requests

  1. #2667: Documentation changes regarding updated structure of Python templates by Vlada Dusek.
  2. #2657: Fix turbo and yarn incompatibility with Node 16 by Vlad Frangu.
  3. #2656: Reset recently handled cache in RequestQueueV2 if pending too long by Vlad Frangu.
  4. #2651: Add Intercom messenger script for AI chat bot by Daniel Wébr.
  5. #2570: Update dependency inquirer to v10 - autoclosed.
  6. #2650: Improve FACEBOOK_REGEX for older style URLs by Joe Leonard.
  7. #2644: Update Playwright to 1.46 by Martin Adámek.
  8. #2642: Update dev dependencies by Martin Adámek.
  9. #2641: Use namespace imports for Cheerio compatibility with v1 by Martin Adámek.
  10. #2640: Update dependency turbo to v2 - autoclosed.

Analysis of Pull Requests

The pull requests for the "apify/crawlee" repository reveal several key themes and activities within the project:

  1. Dependency Management: A significant portion of the PRs, such as #2670, #2663, #2605, and #2607, focus on updating dependencies like inquirer, tough-cookie, puppeteer, and others to their latest versions using automated tools like Renovate Bot. This indicates a proactive approach to maintaining up-to-date software components, which is crucial for security and performance.

  2. Documentation Enhancements: Several PRs (#2665, #2630, #2477) are dedicated to improving documentation, reflecting an emphasis on user experience and ease of understanding for developers using Crawlee.

  3. Refactoring and Code Improvements: PRs like #2661 demonstrate efforts to refactor code for better modularity and maintainability, such as decoupling the HTTP client from other components.

  4. Feature Additions: New features are being introduced, such as request-specific timeouts (#1560) and improved Open Graph parsing (#2521), which enhance the functionality and flexibility of Crawlee.

  5. Bug Fixes and Performance Improvements: Some PRs address specific bugs or performance issues, such as fixing regex patterns (#2650) or optimizing request queue handling (#2656).

  6. Community Contributions: The presence of contributions from multiple authors, including external contributors like Pokrt and David Ball, highlights active community involvement in the project.

Overall, the pull requests reflect a healthy balance between maintenance tasks (such as dependency updates), feature development, bug fixes, and documentation improvements, indicating active development and community engagement in the Crawlee project. However, some older PRs, such as #1560, have remained open for extended periods, which suggests that prioritization or resource allocation could be improved for long-standing features.
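To make the idea behind request-specific timeouts (#1560) concrete, here is a minimal sketch of a per-request override that falls back to a crawler-wide default. The names `RequestLike`, `timeoutSecs`, `effectiveTimeoutSecs`, and `withTimeout` are assumptions made for illustration; the option names and behavior in the actual PR may differ.

```typescript
// Hypothetical request shape with an optional per-request timeout.
interface RequestLike {
  url: string;
  timeoutSecs?: number;
}

// Use the request's own timeout when it is a positive number, else the default.
function effectiveTimeoutSecs(request: RequestLike, crawlerDefaultSecs: number): number {
  const t = request.timeoutSecs;
  return t !== undefined && t > 0 ? t : crawlerDefaultSecs;
}

// Reject if the work does not settle within the given number of seconds.
function withTimeout<T>(work: Promise<T>, secs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${secs}s`)), secs * 1000);
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

const slowPage: RequestLike = { url: "https://example.com/slow", timeoutSecs: 120 };
const normalPage: RequestLike = { url: "https://example.com/" };

console.log(effectiveTimeoutSecs(slowPage, 30)); // per-request value wins
console.log(effectiveTimeoutSecs(normalPage, 30)); // falls back to the default
```

A handler could then be wrapped as `withTimeout(handle(request), effectiveTimeoutSecs(request, 30))`, where `handle` stands in for whatever per-request work the crawler performs.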

Report On: Fetch commits



Repo Commits Analysis

Development Team and Recent Activity

Team Members and Activities

  • Apify Release Bot

    • Frequent updates to Docker state and internal dependencies.
    • Released versions v3.11.3, v3.11.2, and v3.10.5.
    • A total of 14 commits with changes across 36 files.
  • Vlada Dusek (vdusek)

    • Worked on documentation changes related to updated structure of Python templates.
    • Made a single commit affecting 2 files.
  • Vlad Frangu (vladfrangu)

    • Fixed issues in RequestQueueV2 and resolved incompatibility between Turbo and Yarn with Node 16.
    • Contributed to reducing filesystem calls for listing heads.
    • Made 3 commits with changes across 3 files.
  • Daniel Wébr (webrdaniel)

    • Added Intercom messenger script for AI chatbot integration.
    • Made a single commit affecting 1 file.
  • Renovate Bot (renovate[bot])

    • Conducted extensive dependency updates and lock file maintenance.
    • Made 23 commits with changes across 14 files.
  • Joe Leonard (gijoehosaphat)

    • Improved FACEBOOK_REGEX to match older style page URLs.
    • Made a single commit affecting 2 files.
  • Martin Adámek (B4nan)

    • Updated Playwright to version 1.46, fixed navbar responsiveness, and made various formatting changes.
    • Made a total of 8 commits with changes across 16 files.
  • Saurav Jain (souravjain540)

    • Added multiple blog posts and made SEO-related documentation changes.
    • Made a total of 10 commits with changes across 15 files.
  • Jindřich Bär (barjin)

    • Implemented globs & regexps for SitemapRequestList.
    • Made a single commit affecting 2 files.
  • Ikko Eltociear Ashimine (eltociear)

    • Corrected a minor typo in AutoscaledPool documentation.
    • Made a single commit affecting 1 file.
  • Jan Buchar (janbuchar)

    • Introduced BaseHttpClient interface and finalized using HttpClient in send_request.
    • Made a total of 6 commits with changes across 6 files.

Patterns, Themes, and Conclusions

  • The development team is actively maintaining the project with frequent updates, especially focusing on dependency management through Renovate Bot and Apify Release Bot.
  • There is a strong emphasis on documentation updates, as seen from multiple contributions by different team members including Saurav Jain and Vlada Dusek.
  • The team is addressing both new feature implementations, such as the introduction of new HTTP client interfaces by Jan Buchar, and bug fixes, like those handled by Vlad Frangu.
  • Collaboration appears to be limited in terms of direct co-authored commits, but there is evidence of coordinated efforts in maintaining the project's infrastructure and documentation.
  • The project continues to evolve with regular releases, indicating an active development cycle aimed at improving functionality and maintaining compatibility with external dependencies.