The Dispatch

OSS Report: apify/crawlee-python


Critical Bugs and Enhancements Dominate Crawlee-Python's Development Landscape

The apify/crawlee-python project is actively addressing critical bugs and enhancements, with particular attention to the item_count double increment (#442) and URL validation problems (#417), both of which matter for maintaining user trust and core functionality. Crawlee is a Python library for web scraping and browser automation that builds on tools like BeautifulSoup and Playwright to simplify data extraction.

Recent Activity

Recent issues and pull requests (PRs) reflect a concentrated effort on improving the library's core functionalities and user experience. The critical bugs, such as the item_count issue (#442), are being tackled alongside enhancements like memory management configurability (#434). This suggests a dual focus on immediate problem resolution and long-term capability expansion.

Development Team and Recent Contributions

  1. Vlada Dusek (vdusek)

    • Updated dependencies and documentation.
    • Worked on HTTP client features and proxy management.
  2. Jan Buchar (janbuchar)

    • Developed ParselCrawler and improved request handling.
    • Fixed request dequeueing order and enhanced CLI error handling.
  3. Saurav Jain (souravjain540)

    • Focused on documentation updates, removing outdated content.
  4. Martin Adámek (B4nan)

    • Enhanced documentation structure and responsiveness.
  5. Renovate Bot (renovate[bot])

    • Automated dependency updates across various files.
  6. Apify Release Bot

    • Managed release processes, updating changelogs and package versions.

Of Note

Quantified Reports




Quantified Commit Activity Over 30 Days

Developer                  Branches  PRs      Commits  Files  Changes
renovate[bot]              2         32/30/3  31       5      4301
Jan Buchar                 2         27/28/1  34       55     3108
Vlada Dusek                2         11/9/1   10       71     2448
Martin Adámek              1         0/0/0    2        2      1213
asymness                   1         1/1/0    1        8      469
Apify Release Bot          1         0/0/0    12       2      448
Saurav Jain                1         6/6/0    6        5      40
TymeeK                     1         0/1/0    1        4      34
Fauzaan Gasim              1         1/1/0    1        1      2
Gianluigi Tiesi (sherpya)  0         1/0/0    0        0      0
MS_Y (black7375)           0         1/0/0    0        0      0
Mat (cadlagtrader)         0         1/0/0    0        0      0

PRs: counts of PRs created by that developer that were opened/merged/closed-unmerged during the period.




Recent GitHub Issues Activity

Timespan  Opened  Closed  Comments  Labeled  Milestones
7 Days    5       1       3         0        2
30 Days   27      20      47        3        2
90 Days   80      56      97        3        7
1 Year    148     86      162       5        18
All Time  149     87      -         -        -

Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.

Detailed Reports

Report On: Fetch issues



Recent Activity Analysis

The recent activity in the apify/crawlee-python GitHub repository indicates a vibrant development environment, with 62 open issues and ongoing discussions around enhancements and bug fixes. Notably, issues related to bugs and enhancements in tooling are prevalent, suggesting a focus on improving the library's functionality and user experience. There are several critical bugs, such as the item_count double increment issue (#442) and URL validation problems (#417), which could significantly impact users if not addressed promptly.

Common themes among the issues include enhancements to memory management, request handling, and documentation improvements. The presence of multiple enhancement requests indicates that the community is actively seeking to expand Crawlee's capabilities, particularly in areas like configuration options and performance optimizations.

Issue Details

Recently Created Issues

  1. Issue #442: item_count double incremented when reloading dataset

    • Priority: High
    • Status: Open
    • Created: 2 days ago
    • Updated: 0 days ago
    • Description: This bug causes item_count to increment incorrectly when reusing datasets with metadata, leading to inconsistencies in data handling.
  2. Issue #434: Make memory-related parameters of Snapshotter configurable via Configuration

    • Priority: Medium
    • Status: Open
    • Created: 6 days ago
    • Updated: N/A
    • Description: A proposal to enhance configurability for memory management within the Snapshotter component.
  3. Issue #433: Unify crawlee.memory_storage_client.request_queue_client with JS counterpart

    • Priority: Medium
    • Status: Open
    • Created: 6 days ago
    • Updated: N/A
    • Description: This enhancement aims to align Python's memory storage client with its JavaScript equivalent for consistency across platforms.
  4. Issue #427: Does the crawlee-python support preNavigationHooks?

    • Priority: Low
    • Status: Open
    • Created: 8 days ago
    • Updated: N/A
    • Description: A user inquiry regarding support for pre-navigation hooks, which are currently not implemented.
  5. Issue #417: URL Validation edge case - Protocol/Scheme relative URLs

    • Priority: High
    • Status: Open
    • Created: 11 days ago
    • Updated: 1 day ago
    • Description: This bug highlights issues with validating protocol-relative URLs, which can lead to crawl failures.
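The protocol-relative URL edge case behind issue #417 can be illustrated with the standard library alone. This is a hypothetical sketch of the failure mode, not Crawlee's actual validation code; the is_supported_url helper is invented for illustration:

```python
from urllib.parse import urljoin, urlparse

def is_supported_url(url: str) -> bool:
    # A naive scheme check like this rejects protocol-relative URLs
    # ("//host/path"), because urlparse reports an empty scheme for them.
    return urlparse(url).scheme in ("http", "https")

# Protocol-relative links are common in scraped HTML:
link = "//example.com/page"
print(is_supported_url(link))       # False: no scheme yet, so the link is dropped

# Resolving against the page URL restores the scheme before validation:
resolved = urljoin("https://example.com/start", link)
print(resolved)                     # https://example.com/page
print(is_supported_url(resolved))   # True
```

A crawler that validates raw hrefs before resolving them against the current page URL will silently skip such links, which matches the crawl failures described in the issue.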

Recently Updated Issues

  1. Issue #354: Crawling very slow and timeout error

    • Priority: High
    • Status: Open
    • Created: 28 days ago
    • Updated: 6 days ago
    • Description: Users report significant performance degradation after prolonged crawling sessions, raising concerns about memory management and queue size.
  2. Issue #304: Improve API docs of public components

    • Priority: Medium
    • Status: Open
    • Created: 37 days ago
    • Updated: 2 days ago
    • Description: A request to enhance the API documentation for better clarity and usability.
  3. Issue #203: Request fetching from RequestQueue is sometimes very slow

    • Priority: High
    • Status: Open
    • Created: 62 days ago
    • Updated: 1 day ago
    • Description: Reports indicate that fetching requests from the queue can be sluggish, potentially affecting overall crawler performance.

Summary of Observations

The issues reflect a mix of urgent bugs that could hinder user experience and ongoing enhancements aimed at expanding functionality. The presence of critical bugs related to data handling and performance suggests that immediate attention is required to maintain user trust and satisfaction in this rapidly evolving project. The community's active engagement in proposing enhancements indicates a strong interest in improving Crawlee's capabilities further.

Report On: Fetch pull requests



Report on Pull Requests

Overview

The analysis of the pull requests (PRs) for the apify/crawlee-python repository reveals a total of 6 open PRs and 289 closed PRs. The recent activity indicates a focus on tooling improvements, bug fixes, and dependency updates, with notable discussions around code quality and testing practices.

Summary of Pull Requests

Open Pull Requests

  • PR #447: chore: reschedule renovate bot
    Created 1 day ago. This PR adjusts the schedule for the Renovate bot to run before 1 AM on Mondays instead of before 2 AM. It is a minor change aimed at improving automation timing.

  • PR #443: fix: item_count double incremented
    Created 2 days ago. This PR addresses a bug where item_count was unexpectedly incremented when loaded from metadata. It includes a new test case but requires additional tests for thorough validation.

  • PR #431: fix: Relative URLS supports & Allow only http
    Created 7 days ago. This PR aims to enhance URL handling by replacing protocol-relative URLs and restricting supported protocols to HTTP and HTTPS. Review comments suggest that existing libraries could handle this functionality more cleanly.

  • PR #429: refactor!: RequestQueue and service management rehaul
    Created 7 days ago. A significant refactor intended to unify service management and improve the RequestQueue logic. Multiple review comments indicate a need for additional tests and potential simplifications in imports.

  • PR #410: feat: support custom profile in playwright
    Created 13 days ago. This feature allows users to specify a custom user profile directory when using Playwright, enhancing flexibility in browser automation tasks.

  • PR #167: ci: Use a local httpbin instance for tests
    Created 79 days ago (currently in draft). This PR proposes using a local instance of httpbin for testing purposes but has not progressed significantly since its creation.

Closed Pull Requests

Numerous closed PRs focus on dependency updates, documentation improvements, minor bug fixes, and CI/CD enhancements. Notable mentions include:

  • PR #446: chore(deps): update dependency setuptools to v73
    Merged recently, reflecting ongoing maintenance efforts to keep dependencies up-to-date.

  • PR #445: docs: remove-webinar
    A documentation update that removed outdated webinar information from the README file.

  • PR #444: chore(deps): update typescript-eslint monorepo to v8.2.0
    Another routine dependency update, indicating active maintenance of code quality tools.

Analysis of Pull Requests

The recent activity in the apify/crawlee-python repository demonstrates several key themes:

Focus on Tooling and Maintenance

A significant number of open and closed PRs are dedicated to tooling improvements and dependency updates. The presence of multiple PRs related to updates from Renovate indicates an automated approach to maintaining dependencies, which is crucial for long-term project health. For example, PRs like #446 and #444 show proactive steps taken by maintainers to ensure that the project remains compatible with the latest versions of critical libraries.

Bug Fixes and Feature Enhancements

Several open PRs directly address bugs or propose new features (e.g., PR #443 fixing the item_count issue and PR #410 adding support for custom profiles in Playwright). However, discussions surrounding these changes often highlight the need for additional testing or alternative approaches, as seen in PR #431 where contributors suggested leveraging existing libraries rather than introducing new checks.

Community Engagement

The comments on various PRs reflect an engaged community focused on code quality and best practices. Contributors are encouraged to add tests (as noted in multiple reviews), which indicates a collaborative environment where code reliability is prioritized. The discussions also reveal differing opinions on implementation strategies, particularly regarding URL handling in PR #431, showcasing healthy debate about the best solutions.

Anomalies

While there is robust activity in terms of merging PRs and addressing issues, some older PRs remain open or have been stalled (e.g., PR #167), which may indicate areas where contributors are less active or where there are unresolved discussions about implementation details. Additionally, the draft status of some older PRs suggests that contributors may be awaiting further input or resources before proceeding.

Conclusion

Overall, the pull request activity within the apify/crawlee-python repository illustrates a dynamic development environment with a strong emphasis on maintaining code quality through regular updates and community collaboration. However, it also highlights areas where further engagement or clarity may be needed to streamline contributions and enhance project momentum.

Report On: Fetch commits



Repo Commits Analysis

Development Team and Recent Activity

Team Members and Recent Contributions

  1. Vlada Dusek (vdusek)

    • Recent Activity:
    • Updated dependencies including setuptools and typescript-eslint.
    • Worked on documentation improvements and fixed broken links.
    • Contributed to the implementation of features related to HTTP clients and proxy management.
    • Collaborations: Frequently co-authored with Jan Buchar on various features and fixes.
  2. Jan Buchar (janbuchar)

    • Recent Activity:
    • Implemented significant features such as ParselCrawler, blocking detection for PlaywrightCrawler, and improvements in request handling.
    • Addressed multiple bugs, including fixing request dequeueing order and enhancing error handling in the CLI.
    • Collaborations: Regularly collaborated with Vlada Dusek and contributed to various branches focusing on core functionality.
  3. Saurav Jain (souravjain540)

    • Recent Activity:
    • Focused on documentation updates, including removing webinar information from the README and improving the website configuration.
    • Collaborations: Primarily worked independently but contributed to documentation alongside other team members.
  4. Martin Adámek (B4nan)

    • Recent Activity:
    • Engaged in documentation enhancements, fixing responsiveness issues, and improving overall content structure.
    • Collaborations: Worked closely with other developers on documentation-related tasks.
  5. Renovate Bot (renovate[bot])

    • Recent Activity:
    • Automated dependency updates across various files, ensuring that the project remains up-to-date with its dependencies.
    • Collaborations: Functions independently but integrates changes into the main repository.
  6. Apify Release Bot

    • Recent Activity:
    • Managed release processes, including updating changelogs and package versions.
    • Collaborations: Operates independently without direct collaboration.

Patterns and Themes

  • Focus on Dependency Management: A significant number of recent commits involve updating dependencies, indicating a proactive approach to maintaining code quality and security.
  • Documentation Improvements: Multiple team members have dedicated efforts towards enhancing documentation, which is crucial for user adoption and support.
  • Feature Development: Active contributions towards new features like ParselCrawler and enhancements in existing crawlers show a commitment to expanding functionality.
  • Bug Fixes and Enhancements: There is a consistent effort to address bugs, improve error handling, and refine user experience in CLI operations.
  • Collaboration Across Members: Frequent co-authorship among team members suggests a collaborative environment where knowledge sharing is encouraged.

Conclusions

The development team is actively engaged in both feature development and maintenance of the Crawlee project. The emphasis on dependency management, documentation, and collaborative efforts highlights a mature development process aimed at delivering a robust web scraping solution. The team's activities reflect a balance between introducing new capabilities while ensuring existing functionalities are stable and well-documented.