The `apify/crawlee-python` project is actively addressing critical bugs and enhancement requests, with particular attention to issues like the `item_count` double increment (#442) and URL validation problems (#417), which are crucial for maintaining user trust and functionality. Crawlee is a Python library for web scraping and browser automation that streamlines data extraction with tools such as BeautifulSoup and Playwright.
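For orientation, here is a minimal sketch of how a typical Crawlee crawler is wired up. It assumes the `BeautifulSoupCrawler` API as shipped in mid-2024 releases; module paths, handler decorators, and context attributes may differ in other versions.

```python
import asyncio

# Assumed import path for mid-2024 releases of crawlee-python;
# newer versions may expose the crawler under a different module.
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    # The default handler runs for every successfully fetched page.
    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        title = context.soup.title.string if context.soup.title else None
        # Store the scraped record in the default dataset.
        await context.push_data({'url': context.request.url, 'title': title})
        # Queue links discovered on the page for further crawling.
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```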
Recent issues and pull requests (PRs) reflect a concentrated effort to improve the library's core functionality and user experience. Critical bugs, such as the `item_count` issue (#442), are being tackled alongside enhancements like configurable memory management (#434), suggesting a dual focus on immediate problem resolution and long-term capability expansion.
Vlada Dusek (vdusek)
Jan Buchar (janbuchar): work on `ParselCrawler` and improved request handling.
Saurav Jain (souravjain540)
Martin Adámek (B4nan)
Renovate Bot (renovate[bot])
Apify Release Bot
New features such as `ParselCrawler` reflect ongoing development.

Developer | Branches | PRs | Commits | Files | Changes
---|---|---|---|---|---
renovate[bot] | 2 | 32/30/3 | 31 | 5 | 4301
Jan Buchar | 2 | 27/28/1 | 34 | 55 | 3108
Vlada Dusek | 2 | 11/9/1 | 10 | 71 | 2448
Martin Adámek | 1 | 0/0/0 | 2 | 2 | 1213
asymness | 1 | 1/1/0 | 1 | 8 | 469
Apify Release Bot | 1 | 0/0/0 | 12 | 2 | 448
Saurav Jain | 1 | 6/6/0 | 6 | 5 | 40
TymeeK | 1 | 0/1/0 | 1 | 4 | 34
Fauzaan Gasim | 1 | 1/1/0 | 1 | 1 | 2
Gianluigi Tiesi (sherpya) | 0 | 1/0/0 | 0 | 0 | 0
MS_Y (black7375) | 0 | 1/0/0 | 0 | 0 | 0
Mat (cadlagtrader) | 0 | 1/0/0 | 0 | 0 | 0
PRs: created by that developer and opened/merged/closed-unmerged during the period
Timespan | Opened | Closed | Comments | Labeled | Milestones |
---|---|---|---|---|---|
7 Days | 5 | 1 | 3 | 0 | 2 |
30 Days | 27 | 20 | 47 | 3 | 2 |
90 Days | 80 | 56 | 97 | 3 | 7 |
1 Year | 148 | 86 | 162 | 5 | 18 |
All Time | 149 | 87 | - | - | - |
Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.
Recent activity in the `apify/crawlee-python` GitHub repository indicates a vibrant development environment, with 62 open issues and ongoing discussions around enhancements and bug fixes. Issues related to bugs and tooling enhancements are prevalent, suggesting a focus on improving the library's functionality and user experience. Several critical bugs, such as the `item_count` double increment issue (#442) and URL validation problems (#417), could significantly impact users if not addressed promptly.
Common themes among the issues include memory management enhancements, improved request handling, and documentation improvements. The number of enhancement requests indicates that the community is actively seeking to expand Crawlee's capabilities, particularly around configuration options and performance optimization.
Issue #442: item_count double incremented when reloading dataset
This bug causes `item_count` to increment incorrectly when reusing datasets with metadata, leading to inconsistencies in data handling.
Issue #434: Make memory-related parameters of Snapshotter configurable via Configuration
Issue #433: Unify crawlee.memory_storage_client.request_queue_client with JS counterpart
Issue #427: Does the crawlee-python support preNavigationHooks?
Issue #417: URL Validation edge case - Protocol/Scheme relative URLs
Issue #354: Crawling very slow and timeout error
Issue #304: Improve API docs of public components
Issue #203: Request fetching from RequestQueue is sometimes very slow
The issues reflect a mix of urgent bugs that could hinder user experience and ongoing enhancements aimed at expanding functionality. The presence of critical bugs related to data handling and performance suggests that immediate attention is required to maintain user trust and satisfaction in this rapidly evolving project. The community's active engagement in proposing enhancements indicates a strong interest in improving Crawlee's capabilities further.
The analysis of the pull requests (PRs) for the `apify/crawlee-python` repository reveals a total of 6 open PRs and 289 closed PRs. The recent activity indicates a focus on tooling improvements, bug fixes, and dependency updates, with notable discussions around code quality and testing practices.
PR #447: chore: reschedule renovate bot
Created 1 day ago. This PR adjusts the schedule for the Renovate bot to run before 1 AM on Mondays instead of before 2 AM. It is a minor change aimed at improving automation timing.
PR #443: fix: item_count double incremented
Created 2 days ago. This PR addresses a bug where `item_count` was unexpectedly incremented when loaded from metadata. It includes a new test case but requires additional tests for thorough validation.
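For context, the kind of regression being fixed can be illustrated with a self-contained sketch (this is not Crawlee's actual implementation): a counter that is both restored from persisted metadata and re-incremented on load drifts upward every time the dataset is reopened.

```python
from dataclasses import dataclass


@dataclass
class DatasetMetadata:
    """Simplified stand-in for a dataset's persisted metadata."""
    item_count: int = 0


def reload_buggy(stored: DatasetMetadata, existing_items: int) -> DatasetMetadata:
    # BUG: the persisted count already includes the existing items,
    # so adding them again on every reload double-counts the dataset.
    stored.item_count += existing_items
    return stored


def reload_fixed(stored: DatasetMetadata, existing_items: int) -> DatasetMetadata:
    # FIX: trust the persisted count; there is nothing to add on reload.
    return stored


if __name__ == '__main__':
    assert reload_buggy(DatasetMetadata(item_count=5), existing_items=5).item_count == 10  # wrong
    assert reload_fixed(DatasetMetadata(item_count=5), existing_items=5).item_count == 5   # correct
```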
PR #431: fix: Relative URLS supports & Allow only http
Created 7 days ago. This PR aims to enhance URL handling by replacing protocol-relative URLs and restricting supported protocols to HTTP and HTTPS. Review comments suggest that existing libraries could handle this functionality more cleanly.
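As an illustration of the approach reviewers leaned toward, the standard library already resolves protocol-relative URLs against a base URL. A sketch with a hypothetical helper (not the PR's actual code) might look like this:

```python
from __future__ import annotations

from urllib.parse import urljoin, urlparse

ALLOWED_SCHEMES = {'http', 'https'}


def normalize_url(raw_url: str, base_url: str) -> str | None:
    """Hypothetical helper: resolve relative and protocol-relative URLs
    against the page they were found on, then enforce an http(s)-only policy."""
    # urljoin handles '//example.com/path' (protocol-relative) as well as
    # plain relative paths like '/about' or '../index.html'.
    absolute = urljoin(base_url, raw_url)
    scheme = urlparse(absolute).scheme
    return absolute if scheme in ALLOWED_SCHEMES else None


if __name__ == '__main__':
    base = 'https://example.com/articles/post-1'
    assert normalize_url('//cdn.example.com/img.png', base) == 'https://cdn.example.com/img.png'
    assert normalize_url('/about', base) == 'https://example.com/about'
    assert normalize_url('mailto:hello@example.com', base) is None
```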
PR #429: refactor!: RequestQueue and service management rehaul
Created 7 days ago. A significant refactor intended to unify service management and improve the RequestQueue logic. Multiple review comments indicate a need for additional tests and potential simplifications in imports.
PR #410: feat: support custom profile in playwright
Created 13 days ago. This feature allows users to specify a custom user profile directory when using Playwright, enhancing flexibility in browser automation tasks.
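The underlying Playwright capability this builds on is a persistent browser context tied to a user data directory. A minimal sketch using Playwright's own Python API (not the new Crawlee option itself, and with a hypothetical profile path) looks roughly like this:

```python
import asyncio
from pathlib import Path

from playwright.async_api import async_playwright


async def main() -> None:
    # Hypothetical profile location; an existing Chromium profile directory
    # keeps cookies, local storage, and extensions between runs.
    profile_dir = Path.home() / '.cache' / 'my-crawler-profile'

    async with async_playwright() as p:
        # launch_persistent_context reuses the given user data directory
        # instead of starting from a fresh, throwaway profile.
        context = await p.chromium.launch_persistent_context(
            user_data_dir=str(profile_dir),
            headless=True,
        )
        page = await context.new_page()
        await page.goto('https://example.com')
        print(await page.title())
        await context.close()


if __name__ == '__main__':
    asyncio.run(main())
```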
PR #167: ci: Use a local httpbin instance for tests
Created 79 days ago (currently in draft). This PR proposes using a local instance of httpbin for testing purposes but has not progressed significantly since its creation.
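The idea behind that PR, pointing tests at a locally hosted httpbin rather than the public service, can be sketched with a hypothetical environment variable and a plain httpx call (the actual test wiring in the PR may differ):

```python
import os

import httpx
import pytest

# Hypothetical switch: default to the public service, but allow CI to point
# tests at a locally running httpbin container (e.g. on http://localhost:8080).
HTTPBIN_URL = os.environ.get('HTTPBIN_URL', 'https://httpbin.org')


@pytest.mark.parametrize('status', [200, 404])
def test_status_codes(status: int) -> None:
    response = httpx.get(f'{HTTPBIN_URL}/status/{status}')
    assert response.status_code == status
```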
Numerous closed PRs focus on dependency updates, documentation improvements, minor bug fixes, and CI/CD enhancements. Notable mentions include:
PR #446: chore(deps): update dependency setuptools to v73
Merged recently, reflecting ongoing maintenance efforts to keep dependencies up-to-date.
PR #445: docs: remove-webinar
A documentation update that removed outdated webinar information from the README file.
PR #444: chore(deps): update typescript-eslint monorepo to v8.2.0
Another routine dependency update, indicating active maintenance of code quality tools.
The recent activity in the `apify/crawlee-python` repository demonstrates several key themes:
A significant number of open and closed PRs are dedicated to tooling improvements and dependency updates. The presence of multiple PRs related to updates from Renovate indicates an automated approach to maintaining dependencies, which is crucial for long-term project health. For example, PRs like #446 and #444 show proactive steps taken by maintainers to ensure that the project remains compatible with the latest versions of critical libraries.
Several open PRs directly address bugs or propose new features (e.g., PR #443 fixing the `item_count` issue and PR #410 adding support for custom profiles in Playwright). However, discussions surrounding these changes often highlight the need for additional testing or alternative approaches, as seen in PR #431, where contributors suggested leveraging existing libraries rather than introducing new checks.
The comments on various PRs reflect an engaged community focused on code quality and best practices. Contributors are encouraged to add tests (as noted in multiple reviews), which indicates a collaborative environment where code reliability is prioritized. The discussions also reveal differing opinions on implementation strategies, particularly regarding URL handling in PR #431, showcasing healthy debate about the best solutions.
While there is robust activity in terms of merging PRs and addressing issues, some older PRs remain open or have been stalled (e.g., PR #167), which may indicate areas where contributors are less active or where there are unresolved discussions about implementation details. Additionally, the draft status of some older PRs suggests that contributors may be awaiting further input or resources before proceeding.
Overall, the pull request activity within the `apify/crawlee-python` repository illustrates a dynamic development environment with a strong emphasis on maintaining code quality through regular updates and community collaboration. However, it also highlights areas where further engagement or clarity may be needed to streamline contributions and enhance project momentum.
Vlada Dusek (vdusek): work involving `setuptools` and `typescript-eslint` updates.
Jan Buchar (janbuchar): work on `ParselCrawler`, blocking detection for `PlaywrightCrawler`, and improvements in request handling.
Saurav Jain (souravjain540)
Martin Adámek (B4nan)
Renovate Bot (renovate[bot])
Apify Release Bot
New features such as `ParselCrawler` and enhancements to existing crawlers show a commitment to expanding functionality. The development team is actively engaged in both feature development and maintenance of the Crawlee project. The emphasis on dependency management, documentation, and collaboration highlights a mature development process aimed at delivering a robust web scraping solution. The team's activities reflect a balance between introducing new capabilities and ensuring that existing functionality remains stable and well documented.