The Dispatch

OSS Report: apify/crawlee-python


Crawlee Python Expands Capabilities with HTTP/2 Support and Enhanced Documentation

Crawlee is an open-source web scraping and browser automation library for Python, developed by Apify. Over the past month, the project has seen significant improvements in its core functionality and user experience.

The development team has focused on expanding Crawlee's capabilities, with notable additions including HTTP/2 support for the HTTPX client, improved request handling, and enhanced documentation. These changes reflect a concerted effort to modernize the library and make it more accessible to users.

Recent Activity

Recent issues and pull requests indicate a focus on performance improvements, error handling, and expanding integration options. For instance, there's ongoing work on implementing fingerprint injection for the Playwright crawler (#401) and discussions about using Redis for distributed crawling (#536). These efforts suggest a trajectory towards more advanced scraping capabilities and improved scalability.

The development team's recent activities include:

  1. Vlada Dusek (vdusek):

    • Added support for filling web forms
    • Integrated proxies into PlaywrightCrawler
    • Currently working on a playwright fingerprint injector
  2. Jan Buchar (janbuchar):

    • Implemented ParselCrawler for Parsel support
    • Fixed issues with request queue ordering and deduplication
  3. Jindřich Bär (barjin):

    • Focused on documentation and dependency management
    • Fixed issues with Docusaurus versioning
  4. Various contributors:

    • Added HTTP/2 support for HTTPX client (#513)
    • Improved docstrings and documentation (#534, #521)
    • Fixed UTF-8 encoding in local storage (#533)
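The UTF-8 fix (#533) addresses a classic Windows pitfall reported in #532: Python's `open()` without an explicit `encoding` falls back to the platform locale encoding (often cp1252), which cannot represent many scraped characters. A minimal sketch of the general fix, independent of Crawlee's internals:

```python
import json
import tempfile
from pathlib import Path

# Always pass encoding="utf-8" on both write and read so records round-trip
# regardless of the platform's default locale encoding.
record = {"title": "Čeština – ☂"}

path = Path(tempfile.mkdtemp()) / "record.json"
with path.open("w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False)

with path.open("r", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded["title"])
```

Omitting the `encoding` argument makes the same code work on Linux and macOS (where UTF-8 is the default) but fail on many Windows machines, which matches the symptom described in #532.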

Of Note

  1. The addition of HTTP/2 support significantly modernizes Crawlee's networking capabilities.
  2. The ongoing work on fingerprint injection for Playwright crawler could enhance the library's ability to avoid detection during scraping.
  3. The project's focus on documentation improvements, including new guides, reflects a commitment to user accessibility.
  4. The discussion about Redis integration for distributed crawling (#536) suggests potential future scalability enhancements.
  5. The consistent effort to align the Python version with its JavaScript counterpart reflects a strategic commitment to feature parity across Crawlee's implementations.
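The Redis discussion (#536) centers on sharing one request queue and deduplication set across worker processes. Crawlee does not ship such an integration; the sketch below illustrates the idea with an in-memory class standing in for Redis (the comments name the Redis commands a real implementation might map to; all names here are hypothetical):

```python
from __future__ import annotations

import hashlib

class SharedRequestQueue:
    """Illustrative stand-in for a Redis-backed queue: a list models the
    pending queue (LPUSH/RPOP) and a set models deduplication (SADD)."""

    def __init__(self) -> None:
        self._pending: list[str] = []
        self._seen: set[str] = set()

    def add(self, url: str) -> bool:
        key = hashlib.sha256(url.encode()).hexdigest()
        if key in self._seen:       # SADD returning 0 in Redis
            return False
        self._seen.add(key)
        self._pending.append(url)   # LPUSH in Redis
        return True

    def fetch(self) -> str | None:
        return self._pending.pop(0) if self._pending else None  # RPOP

queue = SharedRequestQueue()
queue.add("https://example.com/a")
queue.add("https://example.com/a")  # deduplicated, not enqueued twice
queue.add("https://example.com/b")
```

With the storage swapped for an actual Redis client, multiple crawler processes could pull from the same queue without re-visiting URLs, which is the scalability gain #536 is after.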

Quantified Reports

Quantify Issues



Recent GitHub Issues Activity

Timespan   Opened  Closed  Comments  Labeled  Milestones
7 Days          5       6        11        0           2
30 Days        30      27        29        0           3
90 Days        94      69       126        3           7
1 Year        179     113       203        5          20
All Time      180     113         -        -           -

Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.

Quantify commits



Quantified Commit Activity Over 30 Days

Developer                   Branches  PRs      Commits  Files  Changes
Jindřich Bär                       2  3/3/0          6      6   125811
Vlada Dusek                        2  24/25/0       26    262     7551
Jan Buchar                         1  13/14/0       14     52     2272
renovate[bot]                      1  4/3/1          3      3      259
Apify Release Bot                  1  0/0/0         24      2       88
MS_Y                               1  0/1/0          1      5       47
Daniel Wébr                        1  1/1/0          1      1       14
Mat                                1  0/1/0          1      2       13
Gianluigi Tiesi (sherpya)          0  0/0/1          0      0        0

PRs: created by that dev and opened/merged/closed-unmerged during the period

Detailed Reports

Report On: Fetch issues



Based on the provided GitHub Issues data for the Crawlee Python project, here is an analysis of the recent activity and key issues:

Recent Activity Analysis:

The Crawlee Python project has seen significant recent activity, with numerous issues being created, discussed, and resolved over the past few months. Many of these issues relate to implementing new features, improving existing functionality, and addressing user feedback.

Notable issues and themes include:

  1. Implementing new features:

    • There are ongoing efforts to add support for various HTTP clients, including curl-cffi (#292) and improving HTTP/2 support (#512).
    • Work is being done on fingerprint injection for the Playwright crawler (#401) to improve scraping capabilities.
    • New guides are being created for various aspects of the library (#477, #478, #479, #480, #481).
  2. Performance and scaling:

    • Issues related to request queue performance (#203) and handling large numbers of requests (#487) indicate a focus on improving scalability.
    • There's ongoing work on implementing max crawl depth (#460) and improving concurrency settings.
  3. Error handling and edge cases:

    • Several issues deal with improving error handling, such as for 4xx status codes (#496) and keyboard interrupts (#212).
    • There are discussions about better handling of non-standard URLs and improving URL validation (#417).
  4. Documentation and user experience:

    • Multiple issues focus on improving documentation, including creating new guides and enhancing existing ones (#266, #304, #305).
    • There's an effort to improve the CLI experience (#267) and project bootstrapping (#215).
  5. Integration with other tools:

    • Work is being done to integrate with tools like Redis for distributed crawling (#536) and potentially adding Selenium support (#284).
  6. Compatibility and consistency:

    • There are efforts to make the Python version more consistent with the JavaScript version of Crawlee, such as implementing features like use_state (#191) and improving changelog generation (#18).
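One of the performance items above, max crawl depth (#460), amounts to capping how far a breadth-first crawl follows links. A self-contained sketch of the idea, using a toy link graph in place of real fetched pages (the graph and function names are illustrative, not Crawlee's API):

```python
from collections import deque

# Toy link graph standing in for fetched pages; a real crawler would parse
# each response to discover its outgoing links.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/a/1/x"],
    "/b": [],
}

def crawl(start: str, max_depth: int) -> list[str]:
    """Breadth-first crawl that stops enqueueing links past max_depth."""
    visited, order = {start}, []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # visit this page but do not follow its links
        for link in LINKS.get(url, []):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order

print(crawl("/", max_depth=2))  # "/a/1/x" at depth 3 is never visited
```

The depth check sits on the dequeue side, so pages at exactly `max_depth` are still processed; only their outgoing links are dropped.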

Issue Details:

Most recently created issues:

  1. #536: "Can Crawlee use Redis to build a distributed crawler?" (created 1 day ago, open)
  2. #532: "Example code beautifulsoup_crawler.py not working on Windows due to encoding assumptions" (created 2 days ago, closed)
  3. #526: "Implement/document a way how to pass extra configuration to json.dump()" (created 4 days ago, open)
  4. #524: "Implement/document a way how to pass information between handlers" (created 4 days ago, open)

Most recently updated issues:

  1. #536: "Can Crawlee use Redis to build a distributed crawler?" (updated 1 day ago, open)
  2. #532: "Example code beautifulsoup_crawler.py not working on Windows due to encoding assumptions" (updated 1 day ago, closed)
  3. #526: "Implement/document a way how to pass extra configuration to json.dump()" (updated 3 days ago, open)
  4. #524: "Implement/document a way how to pass information between handlers" (updated 3 days ago, open)
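Issue #526 asks for a way to pass extra configuration to the `json.dump()` call a library makes internally. One common pattern for exposing that (a sketch only; `save_record` is a hypothetical name, not Crawlee's API) is to accept a pre-configured serializer instead of hard-coding `json.dumps`:

```python
import json
from functools import partial

# Hypothetical sketch: let callers inject their own serializer rather than
# hard-coding json.dumps inside the library.
def save_record(record: dict, *, dumps=json.dumps) -> str:
    return dumps(record)

# Callers can then thread through options such as indent, ensure_ascii,
# sort_keys, or a custom default= encoder:
pretty = partial(json.dumps, indent=2, ensure_ascii=False, sort_keys=True)
print(save_record({"b": 1, "a": "ž"}, dumps=pretty))
```

The alternative design is a `**json_kwargs` pass-through; the injectable-callable form has the advantage of also supporting non-stdlib serializers.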

The project appears to be actively developed with a focus on expanding features, improving performance, and enhancing user experience. There's a strong emphasis on documentation and addressing user feedback, which suggests a commitment to making the library more accessible and robust.

Report On: Fetch pull requests



Overview

The analysis of the pull requests (PRs) for the apify/crawlee-python repository reveals a dynamic and active development environment. The PRs cover a wide range of topics including feature enhancements, bug fixes, documentation improvements, and dependency updates. The project's focus on continuous improvement is evident through the regular updates and refinements made to both its core functionality and its documentation.

Summary of Pull Requests

Open Pull Requests

  • PR #167: A draft PR aimed at using a local httpbin instance for tests. It's significant as it addresses issue #160 by potentially improving the testing environment's reliability and speed.

Closed Pull Requests

  • PR #534: Improved docstring of the Request class, enhancing documentation clarity.
  • PR #533: Fixed UTF-8 encoding in local storage, addressing issue #532 and ensuring consistent data handling.
  • PR #530: Introduced a header generator integrated into the HTTPX client, enhancing request customization capabilities.
  • PR #529: Modified CI workflow to allow beta releases for chore commits, streamlining release processes.
  • PR #528 & #527: Routine dependency updates managed by Renovate bot, ensuring the project uses up-to-date libraries.
  • PR #521: Added a request storage guide, improving user documentation and helping users understand request management better.
  • PR #519: Removed non-existing typer extra from dependencies, cleaning up configuration files.
  • PR #518: Refactored template structures to align with Actor templates, improving consistency across project templates.
  • PR #515: Exposed extended unique key functionality in Request.from_url, enhancing request handling capabilities.
  • PR #513: Added HTTP/2 support for HTTPX client, improving performance and compatibility with modern web standards.
  • PR #508: Corrected log level configuration handling, ensuring accurate logging behavior based on user settings.
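The extended unique key exposed in PR #515 addresses deduplication for non-GET requests: keyed on URL alone, two POSTs to the same endpoint with different bodies would wrongly collapse into one. A sketch of the underlying idea using `hashlib` (illustrative only, not Crawlee's actual key format):

```python
import hashlib

def unique_key(url: str) -> str:
    """Default dedup key: the URL alone."""
    return url

def extended_unique_key(url: str, method: str = "GET", payload: bytes = b"") -> str:
    """Extended key: hash the method and payload along with the URL, so two
    POSTs to the same URL with different bodies stay distinct requests."""
    digest = hashlib.sha256(method.encode() + b"|" + url.encode() + b"|" + payload)
    return digest.hexdigest()

a = extended_unique_key("https://example.com/api", "POST", b'{"page": 1}')
b = extended_unique_key("https://example.com/api", "POST", b'{"page": 2}')
print(a != b)  # distinct keys, while the plain URL key would collide
```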

Analysis of Pull Requests

The analysis of the PRs indicates several key themes and areas of focus within the apify/crawlee-python project:

  1. Continuous Improvement and Feature Expansion:

    • The introduction of features like HTTP/2 support (#513) and a header generator (#530) demonstrates an ongoing effort to enhance the library's capabilities. These features are crucial for keeping up with modern web standards and providing users with more powerful tools for web scraping.
  2. Documentation and Usability Enhancements:

    • PRs such as #534 (improving docstrings) and #521 (adding request storage guide) highlight a strong emphasis on documentation. This is vital for open-source projects as it directly impacts user adoption and satisfaction by making it easier for new users to understand and utilize the library effectively.
  3. Dependency Management and CI/CD Improvements:

    • Routine dependency updates (#528 & #527) managed by Renovate bot reflect good maintenance practices. Additionally, changes to CI workflows (#529) indicate efforts to streamline development processes, making them more efficient and less error-prone.
  4. Bug Fixes and Refinements:

    • Several PRs address specific issues or bugs (#533 fixing UTF-8 encoding issue, #508 correcting log level configuration). This attention to detail is important for maintaining software quality and reliability.
  5. Refactoring for Consistency and Clarity:

    • Refactoring efforts (#518 aligning template structures with Actor templates) suggest an ongoing commitment to code quality and consistency across different parts of the project. This not only improves maintainability but also enhances the developer experience by providing a more uniform codebase.
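A header generator of the kind PR #530 integrates into the HTTPX client produces plausible, varied request headers per request. The sketch below conveys the idea only; the header values and rotation strategy are made up for illustration, not taken from Crawlee's implementation:

```python
from __future__ import annotations

import random

# Illustrative pool of user agents; a real generator would draw from a much
# larger, regularly updated set.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def generate_headers(rng: random.Random | None = None) -> dict[str, str]:
    """Return one randomized set of browser-like headers."""
    rng = rng or random.Random()
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    }

headers = generate_headers(random.Random(0))
```

Varying headers per request makes scripted traffic look less uniform, which complements the anti-blocking work elsewhere in the project (e.g. the Playwright fingerprint injector).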

In conclusion, the apify/crawlee-python project exhibits robust development activity characterized by feature enhancements, meticulous attention to documentation, proactive dependency management, thorough bug fixing, and continuous refactoring. These practices are indicative of a well-managed open-source project that prioritizes both user satisfaction and developer experience.

Report On: Fetch commits



Based on the provided information, here's an analysis of the recent activities of the Crawlee Python development team:

Development Team and Recent Activity

Team Members and Recent Contributions

  1. Vlada Dusek (vdusek):

    • Most active contributor in the last 30 days with 26 commits
    • Recent work:
    • Added support for filling web forms
    • Integrated proxies into PlaywrightCrawler
    • Improved documentation and examples
    • Working on a playwright fingerprint injector (in progress)
  2. Jan Buchar (janbuchar):

    • Second most active contributor with 14 commits
    • Recent work:
    • Implemented ParselCrawler for Parsel support
    • Fixed issues with request queue ordering and deduplication
    • Improved error handling and logging
  3. Jindřich Bär (barjin):

    • 6 commits, mostly focused on documentation and dependency management
    • Fixed issues with Docusaurus versioning and documentation builds
  4. Apify Release Bot:

    • Automated 24 commits for version updates and changelog management
  5. Other contributors (renovate[bot], webrdaniel, cadlagtrader, black7375):

    • Various smaller contributions including dependency updates, documentation improvements, and bug fixes

Recent Features and Bug Fixes

  1. New Features:

    • Added support for filling and submitting web forms
    • Implemented ParselCrawler for Parsel library support
    • Added blocking detection for PlaywrightCrawler
    • Integrated proxies into PlaywrightCrawler
    • Exposed crawler log for better debugging
  2. Bug Fixes:

    • Fixed request queue ordering issues
    • Improved error handling for context pipeline errors
    • Fixed Pylance reportPrivateImportUsage errors
    • Addressed issues with project bootstrapping in existing folders
  3. Improvements:

    • Enhanced URL validation and handling
    • Improved documentation, including installation instructions and API references
    • Optimized HTTP client logging
    • Refactored code for better type hinting and consistency
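The URL validation work mentioned above typically boils down to rejecting relative paths and unsupported schemes before a request ever reaches the queue. A minimal stdlib sketch of that check (illustrative; Crawlee's own validation may differ):

```python
from urllib.parse import urlparse

def is_valid_http_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

print(is_valid_http_url("https://example.com/page"))  # True
print(is_valid_http_url("ftp://example.com"))         # False: wrong scheme
print(is_valid_http_url("/relative/path"))            # False: no host
```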

Ongoing Work

  • Vlada Dusek is currently working on a playwright fingerprint injector (branch: add-fingerprint-injector)
  • There's ongoing work on improving documentation and fixing build issues related to Docusaurus versioning

Patterns and Themes

  1. Focus on Usability:

    • Recent changes aim to make the library more user-friendly, with improved documentation and easier setup processes
  2. Browser Automation Enhancements:

    • Significant work on PlaywrightCrawler, including proxy integration and blocking detection
  3. Expanding Functionality:

    • Addition of ParselCrawler and web form support shows efforts to broaden the library's capabilities
  4. Code Quality and Maintenance:

    • Consistent efforts to improve type hinting, fix bugs, and refactor code for better maintainability
  5. Community Engagement:

    • Multiple contributors involved, including addressing community-reported issues and feature requests

The development team appears to be actively improving Crawlee Python, with a focus on enhancing its core functionality, improving user experience, and maintaining code quality. The project shows regular activity with frequent releases and a mix of feature development and bug fixing.