The mendableai/firecrawl project is an advanced tool that converts websites into LLM-ready markdown or structured data using sophisticated scraping and crawling techniques. Maintained by Mendable AI, it has attracted over 22,373 stars on GitHub, reflecting its popularity and utility. The project is in a robust state, with active development focused on enhancing web data extraction capabilities for AI applications.
Gergő Móricz (mogery)
Nicolas (nickscamara)
Rafael Miller (rafaelsideguide)
Eric Ciarla (ericciarla)
Thomas Kosmas (tomkosm)
Dependabot[bot]
Timespan | Opened | Closed | Comments | Labeled | Milestones |
---|---|---|---|---|---|
7 Days | 4 | 1 | 2 | 2 | 1 |
30 Days | 17 | 7 | 31 | 5 | 1 |
90 Days | 72 | 41 | 155 | 14 | 1 |
All Time | 432 | 322 | - | - | - |
Like all software activity quantification, these numbers are imperfect but sometimes useful. Comments, Labels, and Milestones refer to those issues opened in the timespan in question.
Developer | Branches | PRs | Commits | Files | Changes |
---|---|---|---|---|---|
Nicolas (nickscamara) | 3 | 5/5/0 | 85 | 93 | 55956 |
None (dependabot[bot]) | 5 | 13/0/8 | 5 | 5 | 10601 |
Rafael Miller (rafaelsideguide) | 3 | 1/0/1 | 17 | 24 | 2749 |
Gergő Móricz (mogery) | 2 | 0/1/0 | 53 | 47 | 1867 |
Thomas Kosmas (tomkosm) | 1 | 1/0/0 | 1 | 3 | 272 |
Eric Ciarla (ericciarla) | 1 | 0/0/0 | 3 | 2 | 181 |
CyberKn1ght (aanokh) | 0 | 1/0/0 | 0 | 0 | 0 |
Ademílson Tonato (ftonato) | 0 | 0/0/1 | 0 | 0 | 0 |
aee (aeelbeyoglu) | 0 | 1/0/1 | 0 | 0 | 0 |
Hercules Smith (ProfHercules) | 0 | 1/0/0 | 0 | 0 | 0 |
PRs: created by that dev and opened/merged/closed-unmerged during the period
Risk | Level (1-5) | Rationale |
---|---|---|
Delivery | 4 | The project faces significant delivery risks due to a backlog of unresolved issues and critical open issues like #1082 and #1071. The volume of changes in pull requests, such as PR #1083, and the lack of comprehensive testing and documentation further exacerbate these risks. Additionally, dependency management challenges, particularly with self-hosting configurations, could impact delivery timelines. |
Velocity | 4 | The project's velocity is at risk due to the accumulation of unresolved issues and the high volume of open pull requests. While there is active development, as seen in PR #1083 and recent commits by key contributors, the uneven distribution of workload among team members could lead to bottlenecks. The reliance on a few developers for major contributions also poses a risk to sustained velocity. |
Dependency | 3 | Dependency risks are moderate, with active management through dependabot updates (e.g., PRs #1079, #1078). However, the integration challenges posed by numerous dependency changes could impact stability if not thoroughly tested. Issues like #1082 highlight potential risks from external system dependencies. |
Team | 3 | The team faces potential risks related to uneven workload distribution and possible burnout among key contributors like Nicolas and Gergő Móricz. The need for improved documentation and communication, as indicated by user feedback, suggests challenges in team collaboration and onboarding processes. |
Code Quality | 4 | Code quality is at risk due to the concentration of contributions among a few developers and the lack of comprehensive testing details in pull requests like PR #1083. Frequent changes to critical files without thorough reviews could lead to technical debt. The presence of unresolved bugs (#1071) further indicates potential quality issues. |
Technical Debt | 4 | Technical debt is accumulating due to unresolved issues and frequent modifications to core files like 'extraction-service.ts'. The lack of detailed documentation and test coverage in significant feature additions (e.g., PR #1047) contributes to this risk. Ongoing development without addressing these gaps could exacerbate technical debt. |
Test Coverage | 4 | Test coverage is insufficient, as indicated by the lack of explicit testing details in major pull requests like PR #1083. While some unit tests exist (e.g., 'mix-schemas.test.ts'), they may not cover all edge cases or integration scenarios, increasing the risk of undetected bugs. |
Error Handling | 3 | Error handling mechanisms are present but may not be comprehensive enough to catch all potential failures, especially given the reliance on asynchronous operations and external API calls in critical components like 'extraction-service.ts'. The ongoing improvements in logging and error handling are positive but require further enhancement. |
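The error-handling risk above concerns asynchronous operations and external API calls. As a purely illustrative sketch (the `withRetries` helper and the `Logger` shape are assumptions for this example, not Firecrawl code), this is the kind of retry-with-logging wrapper such calls typically need:

```typescript
// Illustrative only: a generic retry wrapper around an async operation,
// logging each failed attempt before rethrowing the last error.
type Logger = { warn: (msg: string) => void; error: (msg: string) => void };

async function withRetries<T>(
  op: () => Promise<T>,
  logger: Logger,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await op(); // success: return immediately
    } catch (err) {
      lastError = err;
      logger.warn(`attempt ${attempt}/${maxAttempts} failed: ${String(err)}`);
    }
  }
  logger.error(`all ${maxAttempts} attempts failed`);
  throw lastError;
}
```

Wrapping external calls this way keeps transient failures (network hiccups, rate limits) from surfacing as hard errors, while the log trail aids the troubleshooting the report mentions.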
Recent activity in the mendableai/firecrawl GitHub repository shows a diverse range of issues being reported and addressed. The project has an active community with numerous open issues, indicating ongoing development and user engagement. Key themes include self-hosting challenges, feature requests for enhanced scraping capabilities, and bug reports related to specific website interactions.
Notable anomalies include recurring issues with self-hosted deployments, particularly around Redis configuration and resource management. There are also several reports of unexpected behavior when dealing with dynamic content or specific web technologies like Cloudflare protection. Additionally, users have expressed interest in more robust proxy support and better handling of authentication-required pages.
A significant portion of the issues revolves around enhancing the tool's ability to handle complex web environments, such as those requiring JavaScript execution or dealing with nested sitemaps. There's also a clear demand for improved documentation and examples to aid new users in setting up and utilizing Firecrawl effectively.
One open issue reports the error "gpt-4o does not exist or you do not have access to it." (updated 11 days ago; status: Open; priority: Low). These issues highlight ongoing challenges in both the hosted and self-hosted environments, particularly around integration with other tools and services, as well as handling complex web structures. The project's maintainers are actively engaging with the community to address these concerns, indicating a responsive development process.
mendableai/firecrawl
#1083: Rafa/extract options
#1079: apps/test-suite(deps-dev): bump the dev-deps group in /apps/test-suite with 3 updates, including @types/jest, artillery, and typescript. Keeping dependencies up-to-date is crucial for security and performance.
#1078: apps/test-suite(deps): bump the prod-deps group in /apps/test-suite with 7 updates, including @anthropic-ai/sdk and openai.
#1077: apps/api(deps-dev): bump the dev-deps group in /apps/api with 12 updates
#1076: apps/api(deps): bump the prod-deps group in /apps/api with 47 updates, including @supabase/supabase-js and mongoose.
#1073: (feat/index) Index/Insertion queue
#1072: (feat/formats) Extract format renamed to json format
#1068: (feat/extract) - LLMs usage analysis + billing
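The dev-deps/prod-deps PRs above are characteristic of Dependabot's grouped-updates feature, which batches related bumps into a single PR per group. A minimal sketch of such a configuration (the schedule and group keys here are assumptions for illustration, not taken from the repository's actual .github/dependabot.yml):

```yaml
version: 2
updates:
  - package-ecosystem: "npm"
    directory: "/apps/test-suite"
    schedule:
      interval: "weekly"
    groups:
      dev-deps:
        dependency-type: "development"   # batches devDependencies into one PR
      prod-deps:
        dependency-type: "production"    # batches runtime dependencies into one PR
```

Grouping keeps the PR volume manageable (one PR per group per ecosystem) at the cost of larger, harder-to-bisect diffs, which is the integration-testing risk noted in the Dependency row above.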
extraction-service.ts: Functions like analyzeSchemaAndPrompt and performExtraction are logically structured, each handling specific tasks in the extraction process. Use of interfaces such as ExtractServiceOptions and ExtractResult enhances type safety. However, the file is quite lengthy (834 lines) and could be broken down into smaller modules for better maintainability.
crawler.ts: The main class (WebCrawler) encapsulates crawling logic effectively. Methods like filterLinks, getRobotsTxt, and extractLinksFromHTML are well-defined.
map.ts: getMapResults is the core function handling URL mapping logic.
mix-schema-objs.ts: mergeResults effectively handles nested structures within schemas.
index.ts (Scraping Engines): The use of enums (Engine) and feature flags (FeatureFlag) provides clarity. Functions like buildFallbackList are well-designed to handle engine selection logic.
Consistency in Style: Across all files, there is consistent use of TypeScript features such as interfaces, enums, and async/await for asynchronous operations.
Modularity: While some files are lengthy (e.g., extraction-service.ts), they generally follow good modular design principles. Further decomposition could improve readability.
Error Handling & Logging: Comprehensive error handling combined with detailed logging provides robustness against runtime errors and aids in troubleshooting.
Integration & Dependencies: There is strong integration between different parts of the codebase, indicating a well-thought-out architecture that supports complex functionalities like web crawling and data extraction.
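The nested-structure merging attributed to mergeResults above can be sketched as a recursive deep merge. This is a hedged illustration of the idea only; the actual Firecrawl implementation in mix-schema-objs.ts may differ in its conflict rules:

```typescript
// Illustrative sketch: recursively merge two extraction results, descending
// into nested objects, concatenating arrays, and letting later scalars win.
type Obj = Record<string, unknown>;

function isPlainObject(v: unknown): v is Obj {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function mergeResults(a: Obj, b: Obj): Obj {
  const out: Obj = { ...a };
  for (const [key, value] of Object.entries(b)) {
    const existing = out[key];
    if (isPlainObject(existing) && isPlainObject(value)) {
      out[key] = mergeResults(existing, value); // recurse into nested objects
    } else if (Array.isArray(existing) && Array.isArray(value)) {
      out[key] = [...existing, ...value]; // concatenate array results
    } else {
      out[key] = value; // later result wins for scalars
    }
  }
  return out;
}
```

For example, merging `{ meta: { title: "A" }, tags: ["x"] }` with `{ meta: { author: "B" }, tags: ["y"] }` yields a single object with both meta fields and both tags.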
Overall, the codebase demonstrates a high level of sophistication suitable for a project focused on web scraping and data extraction at scale.
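The engine-selection design praised above (Engine, FeatureFlag, buildFallbackList) can be sketched as follows. The enum members, capability table, and preference order here are assumptions for illustration; only the three identifier names come from the review, not this mapping:

```typescript
// Illustrative sketch: pick scraping engines that satisfy the required
// feature flags, in a fixed cheapest-first preference order.
enum Engine {
  Fetch = "fetch",
  ScrapingService = "scraping-service",
  Playwright = "playwright",
}

enum FeatureFlag {
  JavaScript = "javascript",
  Screenshot = "screenshot",
}

// Assumed capabilities of each engine (not Firecrawl's actual matrix).
const engineFeatures: Record<Engine, Set<FeatureFlag>> = {
  [Engine.Fetch]: new Set(),
  [Engine.ScrapingService]: new Set([FeatureFlag.JavaScript]),
  [Engine.Playwright]: new Set([FeatureFlag.JavaScript, FeatureFlag.Screenshot]),
};

const preferenceOrder = [Engine.Fetch, Engine.ScrapingService, Engine.Playwright];

function buildFallbackList(required: FeatureFlag[]): Engine[] {
  // Keep only engines that support every required flag, preserving order.
  return preferenceOrder.filter((engine) =>
    required.every((flag) => engineFeatures[engine].has(flag)),
  );
}
```

With no required flags, every engine stays in the fallback list; requiring a screenshot narrows it to the browser-based engine, which is the kind of selection logic the review describes.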
Active Development: The project is under active development with frequent commits from key contributors like Gergő Móricz and Nicolas. There is a strong focus on improving existing functionalities such as sitemap handling, logging, and extraction services.
Collaborative Efforts: Team members often collaborate on merges and feature implementations, indicating a cohesive development process. This is evident in the frequent merges between branches managed by different developers.
Feature Enhancements: Recent activities highlight ongoing efforts to enhance the project's capabilities, particularly in extraction services, logging improvements, and sitemap functionality.
Dependency Management: Regular updates by Dependabot indicate a proactive approach to maintaining dependency health within the project.
Diverse Contributions: While some team members focus heavily on core functionalities (e.g., Gergő Móricz), others contribute through examples or specific enhancements (e.g., Eric Ciarla).
Overall, the Firecrawl project demonstrates a dynamic development environment with continuous improvements and active collaboration among team members.