Hey HN! Caleb, Nick, Garrett, and I from Mendable (YC S22) are excited to launch Intelligent Extraction for FireCrawl, the developer platform for scraping, search, and extraction.

After a successful Twitter launch last week, FireCrawl skyrocketed to over 2k stars, and we have been getting a ton of feature requests [1]. One that stood out to us in particular was using the data we scrape to extract different types of structured metadata. Think asking “Is this company open source?” across a list of URLs and getting structured JSON back.

Here’s a taste of what the request and response look like when scraping and extracting data from Firecrawl.dev.

Request format:
{
  "type": "object",
  "properties": {
    "company_mission": {
      "type": "string"
    },
    "supports_sso": {
      "type": "boolean"
    },
    "is_open_source": {
      "type": "boolean"
    }
  },
  "required": [
    "company_mission",
    "supports_sso",
    "is_open_source"
  ]
}

Response format:
{
  "company_mission": "transform any website into clean, LLM-ready markdown",
  "supports_sso": false,
  "is_open_source": true
}

The technical implementation for Intelligent Extraction involved:
1. Use Firecrawl to gather content as markdown
2. Use GPT-4 function calling to boil the content down into a structured format

Inspiration was drawn from Simon Willison [2] and Mish Ushakov of llm-scraper [3].

This is just the beginning: we launched FireCrawl about a week ago, so we expect a great deal of work will be required to make this as reliable and extensible as we envision! Any feedback would be highly appreciated [4].

[1] https://github.com/mendableai/firecrawl
[2] https://til.simonwillison.net/gpt3/openai-python-functions-d...
[3] https://github.com/mishushakov/llm-scraper/
[4] https://console.algora.io/org/mendableai
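
For anyone curious what step 2 above might look like in practice, here is a minimal Python sketch. It is not FireCrawl's actual code. It assumes the request schema is a JSON Schema object and that it gets wrapped in the OpenAI Chat Completions `tools` format; the names `extract_metadata`, `build_tool`, and `parse_arguments` are hypothetical, and the model call itself is replaced by a stand-in string so the example runs offline.

```python
import json

# Request schema from the post, expressed as a JSON Schema object.
SCHEMA = {
    "type": "object",
    "properties": {
        "company_mission": {"type": "string"},
        "supports_sso": {"type": "boolean"},
        "is_open_source": {"type": "boolean"},
    },
    "required": ["company_mission", "supports_sso", "is_open_source"],
}

# Maps the JSON Schema type names used above to Python types.
PY_TYPES = {"string": str, "boolean": bool}


def build_tool(schema, name="extract_metadata"):
    """Wrap a schema as a function-calling tool definition.

    The tool name is a hypothetical choice made for this sketch.
    """
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": "Extract structured metadata from page markdown.",
            "parameters": schema,
        },
    }


def parse_arguments(arguments_json, schema):
    """Parse the model's JSON arguments string and check it against the schema."""
    data = json.loads(arguments_json)
    missing = [key for key in schema["required"] if key not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    for key, spec in schema["properties"].items():
        if key in data and not isinstance(data[key], PY_TYPES[spec["type"]]):
            raise ValueError(f"{key} should be of type {spec['type']}")
    return data


# Stand-in for the arguments string a function-calling model would return.
sample_arguments = json.dumps({
    "company_mission": "transform any website into clean, LLM-ready markdown",
    "supports_sso": False,
    "is_open_source": True,
})
result = parse_arguments(sample_arguments, SCHEMA)
print(result["is_open_source"])
```

In a real pipeline, the markdown from step 1 would go into the user message, `build_tool(SCHEMA)` would be passed in the `tools` list, and the arguments string would come from the tool call in the model's reply. Validating before returning is one way to catch replies that drop a required field or return the wrong type.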