Turn entire websites into LLM-ready data


	Turn entire websites into LLM-ready data (firecrawl.dev)
	16 points by nickca 801 days ago

5 comments

What user agent does it use and do you obey robots.txt?

* Apparently not: I tested it with LinkedIn.com, which blocks most non-Google/Bing bots, and it crawled it OK.

> FireCrawl is built to navigate common web scraping challenges, including reverse proxies, rate limits, and caching

They probably ignore robots.txt

Do you have pay-as-you-go pricing? I feel that’s always missing from things like this. Cool otherwise.

This is cool. It seems like a nice way to easily add context for ChatGPT, at the very least.

* Creator here - That's the goal!

And do you honor or ignore robots.txt?

It wasn't in our initial version (we didn't plan on launching today), but we are pushing an update to do so now.

Does this respect any anti AI related scraping rules set forth by website owners?

It doesn't even respect general robots.txt exclusions, so... I doubt it.
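
For context, here is a minimal sketch of what honoring robots.txt (including AI-specific exclusions) could look like, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the helper names are all hypothetical; this is not FireCrawl's code, only an illustration of the check the comments above are asking about.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download it from
# https://<host>/robots.txt before requesting any other page on that host.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the given user agent may fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# The site owner has excluded the AI crawler from everything.
print(is_allowed(parser, "GPTBot", "https://example.com/blog/post"))

# Other bots fall under the wildcard rules: public pages are fine,
# /private/ is not.
print(is_allowed(parser, "ExampleBot", "https://example.com/blog/post"))
print(is_allowed(parser, "ExampleBot", "https://example.com/private/x"))
```

The check only works if the crawler identifies itself with a stable user agent, which is why the first question in the thread asks about both.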

It crawls webpages (finds subdirectories), handles JS blocking with fallbacks to headless browsers, and does all of this concurrently.

If only that script worked for every website. But, alas, it does not.
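
The crawl described above (link discovery, a headless-browser fallback, concurrency) can be sketched roughly as follows. This is an assumption-laden illustration, not FireCrawl's implementation: `plain_fetch` and `browser_fetch` are hypothetical callables, and an in-memory fake site stands in for real HTTP requests and a real headless browser so the example runs offline.

```python
import asyncio
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute same-host links found in the page."""
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in parser.links:
        absolute = urljoin(base_url, href).split("#")[0]
        if urlparse(absolute).netloc == host:
            links.append(absolute)
    return links


async def crawl(start_url, plain_fetch, browser_fetch, max_pages=50, concurrency=5):
    """Breadth-first crawl; each batch of URLs is fetched concurrently."""
    seen = {start_url}
    frontier = [start_url]
    pages = {}
    limit = asyncio.Semaphore(concurrency)

    async def visit(url):
        async with limit:
            html = await plain_fetch(url)
            if html is None:  # assumed signal: page is blocked or needs JS
                html = await browser_fetch(url)
        return url, html

    while frontier and len(pages) < max_pages:
        room = max_pages - len(pages)
        batch, frontier = frontier[:room], frontier[room:]
        for url, html in await asyncio.gather(*(visit(u) for u in batch)):
            if html is None:
                continue
            pages[url] = html
            for link in extract_links(url, html):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
    return pages


# Hypothetical in-memory site used instead of the network.
FAKE_SITE = {
    "https://example.com/": '<a href="/docs">Docs</a> <a href="/app">App</a>'
                            ' <a href="https://other.example/">Other</a>',
    "https://example.com/docs": '<a href="/docs/intro">Intro</a>',
    "https://example.com/docs/intro": "<p>Intro</p>",
    "https://example.com/app": '<a href="/">Home</a>',
}
JS_ONLY = {"https://example.com/app"}  # pages the plain fetcher cannot read
browser_calls = []


async def plain_fetch(url):
    return None if url in JS_ONLY else FAKE_SITE.get(url)


async def browser_fetch(url):
    browser_calls.append(url)  # record when the fallback was needed
    return FAKE_SITE.get(url)


pages = asyncio.run(crawl("https://example.com/", plain_fetch, browser_fetch))
print(sorted(pages))
print(browser_calls)
```

As the reply above points out, a single heuristic like "plain fetch returned nothing" will not hold for every website; real sites need per-case handling that this sketch leaves out.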