HN Mirror


	by RamblingCTO 105 days ago
	Doesn't work for pages protected by cloudflare in my experience. What a shame, they could've produced the problem and sold the solution.

8 comments

paxys 105 days ago

That’s what they are doing. This is a textbook protection racket.

“Buy Cloudflare bot protection, otherwise it would be a shame if your site got scraped and ddos’d.”

Who is doing the scraping and ddosing? Cloudflare.

tracker1 105 days ago

In this case, sure... that said, I've worked on a few sites where more than half the traffic was bots because the content was useful for other sites (classic car classifieds/sales site). The fact that just over half the page requests were actually search query results is what meant a lot of optimization steps in practice... Implementing a "search" database (mongodb and elastic were pretty new at the time), denormalizing a lot of the data structures on the "enterprise" SQL structures for search and display for not logged in users, etc. Heavier caching, donut caching, etc.

It was an interesting and sometimes fun part of my career. Working on a site/application that isn't necessarily a tech site, and that I have a personal interest in was pretty great... some of the pace for sales/commercial features less so, with sales making deals requiring deep integrations on impossible timelines. You learn a lot when a self-hosted site is being kicked while it's down... The cloud migration to get a better use of flexible resources, etc.

kentonv 105 days ago

You can trivially block Cloudflare crawl via robots.txt. You don't need to buy Cloudflare's bot protection -- this is not a malicious bot.

https://x.com/CloudflareDev/status/2031745285517455615

(Disclosure: I work for Cloudflare but not on this product. I get pretty tired of the conspiracy theories TBH.)
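
To illustrate the robots.txt point, here is a minimal sketch of how a well-behaved crawler decides whether it may fetch a URL, using Python's standard `urllib.robotparser`. The user-agent token `ExampleCrawler` and the `example.com` URLs are placeholders: the thread does not state the token the Cloudflare crawler actually announces, so check the linked docs before copying the rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named crawler everywhere and keeps
# /private/ off-limits for everyone else. "ExampleCrawler" is a placeholder
# token, not the real Cloudflare user agent.
ROBOTS_TXT = """\
User-agent: ExampleCrawler
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleCrawler", "https://example.com/docs"),
        ("OtherBot", "https://example.com/docs"),
        ("OtherBot", "https://example.com/private/report"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

This only models a crawler that chooses to honor the file; robots.txt is advisory and does nothing against a bot that ignores it.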

tyingq 105 days ago

That's too funny. If true, really looking forward to the Cloudflare response here. I'm unsure how you would spin that in a way that didn't seem self-serving.

morpheuskafka 105 days ago

It's very clearly disclosed in the linked docs already: Cloudflare Bot Protection will block it the same as all other bots, unless you choose to allow it as an exception. If they didn't do it that way, people would accuse them of either bypassing their own product (possibly anticompetitive) or just having a low-quality one.

tyingq 105 days ago

So it doesn't take any action to work around other bot protections? Feels like that would be on the list of features an AI company wanting to scrape would ask for.

kentonv 105 days ago

No, it does not take any action to work around other bot protections.

https://x.com/CloudflareDev/status/2031745285517455615

(Disclosure: I work for Cloudflare but not on this product.)

kentonv 105 days ago

Cloudflare crawl respects robots.txt. It does not attempt to bypass any anti-crawling measures. If the site doesn't want to be crawled -- whether it uses Cloudflare or not -- this product will not help you crawl it.

Some sites actually want crawlers -- e.g. sites that are selling a product, documentation, etc. That's what this product is meant for.

https://x.com/CloudflareDev/status/2031745285517455615

(Disclosure: I work for Cloudflare but not on this product.)

GodelNumbering 105 days ago

I imagine that would cause a backlash from the website owners trusting cloudflare to keep their content 'safe'

chvid 105 days ago

As long as it gets past Azure's bot protection ...

antonyh 105 days ago

Wait. What?

Is this just a way to strong-arm non-cloudflarians into adopting their platform if you don't want your site crawled? It does sound like they are selling the solution to avoid their own content crawler.

davidhariri 105 days ago

Came here to write this. I am getting much better results from Firecrawl (not affiliated with them, just a happy customer).

oasisbob 105 days ago

As someone who helps keep a site online with a lot of content, I have mixed feelings on Firecrawl.

On one hand, their bots seem much more well behaved than others.

However, running a crawler fleet which is deceptive and evasive in its identification and doesn't honor REP is no way to build a business.

kordlessagain 105 days ago

I'd love for you to kick the tires on https://grubcrawler.dev

RamblingCTO 105 days ago

fuck firecrawl. they copied my idea by showing interest in my product and then copied it, used their YC money to give it all out for free. fuck nick in particular. I'm still salty over this

xeornet 105 days ago

"they copied my idea by showing interest in my product and then copied it". What exactly is revolutionary about Firecrawl or your product? Scraping APIs have been around for over a decade.

RamblingCTO 105 days ago

I was the first to return markdown and use reader mode stuff to strip irrelevant stuff. There's copying, and there's talking to the founder sounding interested to have your team copy what I did in the background. One is fair game, the other is a dick head move.

xeornet 105 days ago

Not sure about the first claim. But yes, talking to the founder, sharing details and having it stolen is not a good look. Sorry that happened to you.

keeda 105 days ago

I think that is a neat idea and it sucks this happened, but how long before somebody simply saw that feature and replicated it? I'm curious, had you considered a deeper moat than that?

This is especially relevant given AI is making this kind of thing easy at an industrial scale. I think we should all be looking for alternative moats.

gopher_space 105 days ago

Sometimes timing is your moat and that's all you need. That being said I'll probably start limiting my public releases to revolve around standards I want implemented.

I'm rethinking the sources of value moats are built around. It seems like the landscape is changing and dimensions such as location, perspective, experience, and attention weigh more than they used to.

> but how long before somebody simply saw that feature and replicated it?

This is a good example. The, idk, "value store" of your org just switched from products and services to the employees who understand your process from a couple angles and can write well.

neversupervised 105 days ago

Tell more. Crawling is not a new idea. How did they abuse you?

ekropotin 105 days ago

Please tell me you are joking
