| HN Mirror

	by jpablo 540 days ago
	They aren't blocking anything. They are just asking nicely not to be crawled. Given that AI companies haven't cared one bit about ripping off other people's data, I don't see why they would care now.
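
For illustration, assuming the "asking nicely" here refers to a robots.txt file (the comment does not name the mechanism), a minimal Python sketch shows why such a rule is a request rather than a block: the check runs on the crawler's side, so only a crawler that chooses to run it is affected. The rules, the bot name `ExampleBot`, and the URL are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks every crawler to stay away entirely.
DISALLOW_ALL = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def polite_fetch_decision(robots_txt: str, user_agent: str, url: str) -> str:
    """A well-behaved crawler consults the rules before fetching.

    Nothing on the server enforces this; a crawler that skips the check
    can still request the page.
    """
    if is_allowed(robots_txt, user_agent, url):
        return "fetch"
    return "skip"
```

With the rules above, `polite_fetch_decision(DISALLOW_ALL, "ExampleBot", "https://example.com/some/page")` returns `"skip"`, but only because the caller bothered to ask.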

5 comments

wing-_-nuts 540 days ago

A number of sites have started outright blocking any traffic that looks remotely suspicious. This has made browsing with a VPN a bit of a pain.

pixl97 540 days ago

This has been steadily increasing for years now. Bots, attacks, scrapers, AI: all these things seem to make up the majority of traffic on most sites.

superluserdo 540 days ago

I wish I could go back to the days of doing almost anything at all without having to tell a server what a motorbike or traffic light is.

wing-_-nuts 539 days ago

LPT: switch to the audio captcha. Yes, it takes a bit longer than if you did one grid captcha perfectly, but I never have to sit there and wonder if a square really has a crosswalk or not, and I never wind up doing more than one.

EVa5I7bHFq9mnYK 540 days ago

In their attempt to block OpenAI, they block me. Many sites that were accessible just 2 years ago now require a login, captchas, or a rectal exam just to read the content.

ammanley 540 days ago

I'm looking forward to the life experience that is content I want to read badly enough to endure a rectal exam.

EVa5I7bHFq9mnYK 540 days ago

It's not that bad ...

fennecbutt 532 days ago

Not sure why you're being downvoted. Watching str8 bois react with shock and horror at the idea of anything near their butt is hilarious.

Prostate and rectal cancer are real, boys. Grow tf up about it.

josu 540 days ago

> captchas

I suspect that AIs are already more effective than humans at passing captchas.

EVa5I7bHFq9mnYK 540 days ago

That would be an example of AI providing real value that I would pay for.

heavyset_go 540 days ago

These exist, for a fee, if you want to use them.

EVa5I7bHFq9mnYK 539 days ago

I used 2captcha, for a fee ... it doesn't work

kjkjadksj 540 days ago

They block plenty, and they do it crudely. I get suspicious-traffic bans from Reddit all the time. It's trivial enough to route around by switching user agent, however, which goes to show that any crawling-bot writer worth their salt already routes around the BS on Reddit and most other sites by now. I'm just the one getting the occasional headache because I use Firefox and block ads and site tracking, I guess.

njovin 540 days ago

Wouldn't it be somewhat trivial to set up honeypots?
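
A rough sketch of the idea, under assumptions not stated in the comment: the site exposes a trap URL that no human visitor would normally follow (for example, a hidden link that robots.txt asks crawlers to avoid), and any client that requests it gets flagged. The trap path and the `HoneypotTracker` name are hypothetical, and a real deployment would also have to deal with shared IPs and browser link prefetching.

```python
class HoneypotTracker:
    """Flags clients that request a trap path (illustrative only)."""

    def __init__(self, trap_paths):
        # Paths that legitimate visitors should never request.
        self.trap_paths = set(trap_paths)
        # Client identifiers (e.g. IP addresses) caught touching a trap.
        self.flagged = set()

    def observe(self, client_id: str, path: str) -> bool:
        """Record one request; return True if the client is now flagged."""
        if path in self.trap_paths:
            self.flagged.add(client_id)
        return client_id in self.flagged


# Example: one hypothetical trap URL.
tracker = HoneypotTracker(["/do-not-crawl/trap.html"])
```

Once `tracker.observe(client_id, path)` returns True for a client, the site could throttle or block it. What to do with flagged clients is a policy choice this sketch leaves open.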

jaybna 540 days ago

Yeah, probably right. If you want a great rabbit hole, look up "Common Crawl" and see how a great academic project was absolutely hijacked for pennies on the dollar to grab training data - the foundation for every LLM out there right now.

CamperBob2 540 days ago

It's hard to envision a greater success for the "great academic project" than what happened. I mean, what else were they trying to accomplish?

jaybna 540 days ago

It was meant to be an open-source compilation of the crawled internet so that research could be done on web search, given how opaque Google's process is. It was NOT meant to be a cheap source of data for for-profit LLMs to train on.

*edit: added "for-profit"

CamperBob2 539 days ago

(Shrug) Multiple not-for-profit LLMs have trained on it as well.

If something I worked on turned out to play a significant part in something that turned out to be that big a deal, I'd be OK with it. And nobody's stopping people from doing web-search studies with it, to this day.
