Hacker News new | ask | show | jobs
by bityard 505 days ago
Please put a priority on making it hard to abuse the web with your tool.

At a _bare_ minimum, that means obeying robot.txt and NOT crawling a site that doesn't want to be crawled. And there should not be an option to override that. It goes without saying that you should not allow users to make hundreds or thousands of "blind" parallel requests as these tend to have the effect of DoSing sites that are being hosted on modest hardware. You should also be measuring response times and throttling your requests accordingly. If a website issues a response code or other signal that you are hitting it too fast or too often, slow down.

I say this because since around the start of the new year, AI bots have been ravaging what's left of the open web and causing REAL stress and problems for admins of small and mid-sized websites and their human visitors: https://www.heise.de/en/news/AI-bots-paralyze-Linux-news-sit...

10 comments

This is HN virtue signaling. Some fringe tool that ~nobody uses is held to a different, weird standard and must be the one to kneecap itself with a pointless gesture and a fake ethical burden.

The comparison to DRM makes sense. Gimping software to disempower the end user based on the desires of content publishers. There's even probably a valid syllogism that could make you bite the bullet on browsers forcing you to render ads.

Please don't.

Software I installed on my computer needs to the what I want as the user. I don't want every random thing I install to come with DRM.

The project looks useful, and if it ends up getting popular I imagine someone would make a DRM-free version anyway.

Where do you read DRM?

Parent commenter merely and humbly asks the author of the library to make sure that it has sane defaults and support for ethical crawling.

I find it disturbing that you would recommend against that.

Here's what the parent comment wrote.

> And there should not be an option to override that.

This is not just a sane default. This is software telling you what you are allowed to do based on what the rights owner wants, literally DRM.

This is exactly like Android not allowing screenshots to be taken in certain apps because the rights owner didn't allow it.

Not sure what "digital rights" that "manages"? I don't see it as an unreasonable suggestion that the tool shouldn't be set up out of the box to DoS sites it's scraping, that doesn't prevent anyone who is technical enough to know what they're doing to fork it and remove whatever limits are there by default? I can't see it as a "my computer should do what I want!" issue, if you don't like how this package works, change it or use another?
Digital Restrictions Management, then. Have it your way.
There are so many combative people on HackerNews lately who insist to misinterpret everything.

I really wonder if it's bots or just assholes.

This is software telling you what you are allowed to do based on what the software developer wants* (assuming the developers cares of course...). Which is how all software works. I would not want my users of my software doing anything malicious with it, so I would not give them the option.

If I create an open-source messaging app I am also not going to give users the option of clicking a button to spam recipients with dick pics. Even if it was dead-simple for a determined user to add code for this dick pic button themselves.

> I find it disturbing

Oh no, someone on the internet found something offensive!

Disturbing, not offensive - it is literally right there in the quote you have been so nice to pass along.
Who told you about DRM? It is an open source tool.

Simply requiring a code change and a rebuild is enough of a barrier to prevent rude behavior from most people. You won't stop competent malicious actors but you can at least encourage good behavior. If popular, someone will make a fork but having the original refuse to do stuff that are deemed abusive sends a message.

It is like for the Flipper Zero. The original version does not let you access frequency bands that are illegal in some countries, and anything involving jamming is highly frowned upon. Of course, there are forks that let you do these things, but the simple fact that you need to go out of your way to find these should tell you it is not a good idea.

I feel like you may have a misunderstanding of what DRM is. Talking about DRM outside the context of media distribution doesn't really make any sense.

Yes, someone can fork this and modify it however they want. They can already do the same with curl, Firefox, Chromium, etc. The point is that this is project is deliberately advertising itself as an AI-friendly web scraper. If successful, lots of people who don't know any better are going to download it and deploy it without a full understanding (and possibly caring) of the consequences on the open web. And as I already point out, this is not hypothetical, it is already happening. Right now. As we speak.

Do you want cloudflare everywhere? This is how you get cloudflare everywhere.

My plea for the dev is that they choose to take the high road and put web-server-friendly SANE DEFAULTS in place to curb the bulk of abusive web scraping behavior to lessen the number of gray hairs it causes web admins like myself. That is all.

It's exactly DRM, management of legal access to digital content. The "media" part has been optional for decades.

The comment they replied to didn't suggest sane defaults, but DRM. Here's the quote, no defaults work that way (inability to override):

> At a _bare_ minimum, that means obeying robot.txt and NOT crawling a site that doesn't want to be crawled. And there should not be an option to override that.

I'll also add something that I expect to be somewhat controversial, given earlier conversations on HN[0]: I see contexts in which it would be perfectly valid to use this and ignore robots.txt.

If I were directing some LLM agent to specifically access a site on my behalf, and get a usable digest of that information to answer questions, or whever, that use of the headless browser is not a spider, it's a user agent. Just an unusual one.

The amount of traffic generated is consistent with browsing, not scraping. So no, I don't think building in a mandatory robots.txt respecter is a reasonable ask. Someone who wants to deploy it at scale while ignoring robots.txt is just going to disable that, and it causes problems for legitimate use cases where the headless browser is not a robot in any reasonable or normal interpretation of the term.

[0]: I don't entirely understand why this is controversial, but it was.

> Talking about DRM outside the context of media distribution doesn't really make any sense.

It’s a cultural thing, and it makes a lot of sense. This fits with DRM culture that has walled gardens in iOS and Android.

i still wont forgive libtorrent for not implementing sequential access.

and also, xpdf for implementing the "you cant select text" feature

That would make it impossible to use this as a testing tool. How should automatic testing of web applications work, if you obey all of these rules? There is also the problem of load testing. This kind of stuff is by its nature of dual use, a load test is also a kind of DDOS attack.
Make it faster and furiouser.

There are so many variables involved that it’s hard to predict what it will mean for the open web to have a faster alternative to headless Chrome. At least it isn’t controlled by Google directly or indirectly (Mozilla’s funding source) or Apple.

If it's already a problem, nothing this developer does will improve it, including crippling their software and removing arguably legitimate use cases.
Nerfing developer tools to save the "open web" is such a fucking backward argument.
In 10 lines of code I could create a proxy tool that removes all your suggested guidelines so the scraper still operates. In other words. Not really helping.
Its literally open source, any effort put into hamstringing it would just be forked and removed lol
Any barrier to abuse makes abuse harder.
Not really lol, literally if you add a robots.txt check someone can just create a fork repo with a git action that escapes that routine every time the original is pushed... Adding options for filter and respecting things is great even as defaults, but trying to force "good behavior" tends to just lead to people setting up a workaround that everyone eventually uses because why use the hamstrung version instead of the open version and make your own choices.
Yes! Having long time ago done some minor web scraping, I did not put any work at all into following robots.txt, simply because it seemed like a hassle and I thought "meh it's not that much traffic is it and boss wants this done yesterday". But if the tool defaulted to following robots.txt I certainly wouldn't have minded, it would have caused me to get less noise and my tool to behave better.

Also, throttling requests and following robots.txt actually makes it less likely that your scraper will be blocked, so even for those who don't care about the ethics, it's a good thing to have ethical defaults.

This is why I’m making crawlspace.dev, a crawling PaaS that respects robots.txt, implements proper caching, etc by default.
Looking at the responses here, I'm glad I just chose to paywall to protect against LLM training data collection crawling abuse.[1]

[1]: https://lgug2z.com/articles/in-the-age-of-ai-crawlers-i-have...