

by bityard 512 days ago

I feel like you may have a misunderstanding of what DRM is. Talking about DRM outside the context of media distribution doesn't really make any sense.

Yes, someone can fork this and modify it however they want. They can already do the same with curl, Firefox, Chromium, etc. The point is that this project is deliberately advertising itself as an AI-friendly web scraper. If successful, lots of people who don't know any better are going to download it and deploy it without fully understanding (or possibly caring about) the consequences for the open web. And as I already pointed out, this is not hypothetical, it is already happening. Right now. As we speak.

Do you want Cloudflare everywhere? This is how you get Cloudflare everywhere.

My plea to the dev is that they take the high road and put web-server-friendly SANE DEFAULTS in place, to curb the bulk of abusive web scraping behavior and reduce the number of gray hairs it causes web admins like myself. That is all.
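To make "sane defaults" concrete, here is a rough sketch of what I mean, using Python's standard `urllib.robotparser`. The names `make_policy`, `respect_robots`, `DEFAULT_CRAWL_DELAY`, and the `example-bot` user agent are made up for illustration and have nothing to do with the project's actual code; the one-second fallback delay is just an assumed value.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical fallback delay (seconds) when robots.txt sets no Crawl-delay.
DEFAULT_CRAWL_DELAY = 1.0


def make_policy(robots_lines, user_agent="example-bot", respect_robots=True):
    """Build (allowed, delay) helpers from the lines of a robots.txt file.

    respect_robots defaults to True: the polite behavior is what you get
    unless you deliberately switch it off.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)

    def allowed(url):
        # Only skip the robots.txt check when the caller opted out.
        if not respect_robots:
            return True
        return parser.can_fetch(user_agent, url)

    def delay():
        # Prefer the site's own Crawl-delay; otherwise use the fallback.
        site_delay = parser.crawl_delay(user_agent)
        return float(site_delay) if site_delay is not None else DEFAULT_CRAWL_DELAY

    return allowed, delay


robots = ["User-agent: *", "Disallow: /private/", "Crawl-delay: 5"]
allowed, delay = make_policy(robots)
print(allowed("https://example.com/private/page"))
print(allowed("https://example.com/public"))
print(delay())
```

With something like this, anyone who just downloads and runs the tool gets the polite behavior, and ignoring robots.txt takes a deliberate choice.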

2 comments

randunel 512 days ago

It's exactly DRM, management of legal access to digital content. The "media" part has been optional for decades.

The comment they replied to didn't suggest sane defaults, but DRM. Here's the quote; no default works that way (a default can be overridden, this can't):

> At a _bare_ minimum, that means obeying robot.txt and NOT crawling a site that doesn't want to be crawled. And there should not be an option to override that.


samatman 512 days ago

I'll also add something that I expect to be somewhat controversial, given earlier conversations on HN[0]: I see contexts in which it would be perfectly valid to use this and ignore robots.txt.

If I were directing some LLM agent to specifically access a site on my behalf, and get a usable digest of that information to answer questions, or whatever, that use of the headless browser is not a spider, it's a user agent. Just an unusual one.

The amount of traffic generated is consistent with browsing, not scraping. So no, I don't think building in a mandatory robots.txt respecter is a reasonable ask. Someone who wants to deploy it at scale while ignoring robots.txt is just going to disable that, and it causes problems for legitimate use cases where the headless browser is not a robot in any reasonable or normal interpretation of the term.

[0]: I don't entirely understand why this is controversial, but it was.


benatkin 512 days ago

> Talking about DRM outside the context of media distribution doesn't really make any sense.

It’s a cultural thing, and it makes a lot of sense. This fits with DRM culture that has walled gardens in iOS and Android.
