by yxdfasdjkljasdf 3902 days ago
That is not how HTTP works; your analogy is not correct.

Nobody is taking anything. If you don't want someone to access your page, then don't respond to their request.


Since there's no easy way to always reliably identify the requester, this gets complicated.

Most scrapers - including this one - advertise how they use multiple servers/locations/IPs/etc. to get around this.

I fail to see the problem you are trying to present.

Even if identification were hard, which it is not because of how HTTP works, it would be irrelevant because HTTP doesn't discriminate. If someone wants to discriminate, that is their problem and should be solved by them, not by a committee or a law.

> If you don't want someone to access your page, then don't respond to their request

> there's no easy way to always reliably identify the requester

That's the problem: you can't identify the person to block them in the first place.

Robots.txt is an explicit signal of intent that reputable search engines honor, but it's all we have today. It is easily ignored and does not work against these scrapers or anyone else who chooses to disregard it.
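To make the "signal of intent" point concrete, here is a minimal sketch of how a cooperating crawler could consult robots.txt before fetching, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the `allowed` helper are made up for illustration. Nothing here enforces anything: a scraper that skips this check is not stopped by it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The check is purely voluntary on the crawler's side.
    print(allowed("ExampleBot", "https://example.com/index.html"))
    print(allowed("ExampleBot", "https://example.com/private/data"))
```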

Not sure what your last sentence means.

At a high enough frequency, scraping is indistinguishable from a DDoS attack. Do you believe DDoS attacks are OK? How do you draw the line?

DDoS attacks are malicious events that disrupt service. In almost 100% of cases, scrapers don't want to disrupt service, because they need the data they're scraping. They want to be able to continue to get it, so they won't do things that may harm their ability to do that (including presenting honest IPs and user agents).

Services like this one actually make scraper-related unavailability, which IMO is already greatly exaggerated, less likely, since there will be fewer amateurs trying to write their own bots and accidentally breaking things.

To the extent that a scraper harms the other business, the scraping company can be held civilly liable on several accounts without specifically bringing scraping as a practice into the picture. All that matters is that they damaged the target site's ability to operate, not that they were saving [portions of] the pages (that'd be a separate copyright claim, unrelated to the disruption of service).

There is a clear distinction between the two. You are presenting a straw-man argument.

You haven't quite laid out your argument so I have to guess what it is.

When you say "That is not how HTTP works" it suggests that your claim is that anything that HTTP allows is ethically OK to do. However that is clearly a ridiculous stance, since a DDoS attack is a stream of valid HTTP requests and that's clearly not OK.

So I'm left wondering what your argument actually is for why unwelcome scraping is ethically OK.

I find this an interesting question, because while I would love for protocols to also define ethics, I feel that would be scope creep for the poor protocol designers. There's a wide variety of conduct and ethics questions that a protocol cannot address.

Where I myself draw the line is at protocol behavior intentionally designed to obscure my intentions. For example, sending my requests from a wide variety of IP addresses is behavior that is specifically designed to obscure where I'm coming from; my only intent in doing so would be to circumvent the serving machine's intent not to provide lots of content to a single requester. At that point I'm engaging in deceptive behavior; I've crossed an ethical line.
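As a rough illustration of what "the intent of the serving machine" can look like in practice, here is a minimal sketch of a per-IP sliding-window rate limiter. The class name, the limits, and the addresses (taken from documentation-only ranges) are all assumptions for the example, not anything a particular site is known to run. It shows why spreading requests over many addresses slips past this kind of limit.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Allow at most max_requests per client IP within a sliding time window."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = PerIpRateLimiter(max_requests=3, window_seconds=60.0)
    # Ten requests from one address: only the first three get through.
    single = sum(limiter.allow("203.0.113.1", float(t)) for t in range(10))
    # The same ten requests spread over ten addresses: all get through.
    spread = sum(limiter.allow(f"198.51.100.{i}", float(i)) for i in range(10))
    print(single, spread)
```

The limiter only ever sees an address, not a requester, which is the gap the multi-IP approach relies on.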

> When you say "That is not how HTTP works" it suggests that your claim is that anything that HTTP allows is ethically OK to do. However that is clearly a ridiculous stance, since a DDoS attack is a stream of valid HTTP requests and that's clearly not OK.

That wasn't a response made to your comment, and you are mixing two different arguments there. Your guess is not correct.

> So I'm left wondering what your argument actually is for why unwelcome scraping is ethically OK.

I never even suggested such an argument.

The behavior you described in the last paragraph is deceptive only in the eyes of a state actor that surveils information and privacy. Anonymity is not unethical; it is a human right.