by jastingo 847 days ago
What are the legitimate (i.e. legal) use cases for a product such as this?

I agree with another comment that called this "Abuse as a Service". It seems to me this product's design goal is nothing more than to circumvent measures site owners take to prevent abuse of their site and run a sustainable business.

11 comments

What about something like Nitter? Archiving? Adversarial bridging between different platforms? Automation?

How will well-behaved scrapers undermine the sustainability of a business? I guess adblocking is one, but we can already do that with uBlock and that's legal. Or adversarial bridging, but that only serves to boost competition.

In other words, the question is flipped; why would well-behaved (i.e. non-DDoSing) scrapers be illegal?

I think you're conflating automation and intentional avoidance of bot detection as part of automation. The issue I have is not that this service allows users to automate browsing activities. The issue is that this service deliberately tries to circumvent being detected as automating browser activities because websites are trying to prevent bots. There are LOTS of services that allow users to create automations without disguising themselves. If you are using well-behaved scrapers that respect TOS then you shouldn't have to use a service like this.

Nitter is an example of a service that explicitly disrupts Twitter/X's way to make money. If they can't make money then they can't provide the service, there would be no Twitter/X, and hence no Nitter. Of course they would try to prevent that kind of behavior and it should be obvious why. Resorting to using a service like this in order to continue using Nitter should raise some alarm bells. Sure you can still do it and rationalize it however you want, but you have to acknowledge you're trying to get the value of the service without paying for it.

Perhaps there are cases where there is a dissonance between a website's TOS and how they are blocking bot traffic? That sounds like a valid gripe. Otherwise, I don't buy the argument.

That's fair enough. I think that falls under similar arguments to adblocking; it's against ToS, and affects the revenues of ad-supported businesses, but it seems like the popular view is to use it regardless.
Legality isn't the question here. If you want to speak to the legality, circumventing a robots.txt that explicitly lists your bot's user-agent with 'disallow: *' is unauthorized access (I imagine it's more nuanced for 'user-agent: *'). No website is required to allow anyone to visit, and it can discriminate against any client or software any way it wants.
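
For anyone who wants to see what "checking robots.txt" looks like in practice, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the bot name `ExampleBot`, and the example.com URLs are all made up for illustration; note that the conventional way to disallow a whole site is `Disallow: /`. This only shows how a well-behaved client could check the rules, and says nothing about the legal question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one bot is banned outright by name,
# everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The named bot is disallowed everywhere.
print(parser.can_fetch("ExampleBot", "https://example.com/page"))
# Other bots fall back to the "*" rules.
print(parser.can_fetch("OtherBot", "https://example.com/page"))
print(parser.can_fetch("OtherBot", "https://example.com/private/x"))
```

A scraper that wants to honor the file would call `can_fetch` with its own user-agent before each request and skip anything that comes back `False`.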
> Legality isn't the question here.

The question was literally,

> What are the legitimate (i.e. legal) use cases for a product such as this?

I've got a couple of things I've used browser automation tools for:

• I want to automate (or at least semi-automate) downloading bank statements. I've got ~14 accounts (checking, savings, credit card, IRA, investment, HSA) across 7 financial institutions.

It's tedious to go download statements from all of them manually.

• I want to save stories from FanFiction.net (FFN) for offline reading. FFN's terms allow automation as long as it doesn't operate faster than a human [1].

[1] From their TOS:

> You agree not to use or launch any automated system, including without limitation, "robots," "spiders," or "offline readers," that accesses the Service in a manner that sends more request messages to the FanFiction.Net servers in a given period of time than a human can reasonably produce in the same period by using a conventional on-line web browser.
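
One way to stay inside a "no faster than a human" rule like that is to put a minimum delay between requests. This is only a sketch: the `Throttle` class and the 5-second interval are my own assumptions, not anything FFN specifies, and what counts as "human speed" is a judgment call.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive requests.

    The clock and sleep functions are injectable so the logic can be
    exercised without actually waiting.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until the next request is allowed; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay


# Assumed pace: at most one page every 5 seconds.
throttle = Throttle(min_interval=5.0)
```

Calling `throttle.wait()` right before each page fetch keeps the request rate at or below one per interval, however fast the rest of the script runs.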

> I want to automate (or at least semi-automate) downloading bank statements. I've got ~14 accounts (checking, savings, credit card, IRA, investment, HSA) across 7 financial institutions.

Could you not shoot an email to those institutions asking for a copy of the documents?

Not OP, but I do the same for ~7 accounts across 5 institutions. There's no need to contact them since you can manually download the statements, but it's a chore if you're doing it frequently. I usually run my script a few times a week.
> Could you not shoot an email to those institutions asking for a copy

They’ll respond within a few days, asking me to log into some web portal to prove that I am me, and then we’re back where we started.

I'm genuinely scraping a certain social network that doesn't have an accessible API to do what I need. My user is logged in, and I just automate the logged-in browser to visit the pages and dump the data I need into a console.

If there was an accessible API to do what I need, I wouldn't do this because scraping sucks. I have to write 100 JavaScript edge cases to handle all the times the host's servers fail in very weird ways. Plus, walking DOMs on these shitty sites with 10,000 nested divs is not fun. GPT helps with this.

It's net-positive for the host though, as I upload a lot of valuable content that their users genuinely like, but it sucks that I have to be sneaky to get the data I need.

I used their previously available bot detection defeat to add an import feature to my website: users could link to their creation on another site, and my site would scrape the publicly available content so they wouldn't need to re-enter all their data.

I've used their product many times actually, and I'm shocked that on Hacker News, of all places, no one's thinking of anything besides abuse. How often is it useful to get information from a webpage and apply it in a new context? Then think of how often said webpage is behind a Cloudflare bot detector.

If it's the user's data, then under GDPR the other site is obligated to provide a way for them to download/transfer it, specifically with this use case in mind.

They are completely in the right to block you though, you're not the owner of that data, you might be breaking their TOS.

“In exercising his or her right to data portability pursuant to paragraph 1, the data subject shall have the right to have the personal data transmitted directly from one controller to another, where technically feasible.”

They’re not necessarily in the right to block you, if you’re the data subject or acting on their behalf.

This is a non sequitur to my comment:

- GDPR doesn't require it be a convenient export. Users want to paste a link on my site, click a button, and have it magically appear. Not fill out a form, dump their entire account and sift through that.

- I never opined on the validity of blocking bots

- I never opined on if it's breaking their TOS

Abuse implies harm. Giving users a quick import option from already public data isn't harmful.

> If it's the user's data, then under GDPR the other site is obligated to provide a way for them to download/transfer it, specifically with this use case in mind.

In Europe, if the company is actually following the law, in theory yes.

> They are completely in the right to block you though, you're not the owner of that data, you might be breaking their TOS.

IANAL, but AIUI that's definitely not true in the United States and I suspect similar ideas hold elsewhere: https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

There is a litany of posts on this very site that detail why HiQ vs LinkedIn is more nuanced than you're making it out to be. HiQ didn't ultimately have the slam dunk win that people think they did.
You sell jastingo™ brand widgets, but you notice fakes are being sold on eBay, Amazon, AliExpress. You set up a scraper to search for jastingo widgets every day on every marketplace site, but you get blocked. So now you need an unblocker to enforce your copyright/trademark/patent.
Why does it need to be that complicated? If marketplaces are selling fakes, get your lawyer to send them a letter.
What if you have 10 brands, with 10 products each, and there are 10 marketplaces?
Build a service that helps companies automate sending legal letters to marketplaces.
And that service will very likely be automatically scraping different marketplaces to detect the fake products each time they pop up again.
Bots acting on behalf of users should not be blocked, but we have spent several decades treating bots (except for the googlebot) as bad.

Like if I want to programmatically unsubscribe from a subscription, why should I have to do it myself?

That's a bad example: "programmatically unsubscribing" means giving spammers information that this address is alive. A much better solution is to report the unwanted email as spam, so the sender's reputation takes a hit.

(and for that 1% of the cases where the sender is not a spammer and the user knows it, they can just hit "unsubscribe" manually)

I’m talking about subscription services a user signed up for at one time.
That's a bad example, there's already the List-Unsubscribe header.
Subscription services like Netflix, not emails.
I think they should introduce request rate limits per IP/domain, for example max 1 parallel request. In this case there will be no significant load, but the data can be scraped.
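
To make the "max 1 parallel request" idea concrete, here is a rough sketch of a per-client concurrency limiter a site could put in front of its request handler. The `PerClientLimiter` class and the idea of keying on the client IP string are assumptions for illustration, not how any particular server or CDN does it.

```python
import threading
from collections import defaultdict


class PerClientLimiter:
    """Cap the number of in-flight requests per client key (e.g. an IP)."""

    def __init__(self, max_parallel=1):
        self.max_parallel = max_parallel
        self._lock = threading.Lock()
        self._active = defaultdict(int)  # client key -> in-flight count

    def try_acquire(self, client_ip):
        """Return True if the request may proceed, False if over the cap."""
        with self._lock:
            if self._active[client_ip] >= self.max_parallel:
                return False
            self._active[client_ip] += 1
            return True

    def release(self, client_ip):
        """Mark one request from this client as finished."""
        with self._lock:
            if self._active[client_ip] > 0:
                self._active[client_ip] -= 1
            if self._active[client_ip] == 0:
                del self._active[client_ip]  # keep the table small


limiter = PerClientLimiter(max_parallel=1)
```

A handler would call `try_acquire` on entry, answer with something like HTTP 429 when it returns `False`, and call `release` in a `finally` block once the response is sent.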

Scraping is important, for example, to monitor competitors' prices and spot opportunities to raise your own.

And let's not forget that Google does a lot more scraping than anyone else and has ridiculous profits from it.

Scanning for malicious and phishing websites. These types of sites are just enjoying the ease of free services like Cloudflare to block automated analysis tools and tailor their phishing campaigns to very specific geographical locations and user groups.
I'm not a customer, but I have a use case that in my opinion should be legal.

For years I've used my own terminal UI player (di-tui) for di.fm. At some point in the not-so-recent past, di.fm added Cloudflare's WAF, which prevents me from using one of my app's features: managing channel favorites within the app.

To be clear, I'm a paying di.fm customer, and my app only works for paying customers. But now my preferred method of listening to di.fm is slightly hamstrung because Cloudflare's WAF sits between me and a little string token available to every browser that accesses di.fm (even non-paying customers).

For some stuff I use a similar self-hosted solution: detecting when kids' lessons become available on a local portal. But to be fair, the cheapest option here ($200) isn't usable for non-business usage.

PS: context on why I need automation for such a thing: those lessons are really popular and are announced at unpredictable times, and there might be another spot when someone resigns.

> What are the legitimate (i.e. legal) use cases for a product such as this?

Data portability! Tools like this can be used to allow individuals to export their data from hostile web services trying to hold it hostage.

Legal in the EU, with GDPR.