HN Mirror


by biosed 1690 days ago

I used to lead Sys Eng for a FTSE 100 company. Our data was valuable, but only for a short amount of time. We were constantly scraped, which cost us in hosting etc. We even saw competitors use our figures (good ones used them to offset their prices, bad ones just used them straight). As the article suggests, we couldn't block mobile operator IPs; some had over 100k customers behind them. Forcing users to log in did little, as the scrapers just created accounts. We had a few approaches that minimised the scraping:

Rate Limiting by login,

Limiting data to known workflows ...
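As a rough illustration of rate limiting by login, here is a minimal token-bucket sketch keyed by account ID. The class name, capacity, and refill rate are all assumptions made for the example; the comment does not say how the original system worked.

```python
import time


class LoginRateLimiter:
    """Token bucket per account ID (hypothetical, not the commenter's system)."""

    def __init__(self, capacity=5, refill_per_sec=1.0, clock=time.monotonic):
        self.capacity = capacity              # assumed burst size
        self.refill_per_sec = refill_per_sec  # assumed sustained rate
        self.clock = clock                    # injectable so it can be tested
        self._buckets = {}                    # account_id -> (tokens, last_seen)

    def allow(self, account_id):
        """Return True if this account may make a request right now."""
        now = self.clock()
        tokens, last = self._buckets.get(account_id, (self.capacity, now))
        # Refill in proportion to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        if tokens >= 1:
            self._buckets[account_id] = (tokens - 1, now)
            return True
        self._buckets[account_id] = (tokens, now)
        return False
```

Keying on the account rather than the IP sidesteps the shared mobile-operator IP problem, though as noted above it does nothing against scrapers that simply register more accounts.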

But our most fruitful effort was when we removed limits and started giving "bad" data. By bad I mean altering the price up or down by a small percentage. This hit them in the pocket but, again, wasn't a silver bullet. If a customer made a transaction on the altered figure, we informed them and took it at the correct price.
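One way the "bad data" idea could be sketched: derive a small, repeatable skew per account and item, and always settle at the true price. Everything here is an assumption for illustration (the `SECRET` key, the 2% bound, the function names); the comment does not describe how the prices were actually altered or who received them.

```python
import hashlib
import hmac
from decimal import Decimal, ROUND_HALF_UP

SECRET = b"replace-me"       # hypothetical server-side key
MAX_SKEW = Decimal("0.02")   # assumed bound: at most +/-2%


def displayed_price(true_price, account_id, item_id):
    """Return a price skewed by a small, repeatable amount for this viewer."""
    msg = f"{account_id}:{item_id}".encode()
    digest = hmac.new(SECRET, msg, hashlib.sha256).digest()
    # Map the first 4 bytes of the digest to a fraction in [-1, 1].
    n = int.from_bytes(digest[:4], "big")
    fraction = Decimal(n) / Decimal(2**32 - 1) * 2 - 1
    skewed = true_price * (1 + fraction * MAX_SKEW)
    return skewed.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def settle(true_price, quoted_price):
    """Charge the true price; flag a mismatch so the customer can be told."""
    return {
        "charged": true_price,
        "notify_customer": quoted_price != true_price,
    }
```

Using a keyed hash instead of a random number means the same account sees the same figure on every request, so the skew cannot be averaged away by re-fetching the page.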

It's a cool problem to tackle but it is just an arms race.

4 comments

rootusrootus 1690 days ago

I know a guy at Nike that had to deal with a similar problem. As I recall, they basically gave in -- instead of trying to fight the scrapers, they built them an API so they'd quit trashing the performance of the retail site with all the scraping.

matheusmoreira 1690 days ago

Yes. That's exactly what everyone should do.

cbsmith 1689 days ago

Well, not EXACTLY. The exact thing to do is just WebSub/PuSH. No need to invent your own thing and hope that bots learn how to use it properly.

matheusmoreira 1689 days ago

Agreed. What I mean is people need to stop fighting these pointless battles.

echelon 1690 days ago

If data is your competitive advantage or product, then what? Accept that your market no longer exists and that there's no way to stop theft?

ianbutler 1690 days ago

You're going to need to explain how scraping publicly available information on a website is theft.

If information is your competitive advantage maybe you shouldn't have it on a publicly accessible website, and should instead stick it behind an API with pay tiers and a very clear license regarding what you may do with it as an end user.

Note, a simple sign up being required to view a website makes it not publicly available information any longer and you can cover usage, again, in a license.

Then you have a whole bunch of legal avenues you can use to protect your work. Assuming you can afford it that is.

emodendroket 1690 days ago

How practical is this really though? Like, imagine you're a newspaper. Unless you're the FT or Wall Street Journal or something like that, nobody is making an account to read an article. They'll just go somewhere else.

simondotau 1690 days ago

> You're going to need to explain how scraping publicly available information on a website is theft.

Seriously? Do I need to explain why a song doesn’t enter the public domain when it is played on the radio?

matheusmoreira 1690 days ago

Do I need to explain that copyright is practically unenforceable in the 21st century? Data is trivially copied and there's nothing you can do to fight that, no amount of laws will ever make it non-trivial again. Even if you successfully sue somebody for this, it won't stop them.

At some point people are gonna have to accept this.

throwawaygh 1690 days ago

OP was talking about price lists. IANAL but AFAIK you can't copyright a list of prices.

Grimm1 1690 days ago

No, but those are substantively different situations, such that this exact thing is being argued in the highest courts of the US. It's not quite the clear-cut case you seem to believe it to be.

cbsmith 1689 days ago

No, but there is a legit philosophical argument about theft when it comes to copyright. There are two ways to look at theft: acquiring something you didn't earn vs. someone losing something they did earn. Generally, we tend to focus on the latter. From that perspective, "copying" is really not "theft", and arguably "copyright" does more net societal harm than any benefit it provides.

achillesheels 1690 days ago

It is copyrighted information, no? So technically it is intellectual property theft if the scraping use is for commercial purposes.

rmbyrro 1690 days ago

Not all information falls under copyright.

If you build a database of touristic places and display it on your website, the information is not protected by copyright.

In Europe they have laws covering _sui generis database rights_, but they are from another era and unenforceable nowadays.

Grimm1 1690 days ago

No? If you place information publicly on a website it's pretty much fair game, no copyright violation, especially regarding user-generated information. That's my take, but legally it's a gray area and it's still going back and forth in the courts (at least in the US). For a while, before a decision was vacated by the Supreme Court, scraping publicly available information on a site was legally protected, seemingly in line with my thoughts on it.

chadwittman 1690 days ago

The real Jedi move

wrycoder 1690 days ago

Especially if you charge for it, which would save them money, because they wouldn't have to redo their code every time you changed your website.

gonzo41 1690 days ago

I think there's an opportunity for a new JS framework to have something like a randomly generated DOM that will always display the page and elements the same to a human but constantly break paths for computers.

Like displaying a table with semantic elements, then divs, then using an iframe with css grid and floating values over the top.

This almost seems like a problem for AI to solve.

blueboo 1690 days ago

Even if your DOM is obfuscated, the rendered page remains vulnerable to OCR. Obfuscate the rendered pixels and you’ll annoy your humans and eventually find that the scrapers’ OCR is superhuman.

Still, maybe AI comes into it. Maybe poisoning the data is the right way to do it conditioned on ML-juiced anomaly detection.

Lifelarper 1689 days ago

PDFs and print newspapers are still a massive pain in the ass to OCR accurately

rootusrootus 1690 days ago

To some extent those already exist and I get annoyed by them when they cause 1Password to be useless on their login page. But it probably would help with algorithmic scraping.

zarzavat 1690 days ago

This is already common. It's mildly annoying for scrapers but generally a waste of time since you can usually still orient yourself based on the content of the nodes.

ghusbands 1690 days ago

This would have huge accessibility issues, breaking screen readers and the like.

kall 1689 days ago

We already have react-native-web (<3), so we have that covered.

endymi0n 1690 days ago

> It's a cool problem to tackle but it is just an arms race.

Plus, it's one you're going to lose. I was once asked at an All-Hands why we don't defend ourselves against bots even more vigorously.

My answer was: "Because I don't know how to build a publicly available website that I could not scrape myself if I really wanted to."

wolverine876 1690 days ago

> But our most fruitful effort was when we removed limits and started giving "bad" data. By bad I mean alter the price up or down by a small percentage. ... If the customer made a transaction on the altered figure we informed them and took it at the correct price.

Is that legal? It would be a big blow to trust if I was the customer, but that's without knowing what you were selling and in what market.

killingtime74 1690 days ago

It’s legal if it’s in the contract. It's standard for contracts to allow for mistakes and confirmations of prices.

kwhitefoot 1690 days ago

It's not a mistake if you do it deliberately!

killingtime74 1690 days ago

Yes (not saying it's a mistake), but a confirmation step can be put in the contract; no law says you only get one chance to display the price.

ransom1538 1690 days ago

I love the honey pot approach. Put tons of valuable-looking hrefs on the page that are invisible (CSS) and that the scraper would find. Then just rate limit that IP address and randomize the data coming back. Profit.
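A minimal sketch of that honey pot, assuming made-up trap paths and a 5% noise band (neither comes from the comment). It only shows the flag-and-randomize part; rate limiting the flagged IP would sit alongside it.

```python
import random


class Honeypot:
    """Flag clients that follow hidden links, then feed them noisy prices."""

    # Hypothetical trap URLs that no human would ever see or click.
    TRAP_PATHS = ("/deals/archive-7f3a", "/prices/full-export")

    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self._flagged = set()

    def trap_links_html(self):
        """Markup to embed in pages; hidden from sighted users and screen readers."""
        return "".join(
            f'<a href="{p}" style="display:none" tabindex="-1" aria-hidden="true">more</a>'
            for p in self.TRAP_PATHS
        )

    def is_flagged(self, ip):
        return ip in self._flagged

    def price_for(self, ip, path, true_price):
        """Return the price to serve for this request."""
        if path in self.TRAP_PATHS:
            self._flagged.add(ip)
        if ip in self._flagged:
            # Assumed noise band of +/-5% for flagged clients.
            return round(true_price * self.rng.uniform(0.95, 1.05), 2)
        return true_price
```

Flagging by IP inherits the shared-IP problem raised at the top of the thread, so a real deployment would probably want to key on the session or account as well.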

histriosum 1689 days ago

I think this falls into the "arms race" trap, though. If you can make an href invisible via CSS, then the scraper can certainly be written to understand CSS, and thus filter out the invisible hrefs.
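To make the counter-move concrete, here is a small sketch of a scraper-side filter using only the standard library. It is deliberately naive: it checks the `hidden` attribute and inline `style` only, which is an assumption about how the links are hidden. Links hidden through external stylesheets or scripts would need a real rendering engine.

```python
from html.parser import HTMLParser


class VisibleLinkParser(HTMLParser):
    """Collect hrefs from <a> tags that are not obviously hidden inline."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        # Normalise the inline style so "display: none" and "display:none" match.
        style = (a.get("style") or "").replace(" ", "").lower()
        hidden = (
            "hidden" in a
            or "display:none" in style
            or "visibility:hidden" in style
        )
        if a.get("href") and not hidden:
            self.links.append(a["href"])


def visible_links(html):
    """Return the hrefs a naive CSS-aware scraper would still follow."""
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links
```

Each side can keep escalating from here (stylesheet rules, off-screen positioning, headless browsers), which is the arms race the thread describes.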