Defeating AI scraping by rethinking webpage rendering


	2 points by exodys 147 days ago
	Consider the idea of someone creating a project that rendered webpages as images and sent those over the web, updating them whenever an input is received, much like a video game input loop. If everything were rendered server side, how difficult would scraping be? The idea of an un-copyable webpage is enticing, assuming that you would not like your data scraped. I know computer vision is a thing, but might its error rate be high enough to make scraping impractical?

4 comments

bryanrasmussen 147 days ago

Now, the second question: what do you think the performance would be of a page rendered to one big image and downloaded for display?

Web pages can render in pieces, images not so much, at least not the way web pages can. What is the resolution of a web page? It really depends on the browser and the OS: some web pages render in really high definition because that is what the OS allows (Macs, for example), and many browsers nowadays have more color spaces available than just RGB. So if your site uses more advanced color spaces, are you going to render to an RGB image, meaning that your customers get designs that pop less with your solution than with the browser? Or are you going to render at the highest resolution and color depth possible, meaning the images are going to be even bigger and even harder to download?

Are you going to render multiple resolutions so that you can give the correct one to each user agent and save on bandwidth, at the cost of doing more renders on the server and having your customer pay for them?
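
To put rough numbers on the resolution and bandwidth questions above, here is a back-of-the-envelope sketch. It assumes uncompressed frames; the function name `raw_frame_bytes` and its parameters are made up for illustration, and real PNG/JPEG/WebP output would be smaller by an amount that depends heavily on the page content.

```python
def raw_frame_bytes(css_width, css_height, device_pixel_ratio=1.0,
                    channels=3, bits_per_channel=8):
    """Uncompressed size in bytes of one full-page frame.

    css_width / css_height are the viewport size in CSS pixels; the
    device pixel ratio scales both dimensions, so the pixel count grows
    with its square. All names here are illustrative assumptions.
    """
    px_w = round(css_width * device_pixel_ratio)
    px_h = round(css_height * device_pixel_ratio)
    return px_w * px_h * channels * bits_per_channel // 8


if __name__ == "__main__":
    # Same 1280x720 viewport served to three hypothetical clients.
    for dpr, bits in [(1.0, 8), (2.0, 8), (2.0, 16)]:
        size = raw_frame_bytes(1280, 720, dpr, bits_per_channel=bits)
        print(f"dpr={dpr} bits/channel={bits}: {size / 1e6:.1f} MB raw")
```

A 1280x720 viewport at a device pixel ratio of 1 with 8-bit RGB is 2,764,800 bytes before compression; at a ratio of 2 it is four times that, and a deeper color format doubles it again. Every input that changes the page means sending another frame, or at least the changed region of one.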

What is the caching behavior here?

I believe the performance of this solution would by necessity be sub-optimal. Nobody likes sub-optimal performance on the web, because almost all web development is entertainment development, and people won't accept poor performance in their entertainment.

https://medium.com/luminasticity/on-premature-optimization-i...


bryanrasmussen 147 days ago

I believe you would run into accessibility laws that would make your project extremely illegal.

on edit: actually probably not illegal, that would be the wrong word; "extremely open to financially ruinous lawsuits" would be the correct phrasing.


exodys 147 days ago

I guess it would have to provide audio in order to be accessible, and if a user is both blind and deaf... well, do most people prepare applications for users who are both blind and deaf?


bryanrasmussen 147 days ago

In the EU and the US there are laws requiring that things be accessible, and you can be sued if yours are not; the EU law is quite new, the US one is not. One of the things that goes into deciding how much you are fined is how inaccessible you are, and an image would not be accessible at all.

If you are rendering a single image of a page, does this page have interactive parts? How are you planning on people actually interacting with the image? If you have solved it for sighted people interacting with the graphical rendering of your application, you also have to provide solutions for people who are not sighted, who have mobility issues, or who have combinations of the two...

If you have provided an image that is not at all accessible, how much of that is because you have no understanding of accessibility, and how much is "screw the disabled" thinking? That would also affect how much you get fined.

When you first get fined and the plaintiff gets awarded money because your page is inaccessible (this would be the US), it's not over. There are still disabled people who want to use your application and can't, and they will sue too, and you get fined more and more because you got fined once and kept up your behavior. Sooner or later you might receive a running court order: make your stuff accessible or pay this fee until it is. That "sooner or later" outcome is probably what would happen in the EU: you have until this date to make your site accessible, or you get fined a lot.

Since your solution is not architected for accessibility at all, you will need to put a lot of work into it.

Finally, providing audio may or may not be considered good enough, depending on a lot of things. But most disabled people use screen readers that interact with the DOM and the Accessibility Object Model (depending on the version of the application; figure on as big a variation between screen readers as between IE 6, IE 9, Safari from 2 years ago, the newest Chrome, and Firefox). If you provide something that they cannot use with their preferred accessibility tool, I would bet it wouldn't be considered good enough.
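
To make the DOM point concrete, here is a minimal sketch of what a text-based tool can pull out of a page. It is only a crude stand-in: real screen readers work from the browser's accessibility tree through platform APIs, not from a raw HTML parse, and the class name `ExposedText` and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class ExposedText(HTMLParser):
    """Collects text nodes and <img> alt text from an HTML string.

    A rough approximation of "what is there to read", nothing more.
    """

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # For images, the alt attribute is the only readable content.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append(alt.strip())

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def exposed_text(html):
    parser = ExposedText()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical pages: ordinary markup versus one server-rendered frame.
normal_page = "<h1>Prices</h1><p>Widget: $5</p><button>Buy</button>"
image_page = '<img src="/frame/123.png">'

print(exposed_text(normal_page))  # ['Prices', 'Widget: $5', 'Buy']
print(exposed_text(image_page))   # []
```

The image-only page gives such a tool nothing to work with, which is exactly the property that is supposed to stop scrapers and exactly the property that locks out assistive technology.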

The fact is your idea is not going to work as a tool for others to use, because accessibility is a legal requirement for lots of people. You will probably be sued for a lot of money; if any company did use it, they would probably get sued too, and maybe they would sue you if they didn't like getting sued. And in the end the AI companies can probably figure out the graphics you generate well enough to scrape them anyway.

The most likely result of following this plan to build a product would be financial ruin; how bad would really depend on how successful you were initially. The greater the initial success, the worse off you would eventually be.

on edit: link to the Accessibility Object Model that I forgot to put in earlier: https://wicg.github.io/aom/spec/ It's not exactly something that screen readers work with now, but it is under development, so, like other specs under development, parts of it come out in early releases in specific browsers. At any rate, it is not designed to work with complete web pages rendered as an image.


bryanrasmussen 147 days ago

Sorry, I don't normally go around poking holes in people's hopeful business plans, but I have thought before about the problem of keeping sites from being scraped, and there are really two requirements everybody needs to work with that go against a decent anti-scraping tool, and those are:

automated testing requirements, which make it difficult to keep things from being scraped, because scraping is an automation process, and automated testing is obviously an automation process too.

and the need for accessibility: even if you didn't care about the moral requirement to be accessible, the legal requirements force you to be.

And I have also given some thought to automated generation of images for a sort of graphing application, so I am familiar with the performance issues as well. So your question of course hit all the points that I know something about.


bryanrasmussen 147 days ago

The real battle in anti-scraping is heuristic identification of humans: to get around it, a scraper needs to make their automated process behave more and more like a human, which makes the process less and less financially rewarding.
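
As a toy illustration of that kind of heuristic, here is a sketch that flags a session whose requests arrive too fast or too evenly spaced. The function name `looks_automated` and both thresholds are assumptions picked for the example, not values from any real product, and a production system would combine many more signals than timing.

```python
from statistics import mean, pstdev


def looks_automated(request_times, min_mean_gap=1.0, min_jitter=0.2):
    """Guess whether a session is a bot from request timestamps (seconds).

    Assumes request_times is sorted ascending. Flags sessions whose
    average gap is shorter than min_mean_gap, or whose gaps are
    suspiciously regular (standard deviation / mean below min_jitter).
    Both thresholds are made-up illustrative values.
    """
    if len(request_times) < 3:
        return False  # too little data to judge
    gaps = [b - a for a, b in zip(request_times, request_times[1:])]
    avg = mean(gaps)
    if avg <= 0 or avg < min_mean_gap:
        return True  # faster than a person plausibly reads and clicks
    return pstdev(gaps) / avg < min_jitter  # metronome-like timing


fast_bot = [0.0, 0.5, 1.0, 1.5, 2.0]
steady_bot = [0.0, 5.0, 10.0, 15.0, 20.0]
human = [0.0, 4.0, 19.0, 22.5, 60.0]

print(looks_automated(fast_bot))    # True
print(looks_automated(steady_bot))  # True
print(looks_automated(human))       # False
```

To slip past even this crude check, the scraper has to add long, irregular pauses, which cuts the number of pages it can fetch per hour. That is the cost pressure described above.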
