| HN Mirror


	by xurukefi 124 days ago
	Kinda off-topic, but has anyone figured out how archive.today manages to bypass paywalls so reliably? I've seen people claiming that they have a bunch of paid accounts that they use to fetch the pages, which is, of course, ridiculous. I figured that they have found an (automated) way to imitate Googlebot really well.

7 comments

jsheard 124 days ago

> I figured that they have found an (automated) way to imitate Googlebot really well.

If a site (or the WAF in front of it) knows what it's doing then you'll never be able to pass as Googlebot, period, because the canonical verification method is a DNS lookup dance which can only succeed if the request came from one of Googlebot's dedicated IP addresses. Bingbot is the same.
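A minimal sketch of that "DNS lookup dance", assuming the usual two-step check: a reverse (PTR) lookup on the client IP, then a forward lookup on the returned hostname that must include the original IP. The hostname suffixes and the helper names (`is_verified_googlebot`, `reverse_lookup`, `forward_lookup`) are assumptions for illustration; the crawler operator's current documentation is the authority on which domains to accept.

```python
import socket
from typing import Callable, List

# Assumed hostname suffixes for genuine crawler hosts; check current docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP -> hostname. Raises OSError on failure."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> List[str]:
    """A/AAAA lookup: hostname -> list of IP strings."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Return True only if the PTR name is in an accepted domain AND
    that name resolves back to the same IP."""
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # Anyone can set a PTR record for their own IP to claim any name,
        # so the forward lookup is what actually proves ownership.
        return ip in forward(host)
    except OSError:
        return False
```

The resolvers are injectable so the logic can be exercised without network access; in production the defaults would do real DNS queries and the result would normally be cached.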

xurukefi 124 days ago

There are ways to work around this. I've just tested this: I've used the URL inspection tool of Google Search Console to fetch a URL from my website, which I've configured to redirect to a paywalled news article. Turns out the crawler follows that redirect and gives me the full source code of the redirected web site, without any paywall.

That's maybe a bit insane to automate at the scale of archive.today, but I figure they do something along the lines of this. It's a perfect imitation of Googlebot because it is literally Googlebot.

jsheard 124 days ago

I'd file that under "doesn't know what they're doing" because the search console uses a totally different user-agent (Google-InspectionTool) and the site is blindly treating it the same as Googlebot :P

Presumably they are just matching on *Google* and calling it a day.
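To illustrate the difference between matching on *Google* and matching the actual crawler token, here is a rough sketch. The two sample User-Agent strings are illustrative assumptions, and the function names are hypothetical. A User-Agent header is trivially spoofable, so this is at best a first filter before any IP-based verification.

```python
import re

# Matches the "Googlebot/<major>.<minor>" product token only.
GOOGLEBOT_TOKEN = re.compile(r"\bGooglebot/\d+\.\d+\b")


def naive_is_googlebot(user_agent: str) -> bool:
    """The sloppy check: anything mentioning 'google' passes."""
    return "google" in user_agent.lower()


def strict_is_googlebot(user_agent: str) -> bool:
    """Accept only agents that carry the Googlebot product token."""
    return GOOGLEBOT_TOKEN.search(user_agent) is not None


# Illustrative sample strings (assumed, not copied from live traffic).
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
INSPECTION_UA = "Mozilla/5.0 (compatible; Google-InspectionTool/1.0;)"

for ua in (GOOGLEBOT_UA, INSPECTION_UA):
    print(naive_is_googlebot(ua), strict_is_googlebot(ua), ua)
```

With the naive check both agents are treated as Googlebot; with the strict check only the first one is.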

xurukefi 124 days ago

Sure, but maybe there are other ways to control Googlebot in a similar fashion. Maybe even with a pristine looking User-Agent header.

Aurornis 124 days ago

> which I've configured to redirect to a paywalled news article.

Which specific site with a paywall?

Aurornis 124 days ago

> I've seen people claiming that they have a bunch of paid accounts that they use to fetch the pages, which is, of course, ridiculous.

The curious part is that they allow scraping arbitrary pages on demand. So a publisher could put in a lot of arbitrary requests to archive their own pages and see whether they all come from a single account or a small subset of accounts.

I hope they haven't been stealing cookies from actual users through a botnet or something.

xurukefi 124 days ago

Exactly. If I was an admin of a popular news website I would try to archive some articles and look at the access logs in the backend. This cannot be too hard to figure out.
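A rough sketch of what "look at the access logs" could mean in practice, assuming the server writes the common "combined" log format and that the admin requested an archive of a unique probe URL. The probe path, IPs, and agents in `SAMPLE_LOG` are made up for illustration; a real site would more likely log the logged-in account ID at the application layer.

```python
import re
from collections import Counter
from typing import Iterable

# Combined log format: ip ident user [time] "method path proto" status size "referer" "agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def requests_for(lines: Iterable[str], path: str) -> Counter:
    """Count requests to `path`, grouped by (ip, user, user-agent)."""
    hits: Counter = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("path") == path:
            hits[(m.group("ip"), m.group("user"), m.group("agent"))] += 1
    return hits


# Hypothetical log excerpt; every value here is invented.
SAMPLE_LOG = [
    '203.0.113.9 - - [01/Jan/2025:10:00:00 +0000] "GET /probe-7f3a HTTP/1.1" 200 5123 "-" "ExampleBrowser/1.0"',
    '203.0.113.9 - - [01/Jan/2025:10:00:02 +0000] "GET /probe-7f3a HTTP/1.1" 200 5123 "-" "ExampleBrowser/1.0"',
    '198.51.100.4 - alice [01/Jan/2025:10:05:00 +0000] "GET /other HTTP/1.1" 200 900 "-" "OtherAgent/2.0"',
]

for key, count in requests_for(SAMPLE_LOG, "/probe-7f3a").items():
    print(count, key)
```

Repeating this for several probe URLs would show whether the fetches cluster on a handful of IPs, agents, or accounts.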

coppsilgold 124 days ago

You don't even need active measures. If a publisher is serious about tracing traitors there are algorithms for that (which are used by streamers to trace pirates). It's called "Traitor Tracing" in the literature. The idea is to embed watermarks following a specific pattern that would point to a traitor or even a coalition of traitors acting in concert.

It would be challenging to do with text, but is certainly doable with images - and articles contain those.
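A toy sketch of the watermark-pattern idea, under loud assumptions: each article has a number of "slots" (say, images) that can each be served in one of two visually identical variants, and every subscriber gets a keyed pseudo-random bit per slot. All names here (`codeword`, `trace`, the secret, the subscriber IDs) are hypothetical. This naive scheme does not resist a coalition that mixes copies; the traitor-tracing literature uses collusion-resistant codes for that.

```python
import hashlib
import hmac
from typing import List, Sequence, Tuple


def codeword(subscriber_id: str, secret: bytes, n_slots: int = 64) -> List[int]:
    """Derive one pseudo-random bit per watermark slot for a subscriber."""
    if not 0 < n_slots <= 256:
        raise ValueError("n_slots must be between 1 and 256")
    digest = hmac.new(secret, subscriber_id.encode(), hashlib.sha256).digest()
    return [(digest[i // 8] >> (i % 8)) & 1 for i in range(n_slots)]


def trace(
    observed: Sequence[int], subscribers: Sequence[str], secret: bytes
) -> List[Tuple[str, int]]:
    """Rank subscribers by Hamming distance between their codeword and
    the variant bits observed in a leaked copy (closest first)."""
    scored = []
    for sub in subscribers:
        expected = codeword(sub, secret, len(observed))
        distance = sum(o != e for o, e in zip(observed, expected))
        scored.append((sub, distance))
    return sorted(scored, key=lambda pair: pair[1])


secret = b"demo-secret"                   # hypothetical publisher key
subscribers = ["alice", "bob", "carol"]   # hypothetical accounts
leaked = codeword("bob", secret)          # bits read back from a leaked copy
print(trace(leaked, subscribers, secret)[0])
```

Ranking by distance rather than requiring an exact match lets the lookup tolerate a few slots that were lost or misread.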

bawolff 124 days ago

You need that sort of thing (i.e. watermarking) when people are intentionally trying to hide who did it.

In the archive.today case, it looks pretty automated. Surely just adding an html comment would be sufficient.
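A minimal sketch of the HTML-comment idea, assuming the publisher can inject a per-account marker at render time. The marker format (`rid:`), the key, and the function names are hypothetical. It only helps against a scraper that leaves comments in place.

```python
import hashlib
import hmac
import re
from typing import Iterable, Optional

MARKER = re.compile(r"<!-- rid:([0-9a-f]{16}) -->")


def marker_for(account_id: str, secret: bytes) -> str:
    """Opaque per-account tag: truncated HMAC wrapped in an HTML comment."""
    tag = hmac.new(secret, account_id.encode(), hashlib.sha256).hexdigest()[:16]
    return f"<!-- rid:{tag} -->"


def embed(html: str, account_id: str, secret: bytes) -> str:
    """Insert the marker just before the first closing body tag."""
    return html.replace("</body>", marker_for(account_id, secret) + "</body>", 1)


def identify(html: str, accounts: Iterable[str], secret: bytes) -> Optional[str]:
    """Return the account whose marker appears in `html`, if any."""
    m = MARKER.search(html)
    if m is None:
        return None
    for account in accounts:
        if hmac.compare_digest(marker_for(account, secret), m.group(0)):
            return account
    return None


page = "<html><body><p>Article text</p></body></html>"
served = embed(page, "acct-42", b"demo-key")
print(identify(served, ["acct-1", "acct-42"], b"demo-key"))
```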

fc417fc802 124 days ago

If they use paid accounts I would expect them to strip info automatically. An "obvious" way to do that is to diff the output from two separate accounts on separate hardware connecting from separate regions. Streaming services commonly employ per-session randomized steganographic watermarks to thwart such tactics. Thus we should expect major publishers to do so as well.

At which point we still lack a satisfactory answer to the question. Just how is archive.today reliably bypassing paywalls on short notice? If it's via paid accounts you would expect they would burn accounts at an unsustainable rate.

ouhamouch 124 days ago

Watch https://news.ycombinator.com/threads?id=1vuio0pswjnm7 they post AT-free recipes for many paywalls

tonymet 124 days ago

I’m an outsider with experience building crawlers. You can get pretty far with residential proxies and browser fingerprint optimization. Most of the b-tier publishers use RBC and heuristics that can be “worked around” with moderate effort.

quietsegfault 124 days ago

.. but what about subscription only, paywalled sources?

tonymet 124 days ago

Many publishers offer "first one's free".

For those that don't, I would guess archive.today is using malware to piggyback off of subscriptions.

elzbardico 124 days ago

> which is, of course, ridiculous.

Why? In the world of web scraping this is pretty common.

xurukefi 124 days ago

Because it works too reliably. Imagine what that would entail. Managing thousands of accounts. You would need to make sure you strip the account details from archived pages perfectly. Every time the website changes its code even slightly you are at risk of losing one of your accounts. It would constantly break and would be an absolute nightmare to maintain. I've personally never encountered such a failure on a paywalled news article. archive.today managed to give me a non-paywalled clean version every single time.

Maybe they use accounts for some special sites. But there is definitely some automated generic magic happening that manages to bypass paywalls of news outlets. Probably something Googlebot related, because those websites usually give Google their news pages without a paywall, probably for SEO reasons.

mikkupikku 124 days ago

Using two or more accounts could help you automatically strip account details.
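As a generic illustration of why two renderings help, here is a small sketch using Python's standard `difflib` to list the tokens that differ between two versions of the same text. The sample strings and the `differing_spans` name are made up; the same comparison is equally useful to a publisher who wants to see which parts of a page vary per account.

```python
import difflib
from typing import List, Tuple


def differing_spans(a: str, b: str) -> List[Tuple[str, str]]:
    """Return (text_in_a, text_in_b) for every region where the two
    whitespace-tokenized inputs disagree."""
    tokens_a, tokens_b = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((" ".join(tokens_a[i1:i2]), " ".join(tokens_b[j1:j2])))
    return spans


render_1 = "Welcome back alice@example.com to the article"
render_2 = "Welcome back bob@example.org to the article"
print(differing_spans(render_1, render_2))
```

Anything that is identical in both renderings, including a watermark shared by both sessions, would not show up in such a diff.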

xurukefi 124 days ago

That's actually a really neat idea.

wbmva 123 days ago

Do you know where the doxxed info ultimately originates from? It turns out that the archives leaked account names. Try Googling what happened to volth on Github.

permo-w 124 days ago

I could be wrong, but I think I've seen it fail on more obscure sites. But yeah it seems unlikely they're maintaining so many premium accounts. On the other hand they could simply be state-backed. Let's say there are 1000 likely paywalled sites, 20 accounts for each = 20k accounts, $10/month => $200k/month = $2.4m a year. If I were an intelligence agency I'd happily drop that plus costs to own half the archived content on the internet.

Surely it wouldn't be too hard to test. Just set up an unlisted dummy paywall site, archive it a few times and see what the requests look like.
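The back-of-the-envelope cost estimate in this comment works out as follows. The inputs (1000 sites, 20 accounts per site, $10 per account per month) are the commenter's guesses, not measured figures; the function name is hypothetical.

```python
def subscription_cost(sites: int, accounts_per_site: int, usd_per_month: int):
    """Return (total accounts, monthly cost, yearly cost) in USD."""
    accounts = sites * accounts_per_site
    monthly = accounts * usd_per_month
    yearly = monthly * 12
    return accounts, monthly, yearly


accounts, monthly, yearly = subscription_cost(1000, 20, 10)
print(accounts, monthly, yearly)  # 20000 200000 2400000
```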

Jordan-117 123 days ago

Interesting theory. It would also be a good way to subtly undermine the viability of news outlets, not to mention the insidious potential of altering snapshots at will. OTOH, I'd expect a state-sponsored effort to be more professional in terms of not threatening and smearing some blogger who questioned them.

permo-w 123 days ago

If I were an intelligence agency wanting to throw people off my scent, maybe I'd set up or pay off a blogger to track down my site's "owner" and then do some immature shit in response to absolutely confirm forever that the blogger was right.

Not saying this is true, just saying it could be

behringer 124 days ago

Replace any identifiers like usernames and emails with another string automatically.

cnst 123 days ago

It's because it's actively maintained, and bypassing the paywalls is its whole selling point, thus, they do have to be good at it.

They bypass the rendering issues by "altering" the webpages. It's not uncommon to archive a page and see nothing because of the paywalls; but then later on, the same page is silently fixed. They have a Tumblr where you can ask them questions; at one point it was quite common for people to ask them to fix random specific pages, which they did promptly.

Honestly, you cannot archive a modern page, unless you alter it. Yet they're now being attacked under the pretence of "altering" webpages, but that's never been a secret, and it's technologically impossible to archive without altering.

Jordan-117 123 days ago

There's a pretty massive difference between altering a snapshot to make it archivable/readable and doing it to smear and defame a blogger who wrote about you.

Cider9986 124 days ago

I imagine accounts are the only way that archive.today works on sites like 404media.co that seem to have server-side paywalls. Similarly, Twitter has a completely server-side paywall.

layer8 124 days ago

It’s not reliable, in the sense that there are many paywalled sites that it’s unable to archive.

xurukefi 124 days ago

But it is reliable in the sense that if it works for a site, then it usually never fails.

tonymet 124 days ago

No tool is 100% effective. Archive.today is the best one we've seen.