| HN Mirror



	by nwh 4537 days ago
	More likely, archive.is is pretending to be Googlebot. The author uses a lot of server-side tricks, such as fake Facebook accounts so that the site can archive Facebook pages properly.

1 comments

fabulist 4537 days ago

I'm attempting to replicate this. Searching Google for the last sentence (which is behind the paywall) brings the article up right away, so I think you're right. However, using Googlebot's user agent doesn't work, so the check must be slightly more sophisticated. The result linked from Google is also not paywalled, though going directly to the link is. So maybe they use a simpler strategy and just mess with the parameters. This is the result from Google: http://online.wsj.com/news/articles/SB1000142405270230388060...
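
For anyone who wants to repeat the user-agent test, here is a minimal sketch of it in Python. The function names are made up for illustration, and this is not a claim about what archive.is actually does. It only sends Googlebot's published user-agent string; a site that also verifies the crawler's IP address (for example by reverse DNS) would still reject it, which is one possible reason the simple spoof fails.

```python
import urllib.request

# Googlebot's published desktop user-agent string.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_googlebot_request(url: str) -> urllib.request.Request:
    """Build (but do not send) a GET request that claims to be Googlebot.

    Hypothetical helper name, used only for this sketch.
    """
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})


def fetch_as_googlebot(url: str, timeout: float = 10.0) -> str:
    """Send the spoofed request and return the response body as text."""
    with urllib.request.urlopen(build_googlebot_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Comparing the body returned by `fetch_as_googlebot(article_url)` against a fetch with a normal browser user agent shows whether the header alone changes what the server returns.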


nwh 4537 days ago

Searching Google for

    "cache:http://online.wsj.com/news/articles/SB10001424052702303880604579405852448992982?"

gets me the full-text article. Could they just be stripping the header from that page and displaying the result? I know it detects Google-cached pages in some circumstances.
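
The query above can be turned into a search URL with a few lines of Python. This is only a sketch: the helper name is hypothetical, the `https://www.google.com/search?q=` endpoint is an assumption about where the query is submitted, and whether Google still answers `cache:` queries is not something the code checks.

```python
from urllib.parse import urlencode


def google_cache_query_url(article_url: str) -> str:
    """Return a Google search URL for the query `cache:<article_url>`.

    Hypothetical helper; the search endpoint is an assumption.
    """
    query = f"cache:{article_url}"
    # urlencode percent-encodes the colon, slashes and "?" in the query.
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    print(google_cache_query_url(
        "http://online.wsj.com/news/articles/SB10001424052702303880604579405852448992982?"
    ))
```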
