by kalkin 3856 days ago

That's not quite what's happening.

For most PDFs on the site, we use Scribd's API to provide a nice in-browser display of the PDF. That viewer uses Flash or HTML5 with mangled text under the hood (often a separate element for each character, to preserve exact formatting), but we serve robots a plain-text version that is as clean as we can get it. We're definitely not trying to "fool OCR": we want the content to be indexed!

In this case, however, the link is (was) to a "Session" on the PDF rather than to the normal public-facing page. A Session is basically a forum, moderated by the author, where Academia.edu users discuss the paper. Session pages are supposed to be "meta: noindex", so we don't serve the plain-text version there. It looks like Google somehow indexed it anyway, but got the Scribd version, whose text is not meant for robotic consumption.
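
As a rough illustration of the behaviour described above, the decision could be sketched like this. It is not the actual Academia.edu code: the names `RenderPlan`, `plan_render`, `is_crawler`, the page kinds, and the crawler marker list are all hypothetical, and real crawler detection is more involved than a user-agent substring check.

```python
from dataclasses import dataclass

# Hypothetical list of user-agent substrings that mark a crawler.
CRAWLER_MARKERS = ("googlebot", "bingbot", "duckduckbot")


@dataclass
class RenderPlan:
    body: str      # "plain_text" or "embedded_viewer"
    noindex: bool  # True means the page carries a robots "noindex" meta tag


def is_crawler(user_agent: str) -> bool:
    """Crude check: does the user agent contain a known crawler marker?"""
    ua = user_agent.lower()
    return any(marker in ua for marker in CRAWLER_MARKERS)


def plan_render(user_agent: str, page_kind: str) -> RenderPlan:
    """Pick what to serve for a PDF page (assumed kinds: "public", "session")."""
    if page_kind == "session":
        # Discussion pages are meant to stay out of the index, so nobody
        # gets the clean plain-text version, crawlers included.
        return RenderPlan(body="embedded_viewer", noindex=True)
    if is_crawler(user_agent):
        # Public pages: robots get clean text so the content can be indexed.
        return RenderPlan(body="plain_text", noindex=False)
    # Public pages: people get the embedded viewer with per-character markup.
    return RenderPlan(body="embedded_viewer", noindex=False)
```

Under these assumptions, a crawler that ignores or never sees the noindex hint on a session page ends up with the embedded viewer's markup, which matches the outcome described above.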