Hacker News
by Asbostos 3849 days ago
Curious. It looks like academia.edu is mangling the PDF and presenting it as an image to fool OCR, and they succeeded in fooling Google's OCR, hence the weird letter substitutions.
1 comment

That's not quite what's happening.

For most PDFs on the site, we use Scribd's API to provide a nice in-browser display of the PDF. That viewer uses Flash or HTML5 with mangled text under the hood (often a separate element for each character, to preserve exact formatting). Robots, however, get a plain-text version that is as clean as we can get it. We're definitely not trying to "fool OCR" - we want the content to be indexed!

In this case, however, this is (was) a link to a "Session" on the PDF rather than the normal public-facing page. A Session is basically a forum, moderated by the author, where Academia.edu users can discuss it. That page is actually supposed to be "meta: noindex", so we don't serve the plain-text version there. It looks like Google somehow indexed it anyway, but got the Scribd version, whose text is not meant for robotic consumption.
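
To make the distinction concrete, here is a rough sketch of the serving logic described above. It is not Academia.edu's actual code: the function names, the page kinds and the crawler pattern are all made up for illustration, and a real site would rely on a maintained crawler list rather than a short regex.

```python
import re

# Hypothetical crawler signatures, for illustration only.
ROBOT_PATTERN = re.compile(r"googlebot|bingbot|crawler|spider", re.IGNORECASE)


def is_robot(user_agent: str) -> bool:
    """Guess whether a request comes from a crawler, based on its User-Agent."""
    return bool(ROBOT_PATTERN.search(user_agent or ""))


def choose_view(user_agent: str, page_kind: str) -> dict:
    """Decide what to serve for a PDF page.

    page_kind is "public" for the normal public-facing page and
    "session" for the author-moderated discussion page.
    """
    if page_kind == "session":
        # Session pages are not meant to be indexed: everyone gets the
        # viewer, and the page is flagged noindex.
        return {"body": "viewer", "noindex": True}
    if is_robot(user_agent):
        # Crawlers on public pages get the clean plain-text version.
        return {"body": "plain_text", "noindex": False}
    # Human visitors get the in-browser viewer with per-character markup.
    return {"body": "viewer", "noindex": False}


def robots_meta(view: dict) -> str:
    """Return the robots meta tag for a view, or an empty string."""
    return '<meta name="robots" content="noindex">' if view["noindex"] else ""
```

Under these assumptions, a crawler that ignores or never sees the noindex flag on a Session page would index the viewer markup, which matches what seems to have happened here.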