Hacker News
by r0muald 3854 days ago
I didn't know the post had been taken down, and I don't know why it was. The text is likely corrupted because it was written as a PDF document, and Academia.edu mangles the PDF that is shown online (via Scribd?).

I had downloaded the original PDF and I uploaded a copy here https://imgur.com/NUBD8nn for those who want to read the full version (I don't have the author's permission though).

1 comment

Curious. It looks like academia.edu is mangling the pdf and presenting it as an image to fool OCR, and they succeeded in fooling Google's OCR, hence the weird letter substitutions.
That's not quite what's happening.

For most PDFs on the site, we use Scribd's API to provide a nice in-browser display of the PDF, which uses Flash or HTML5 with mangled text under the hood (often separate elements for each character to preserve exact formatting), but we serve robots a plain-text version as clean as we can get it. We're definitely not trying to "fool OCR" - we want the content to be indexed!

In this case, however, this is (was) a link to a "Session" on the PDF: basically a forum, moderated by the author, where Academia.edu users can discuss it, rather than the normal public-facing page. Because of that, it's actually supposed to be "meta: noindex", and we don't serve the plain-text version. It looks like Google somehow indexed it anyway, but got the Scribd version, whose text is not meant for robotic consumption.
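
To make the behaviour described above concrete, here is a rough sketch of the decision logic, not the actual Academia.edu code. Every name in it (`ROBOT_MARKERS`, `PageResponse`, `choose_response`) is hypothetical, and the user-agent substring check is only an assumption about how crawlers might be detected.

```python
from dataclasses import dataclass

# Hypothetical crawler markers; a real site would use a fuller, maintained list.
ROBOT_MARKERS = ("googlebot", "bingbot", "duckduckbot")


@dataclass
class PageResponse:
    body_kind: str  # "plain_text" or "scribd_viewer"
    noindex: bool   # True -> emit <meta name="robots" content="noindex">


def is_robot(user_agent: str) -> bool:
    """Crude crawler check based on substrings of the User-Agent header."""
    ua = user_agent.lower()
    return any(marker in ua for marker in ROBOT_MARKERS)


def choose_response(user_agent: str, is_session_page: bool) -> PageResponse:
    """Pick what to serve for a PDF page.

    Session pages are meant to stay out of the index, so they always get the
    viewer plus a noindex flag and never the plain-text version. Normal public
    pages give robots clean plain text and give everyone else the viewer.
    """
    if is_session_page:
        return PageResponse(body_kind="scribd_viewer", noindex=True)
    if is_robot(user_agent):
        return PageResponse(body_kind="plain_text", noindex=False)
    return PageResponse(body_kind="scribd_viewer", noindex=False)
```

Under these assumptions, a crawler that ignores or misses the noindex flag on a Session page would still receive only the viewer markup, which would match the mangled text that ended up in the search results.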