by kalkin
3856 days ago
That's not quite what's happening. For most PDFs on the site, we use Scribd's API to provide a nice in-browser display of the PDF. Under the hood that display uses Flash or HTML5 with mangled text (often a separate element for each character, to preserve exact formatting), but we serve robots a plain-text version that is as clean as we can get it. We're definitely not trying to "fool OCR"; we want the content to be indexed! In this case, however, the link is (was) to a "Session" on the PDF, which is basically a forum, moderated by the author, where Academia.edu users discuss the paper. It is not the normal public-facing page, so it's actually supposed to be "meta: noindex", and we don't serve the plain-text version there. It looks like Google somehow indexed it anyway and got the Scribd version, whose text is not meant for robotic consumption.
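
To make the behaviour described above concrete, here is a rough sketch of the decision logic in Python. It is purely illustrative and is not Academia.edu's actual code: the function names, the crawler token list, and the returned dictionary keys are all assumptions made for the example.

```python
# Hypothetical sketch of the serving logic described in the comment above.
# All names here (KNOWN_CRAWLER_TOKENS, is_crawler, choose_rendering) are
# invented for illustration and do not come from any real codebase.

KNOWN_CRAWLER_TOKENS = ("googlebot", "bingbot", "duckduckbot")  # assumed list


def is_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a known crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_CRAWLER_TOKENS)


def choose_rendering(user_agent: str, is_session_page: bool) -> dict:
    """Pick which body to serve and which robots directive to attach."""
    if is_session_page:
        # Session pages are discussion pages, not the public paper page:
        # mark them noindex and never serve the plain-text version.
        return {"body": "viewer", "robots_meta": "noindex"}
    if is_crawler(user_agent):
        # Public page requested by a robot: serve the clean plain text.
        return {"body": "plain_text", "robots_meta": "index"}
    # Public page requested by a person: serve the embedded viewer.
    return {"body": "viewer", "robots_meta": "index"}
```

Under these assumptions, a crawler that reaches a Session page despite the noindex directive would still receive the viewer markup, which matches the mangled text that ended up in the index.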