by jfk13
1964 days ago
This reflects the fact that PDF uses the UTF-16BE encoding for Unicode text, not UTF-8. One oddity is that the PDF spec's description of the "Text String Type", e.g. at p.158 in https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdf_ref..., appears to say there should be a leading BOM (U+FEFF) in the string here, but this doesn't seem to be the case in reality. Indeed, adding one may cause issues for at least some PDF readers, according to the discussion at https://github.com/tesseract-ocr/tesseract/issues/1150. Anyhow, in short: that's simply UTF-16BE text represented as a series of bytes. It has nothing to do with any kind of "security hack", and the null bytes are not "bad characters"; they're the high byte of each UTF-16 code unit, which is zero for any character below U+0100.
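To make the byte layout concrete, here is a minimal Python sketch. The helper names `to_utf16be_hex` and `from_utf16be` are made up for illustration and are not taken from any PDF library. The decoder tolerates an optional leading BOM, since the spec and real-world files seem to disagree on whether one is present.

```python
def to_utf16be_hex(text: str) -> str:
    """Encode text as UTF-16BE (no BOM) and show the bytes as spaced hex."""
    return text.encode("utf-16-be").hex(" ")


def from_utf16be(data: bytes) -> str:
    """Decode UTF-16BE bytes, skipping a leading BOM (FE FF) if one is present."""
    if data.startswith(b"\xfe\xff"):
        data = data[2:]
    return data.decode("utf-16-be")


# Each ASCII character becomes two bytes: a zero high byte, then the ASCII value.
print(to_utf16be_hex("PDF"))  # 00 50 00 44 00 46
```

The zero bytes in the output are the "null bytes" in question: ordinary high bytes of 16-bit code units, nothing more.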