by rectang 636 days ago
To what extent does reading these formats accurately require the execution of code within the documents? In other words, not just stuff like zip expansion by a library dependency of rga, but for example macros inside office documents or JavaScript inside PDFs.

Note: I have no reason to believe such code execution is actually happening — so please don't take this as FUD. My assumption is that a secure design would involve running only external code and thus would sacrifice a small amount of accuracy, possibly negligible.

3 comments

Also note that it's not necessarily safe to read these documents even if you don't intend on executing embedded code. For example, reading from pdfs uses poppler, which has had a few CVEs that could result in arbitrary code execution, mostly around image decoding. https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=poppler

(No shade to poppler intended, just the first tool on the list I looked at.)

Couldn't, or shouldn't, each parser be run in a container with systemd-nspawn, LXC, or another container runtime? (Even if all it's doing is reading a file format into process memory as the current user, as non-executable (NX) data rather than code.)
NX bit: https://en.wikipedia.org/wiki/NX_bit

Executable-space protection > Limitations mentions JITs and ROP: https://en.wikipedia.org/wiki/Executable-space_protection

Relevant APIs: mprotect() on POSIX systems; VirtualAlloc[Ex] and VirtualProtect[Ex] on Windows.
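
On the container question: a lighter-weight variant of the same idea is to run each parser as a child process under resource limits. Below is a minimal Python sketch of that, assuming a Unix system. The helper name `run_limited` and the default limits are made up for illustration, and nothing here is claimed to be what rga does. It only caps CPU time and address space; it gives no filesystem or network isolation, so it is a complement to nspawn/LXC, not a replacement.

```python
import resource
import subprocess
import sys


def run_limited(cmd, cpu_seconds=5, max_bytes=1024 ** 3, timeout=10):
    """Run cmd in a child process with CPU-time and address-space caps.

    Hypothetical helper (Unix only). It limits resources but does not
    isolate the filesystem or the network.
    """

    def cap(which, value):
        # Never ask for more than the hard limit we already have.
        _soft, hard = resource.getrlimit(which)
        if hard != resource.RLIM_INFINITY:
            value = min(value, hard)
        # Set soft and hard together so the child cannot raise it again.
        resource.setrlimit(which, (value, value))

    def apply_limits():
        # Runs in the child between fork() and exec().
        cap(resource.RLIMIT_CPU, cpu_seconds)
        cap(resource.RLIMIT_AS, max_bytes)

    return subprocess.run(
        cmd,
        preexec_fn=apply_limits,
        capture_output=True,
        timeout=timeout,
        check=False,
    )


if __name__ == "__main__":
    # Stand-in for a real parser invocation.
    result = run_limited([sys.executable, "-c", "print('ok')"])
    print(result.returncode, result.stdout)
```

With poppler installed, something like `run_limited(["pdftotext", "some.pdf", "-"])` would apply the same caps to the PDF text extraction step. A parser bug that leads to code execution still runs as the current user here, which is why the container suggestion goes further.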

"NX bit: does it protect the stack?" https://security.stackexchange.com/questions/47807/nx-bit-do...

That's a qualitatively different kind of security topic, though. On the one hand, we have a bug in a tool that reads a passive format with complete accuracy. On the other we have the need to sacrifice some amount of accuracy to avoid executing embedded code in a dynamic file format.
This is why I like to try and parse shit myself for my own tools. Not that that's without risk, but I don't share my code, so it's untargeted. However, to support a wide variety of formats like this, the tools are OK. Most malicious code in a PDF will not target pdftotext, I think. I think it would target the things people open PDFs with, like browsers and maybe a few readers such as Adobe and Foxit Reader. pdftotext seems more like an 'academic target': a nice exercise, but not very fruitful in an actual attack. I might be wrong, though.
Citation indexes are the devil and Google is hell. Try as you might to avoid it but you're already on an index. Security through obscurity isn't secure or obscure in this modern age. https://www.tandfonline.com/doi/full/10.1080/03054985.2024.2...
None of them really execute "code". Pandoc has a pretty good write-up of the security implications of running it, which I think applies just as much to the other ones, with the added caveat of zip bombs.

https://pandoc.org/MANUAL.html#a-note-on-security
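
To make the zip bomb caveat concrete, here is a minimal Python sketch of a pre-extraction check using only the standard library. The function name `looks_like_zip_bomb` and both thresholds are arbitrary choices for illustration, not anything rga or Pandoc is known to do. It trusts the sizes declared in the archive's own headers, which a hostile file can falsify, and it does not look inside nested archives, so treat it as a first filter only.

```python
import io
import zipfile


def looks_like_zip_bomb(data, max_total=100 * 1024 * 1024, max_ratio=100):
    """Flag archives whose declared contents look implausibly large.

    Hypothetical helper. Thresholds are illustrative assumptions.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        total = 0
        for info in zf.infolist():
            # file_size is the declared uncompressed size of the member.
            total += info.file_size
            if total > max_total:
                return True
            # A very high compression ratio is the classic bomb signature.
            if info.compress_size > 0:
                if info.file_size / info.compress_size > max_ratio:
                    return True
    return False


if __name__ == "__main__":
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # 5 MiB of zeros compresses to a few KiB.
        zf.writestr("zeros.bin", b"\0" * (5 * 1024 * 1024))
    print(looks_like_zip_bomb(buf.getvalue()))
```

A stricter design would also cap the bytes actually produced while decompressing, since that does not depend on the header values being honest.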

It's just text. This isn't ripgrepping through your Excel macros, just the data that's actually in the Excel file.

I don't think there's a default excel adapter in rga.

(I wanted one somewhat recently, and a find for "xls" on the linked page returns 0 results.)

You are correct that rga doesn't ship with an Excel adapter out of the box. I have an open PR [1] to allow users to process XLS and XLSX files like any other Zip archive.

[1]: https://github.com/phiresky/ripgrep-all/pull/247
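
For anyone wondering what "like any other Zip archive" can look like for XLSX, here is a rough Python sketch. It is not the approach taken in the PR, which I haven't reproduced here. It assumes the usual XLSX layout where cell text lives in `xl/sharedStrings.xml`, and the function name `xlsx_shared_strings` is made up for illustration. It covers only XLSX. Nothing in the workbook is executed: the code only unzips one member and parses XML.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# SpreadsheetML main namespace used by xl/sharedStrings.xml.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def xlsx_shared_strings(data):
    """Return the shared strings of an XLSX file given as bytes.

    Hypothetical helper. ElementTree is not hardened against
    maliciously constructed XML, so untrusted input deserves extra care.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        if "xl/sharedStrings.xml" not in zf.namelist():
            return []
        root = ET.fromstring(zf.read("xl/sharedStrings.xml"))
    # Each <si> is one string; rich text splits it across several <t> runs.
    return [
        "".join(t.text or "" for t in si.iter(NS + "t"))
        for si in root.iter(NS + "si")
    ]


if __name__ == "__main__":
    xml = (
        '<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
        "<si><t>hello</t></si>"
        "<si><r><t>rich </t></r><r><t>text</t></r></si>"
        "</sst>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("xl/sharedStrings.xml", xml)
    print(xlsx_shared_strings(buf.getvalue()))
```

The resulting list of strings is plain text that can be searched like any other file content. Numbers and inline strings live in the worksheet parts, which this sketch does not read.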

On average, the macros in an Office document add features to the software and aren't run to render any content. So like toggling a group of settings or inserting some content or whatever. They may change the content, but it's done at a point in time by the user, not each time the document is opened.

And then, on average, most users don't use macros in their documents.

So yes, negligible.