| HN Mirror



	by rectang 636 days ago
	To what extent does reading these formats accurately require the execution of code within the documents? In other words, not just stuff like zip expansion by a library dependency of rga, but for example macros inside office documents or JavaScript inside PDFs. Note: I have no reason to believe such code execution is actually happening — so please don't take this as FUD. My assumption is that a secure design would involve running only external code and thus would sacrifice a small amount of accuracy, possibly negligible.

3 comments

fwip 636 days ago

Also note that it's not necessarily safe to read these documents even if you don't intend to execute embedded code. For example, reading PDFs uses poppler, which has had a few CVEs that could result in arbitrary code execution, mostly around image decoding. https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=poppler

(No shade to poppler intended, just the first tool on the list I looked at.)


westurner 636 days ago

Couldn't or shouldn't each parser be run in a container with systemd-nspawn or LXC or another container runtime? (Even if all it's doing is reading a file format into process space as NX data not code as the current user)
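A minimal sketch of the general idea, not a container: it only runs a parser in a child process with CPU and address-space caps, which is a much weaker boundary than systemd-nspawn or LXC (no filesystem or network isolation). The helper name `run_limited` and the limit values are hypothetical, and it assumes Linux.

```python
import resource
import subprocess
import sys


def run_limited(cmd, cpu_seconds=5, memory_bytes=1 << 31, timeout=10):
    """Run cmd in a child process with CPU and address-space caps (Linux)."""

    def cap(limit, value):
        # Never ask for more than the existing hard limit allows.
        _, hard = resource.getrlimit(limit)
        if hard != resource.RLIM_INFINITY:
            value = min(value, hard)
        resource.setrlimit(limit, (value, value))

    def apply_limits():
        # Runs in the child just before exec, so the parser inherits the caps.
        cap(resource.RLIMIT_CPU, cpu_seconds)
        cap(resource.RLIMIT_AS, memory_bytes)

    return subprocess.run(
        cmd,
        preexec_fn=apply_limits,
        capture_output=True,
        timeout=timeout,
        check=False,
    )


if __name__ == "__main__":
    # Stand-in for a real parser such as pdftotext: a trivial child process.
    result = run_limited([sys.executable, "-c", "print('parsed')"])
    print(result.returncode, result.stdout)
```

Resource limits bound runaway decoding, but they do nothing against a parser exploit that reads or writes the user's files, which is the case a container is meant to address.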


westurner 635 days ago

NX bit: https://en.wikipedia.org/wiki/NX_bit

Executable-space protection > Limitations mentions JITs and ROP: https://en.wikipedia.org/wiki/Executable-space_protection

Related APIs: mprotect(), VirtualAlloc[Ex], and VirtualProtect[Ex].

"NX bit: does it protect the stack?" https://security.stackexchange.com/questions/47807/nx-bit-do...


rectang 636 days ago

That's a qualitatively different kind of security topic, though. On the one hand, we have a bug in a tool that reads a passive format with complete accuracy. On the other we have the need to sacrifice some amount of accuracy to avoid executing embedded code in a dynamic file format.


sim7c00 636 days ago

This is why I like to try to parse stuff myself for my own tools. Not that that's without risk, but I don't share my code, so it's untargeted. However, to support a wide variety of formats like this, the tools are OK. Honestly, most code in a PDF will not target pdftotext, I think. I think it would target the thing people open PDFs with, like browsers and maybe a few readers like Adobe and Foxit Reader. pdftotext seems more like an 'academic target': a nice exercise, but not very fruitful in an actual attack. I might be wrong, though.


sadboi31 636 days ago

Citation indexes are the devil and Google is hell. Try as you might to avoid it but you're already on an index. Security through obscurity isn't secure or obscure in this modern age. https://www.tandfonline.com/doi/full/10.1080/03054985.2024.2...


traverseda 636 days ago

None of them really execute "code". Pandoc has a pretty good write-up of the security implications of running it, which I think applies just as much to the other ones, with the added caveat of zip bombs.

https://pandoc.org/MANUAL.html#a-note-on-security

It's just text; this isn't ripgrepping through your Excel macros, just the data that's actually in the Excel file.
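For the zip bomb caveat, here is a rough sketch of a pre-extraction check using Python's standard `zipfile` module. The function name `looks_like_zip_bomb` and both thresholds are made up for illustration, and this is not how rga or pandoc handle it. The sizes come from the archive's own headers, which an attacker controls, so a real tool would also cap the bytes it actually reads.

```python
import io
import zipfile


def looks_like_zip_bomb(data, max_total=100 * 1024 * 1024, max_ratio=100):
    """Return True if the archive's declared sizes exceed the given thresholds."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        total = 0
        for info in archive.infolist():
            total += info.file_size
            # compress_size can be 0 for empty members; avoid dividing by zero.
            ratio = info.file_size / max(info.compress_size, 1)
            if total > max_total or ratio > max_ratio:
                return True
    return False


if __name__ == "__main__":
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        # One megabyte of zeros compresses extremely well.
        archive.writestr("zeros.bin", b"\0" * (1024 * 1024))
    print(looks_like_zip_bomb(buf.getvalue()))
```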


maxerickson 636 days ago

I don't think there's a default Excel adapter in rga.

(I wanted one somewhat recently, and doing a find for "xls" on the linked page returns 0 results.)


lafrenierejm 635 days ago

You are correct that rga doesn't ship with an Excel adapter out of the box. I have an open PR [1] to allow users to process XLS and XLSX files like any other Zip archive.

[1]: https://github.com/phiresky/ripgrep-all/pull/247
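To illustrate what treating a spreadsheet as a zip archive can look like, here is a small sketch in Python. It is not the PR's implementation; `grep_zip` and its size cap are hypothetical. It assumes an XLSX file, which is a zip container whose cell strings usually live in `xl/sharedStrings.xml`; legacy binary `.xls` files are not zip-based and would not open this way.

```python
import io
import re
import zipfile


def grep_zip(data, pattern, max_member_bytes=10 * 1024 * 1024):
    """Yield (member name, matched text) for regex hits inside a zip container."""
    regex = re.compile(pattern)
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            with archive.open(info) as member:
                # Read at most max_member_bytes so one member cannot exhaust memory.
                raw = member.read(max_member_bytes)
            text = raw.decode("utf-8", errors="replace")
            for match in regex.finditer(text):
                yield info.filename, match.group(0)


if __name__ == "__main__":
    # Build a tiny stand-in for an XLSX file rather than reading a real one.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr(
            "xl/sharedStrings.xml",
            "<sst><si><t>quarterly revenue</t></si></sst>",
        )
    for name, hit in grep_zip(buf.getvalue(), r"revenue"):
        print(name, hit)
```

Only the stored XML text is searched here; nothing in the archive is executed, which matches the point made above about macros.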


maxerickson 636 days ago

On average, the macros in an Office document add features to the software and aren't run to render any content. So like toggling a group of settings or inserting some content or whatever. They may change the content, but it's done at a point in time by the user, not each time the document is opened.

And then, on average, most users don't use macros in their documents.

So yes, negligible.
