Thanks for this. Really quells the urge I get every so often to just code my own PDF editor, because they all suck and certainly it couldn't be THAT hard. Such hubris!
> PDF includes eight basic types of objects: Boolean values, Integer and Real numbers, Strings, Names, Arrays, Dictionaries, Streams, and the null object
Wait, this is more complete than SOAP. It may be a good idea to redo the IPC protocol with a different serialization format!
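Half-joking aside, the eight types do map fairly cleanly onto ordinary data structures. Here is a minimal sketch of a serializer, assuming plain ASCII strings and names and ignoring indirect references. `Name`, `Stream` and `to_pdf` are names I made up for illustration, not any library's API. In a real file a stream also has to be an indirect object, which this skips.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    value: str  # assumption: plain ASCII, no whitespace or delimiter characters


@dataclass
class Stream:
    extra: dict  # extra dictionary entries, keyed by Name
    data: bytes


def to_pdf(obj) -> bytes:
    """Serialize a Python value using PDF object syntax (simplified sketch)."""
    if obj is None:
        return b"null"
    if isinstance(obj, bool):  # must come before int: bool is a subclass of int
        return b"true" if obj else b"false"
    if isinstance(obj, int):
        return str(obj).encode("ascii")
    if isinstance(obj, float):
        # PDF reals have no exponent form, so write fixed-point and trim zeros
        return f"{obj:.6f}".rstrip("0").rstrip(".").encode("ascii")
    if isinstance(obj, str):
        # Literal string: escape backslashes and parentheses
        escaped = obj.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return b"(" + escaped.encode("ascii") + b")"
    if isinstance(obj, Name):
        return b"/" + obj.value.encode("ascii")
    if isinstance(obj, list):
        return b"[" + b" ".join(to_pdf(x) for x in obj) + b"]"
    if isinstance(obj, dict):
        items = b" ".join(to_pdf(k) + b" " + to_pdf(v) for k, v in obj.items())
        return b"<< " + items + b" >>"
    if isinstance(obj, Stream):
        # A stream is a dictionary (with /Length) followed by the raw bytes
        header = {**obj.extra, Name("Length"): len(obj.data)}
        return to_pdf(header) + b"\nstream\n" + obj.data + b"\nendstream"
    raise TypeError(f"no PDF representation for {type(obj).__name__}")


page = {Name("Type"): Name("Page"), Name("Count"): 3,
        Name("Kids"): [True, None, 1.5, "a(b)"]}
print(to_pdf(page).decode("ascii"))
```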
Section 7.5.6, "Incremental Updates", of the specification is interesting too, speaking of recovering data that people didn't think to properly remove from their PDF files.
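A rough sketch of why that works: an incremental update appends the changed objects plus a new cross-reference section and trailer to the end of the file, and each revision ends with its own `%%EOF` marker, so the bytes of earlier revisions are still sitting there. Cutting the file at each marker gives a candidate earlier revision. This is only a heuristic (linearized files can carry an extra `%%EOF` near the start, and the marker could in principle appear inside stream data), and `split_revisions` is a hypothetical helper name, not a library function.

```python
def split_revisions(data: bytes) -> list[bytes]:
    """Return one prefix of `data` per %%EOF marker, oldest revision first."""
    marker = b"%%EOF"
    revisions = []
    pos = data.find(marker)
    while pos != -1:
        end = pos + len(marker)
        # Everything up to and including this marker is a candidate revision
        revisions.append(data[:end])
        pos = data.find(marker, end)
    return revisions


# Toy stand-in for a file that was saved once and then updated incrementally
sample = b"%PDF-1.7\noriginal body\n%%EOF\nupdated body\n%%EOF\n"
for i, rev in enumerate(split_revisions(sample)):
    print(i, len(rev))
```

If a file yields more than one prefix, opening the shorter ones in a viewer is a quick way to check whether "deleted" content is still recoverable.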
I did a bunch of work creating PDFs using a low-level API, "object goes here" kind of stuff.
As far as I understand it, at its core a PDF is just a stream of instructions that keeps modifying the document. You can insert a thousand objects before you start the next word in a paragraph, and that is just the most basic stuff. Anything on a page can be anywhere in the stream. I don't know if you can go back and edit previous pages, but you might at least have a shot at understanding one page at a time.
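To make the "stream of instructions" idea concrete, here is a toy sketch of reading one page's content stream. The spec gives each page its own content stream, written postfix style: operands first, then the operator that consumes them. The parser below only handles numbers, names, simple literal strings and arrays. It skips hex strings, dictionaries and inline images, and it would misread keywords like `true` as operators. `parse_ops` is a made-up name for illustration.

```python
import re

# Literal strings (no nested parentheses), array brackets, names, everything else
TOKEN = re.compile(rb"\((?:[^()\\]|\\.)*\)|[\[\]]|/[^\s()\[\]/<>]*|[^\s()\[\]/<>]+")
NUMBER = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)")


def parse_ops(stream: bytes):
    """Group operands with the operator that follows them."""
    ops, operands, depth = [], [], 0
    for tok in TOKEN.findall(stream):
        if tok == b"[":
            depth += 1
        elif tok == b"]":
            depth -= 1
        is_operand = (
            depth > 0                       # anything inside an array
            or tok == b"]"                  # the bracket closing an array
            or tok[:1] in (b"/", b"(")      # names and literal strings
            or NUMBER.fullmatch(tok) is not None
        )
        if is_operand:
            operands.append(tok)
        else:
            # Any other token is treated as an operator ending the group
            ops.append((tok.decode("ascii"), operands))
            operands = []
    return ops


content = b"BT /F1 12 Tf 72 720 Td (Hello World) Tj ET"
for op, args in parse_ops(content):
    print(op, args)
```

Even in this tiny example the text itself is one operand among many, surrounded by font and position state set by earlier operators, which is why reconstructing paragraphs from a page is so awkward.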
Did you know you can have embedded XML in PDFs? You can have a paper form with all the data filled in and include an XML version of that for any computer systems that would like an easier way to read it.
Should take... a weekend tops? ;) PDF is crazy and scary