by crispyambulance
2338 days ago
It was really shocking when I learned that the way PDF works is as you describe: literally fragments of text with positions and essentially no semantics. I think a lot of folks find this out as I did, when they run into a project where they need to extract info from PDF documents. Without knowing anything about PDF, one can easily assume that it will be possible to do things like "can't we just extract some semantic structures like headings, tables, etc"... but nooo, it don't work that way! Discovering the true nature of PDF is a major WTF moment because we're so conditioned to expect documents to have a semantic structure. It's hard to understand how a standard can take the exact opposite approach and be so successful.
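To make "fragments of text with positions" concrete, here is a rough sketch of the kind of guesswork an extractor ends up doing. Everything in it is made up for illustration: the `Fragment` tuple, the sample coordinates, and the `y_tolerance` threshold are assumptions, not the output or API of any real PDF library.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """One positioned run of text, roughly what a page gives you (hypothetical shape)."""
    x: float     # horizontal position in points
    y: float     # vertical position in points (larger = higher on the page)
    size: float  # font size in points
    text: str


# Made-up sample data, deliberately out of reading order.
FRAGMENTS = [
    Fragment(120.0, 670.2, 10.0, "grew"),
    Fragment(160.0, 700.0, 18.0, "Report"),
    Fragment(72.0, 670.0, 10.0, "Revenue"),
    Fragment(72.0, 700.0, 18.0, "Quarterly"),
    Fragment(150.0, 670.0, 10.0, "4%."),
]


def group_into_lines(fragments, y_tolerance=2.0):
    """Guess text lines by clustering fragments with nearly equal y values."""
    lines = []  # each entry: (reference y, list of fragments on that line)
    for frag in sorted(fragments, key=lambda f: -f.y):  # top of page first
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))
    # Within each guessed line, read left to right and join with spaces.
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]


print(group_into_lines(FRAGMENTS))
```

Nothing in the data says "Quarterly Report" is a heading. The best you can do is guess from the larger font size, and that kind of guess is exactly what breaks on the next document.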
Imagine how bogged down and limited vector graphics would be if every element had to have semantic meaning? "This line connects the <body> of the <car> to the 13th <spoke> on the <wheel>".