by pacoverdi
2306 days ago
Many years ago, I regularly had to parse protocol specifications from various electronic exchanges. My general approach was a first pass with a Linux tool, pdftotext, to convert each spec to text. Something like:
  pdftotext -layout -nopgbrk -eol unix -f $firstpage -l $lastpage -y 58 -x 0 -H 741 -W 596 "$FILE"
After that, it was a matter of writing and tweaking custom text parsers (in Python or Java) until the output was acceptable. The output was generally an XML file consumed by the build, mainly to generate code.
A frequent need was to parse tables describing fields (name, id, description, possible values, etc.). Unfortunately, tables sometimes spanned several pages, and the column widths differed from page to page, which made column splitting difficult. So I annotated page jumps with markers (e.g. some 'X' characters indicating where to cut).
As someone else said, this is like black magic, but kind of fun :)
Edit: grammar
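To make the marker idea concrete, here is a minimal Python sketch. It is not the original parser: the function names, the rule that a marker line contains only 'X' characters and spaces, and the rule that each 'X' marks the first character of a column are all assumptions made for illustration.

```python
def cut_positions(marker_line, marker="X"):
    """Return the offsets at which the marker character appears."""
    return [i for i, ch in enumerate(marker_line) if ch == marker]


def split_row(line, cuts):
    """Split one text line at the given offsets and strip each cell."""
    bounds = [0] + list(cuts) + [None]
    return [line[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def parse_table(lines, marker="X"):
    """Return table rows as lists of cells.

    Assumption: a line made only of markers and spaces is an annotation
    inserted at a page jump; it replaces the cut offsets for the rows
    that follow and is not itself a row.
    """
    rows = []
    cuts = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if set(stripped) <= {marker, " "}:
            cuts = cut_positions(line, marker)
            continue
        rows.append(split_row(line, cuts))
    return rows
```

Each page gets its own marker line, so the same loop handles column widths that change at every page jump.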
See:
https://news.ycombinator.com/item?id=22156456
In the GNU Awk User's Guide:
https://www.gnu.org/software/gawk/manual/html_node/Multiple-...
Tracking column and field widths across page breaks is ... interesting, but more tractable.
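One way to track widths per page without hand-placed markers is to look for character positions that are blank on every line of a page. The Python sketch below is only a heuristic written for illustration, not what gawk does; the function names and the `min_gap` threshold are assumptions.

```python
def blank_columns(lines):
    """Return the offsets that hold a space on every line of the page."""
    lines = [l.rstrip() for l in lines]
    if not lines:
        return []
    width = max(len(l) for l in lines)
    padded = [l.ljust(width) for l in lines]
    return [i for i in range(width) if all(p[i] == " " for p in padded)]


def infer_cuts(lines, min_gap=2):
    """Return one cut offset per run of at least min_gap blank columns.

    Assumption: single spaces occur inside cell text, so only wider
    gaps count as column separators. The cut is placed at the first
    non-blank offset after the gap.
    """
    cuts = []
    run = []
    for i in blank_columns(lines) + [None]:  # None flushes the last run
        if run and (i is None or i != run[-1] + 1):
            if len(run) >= min_gap and run[0] > 0:
                cuts.append(run[-1] + 1)
            run = []
        if i is not None:
            run.append(i)
    return cuts
```

Running `infer_cuts` once per page gives a fresh set of offsets for each page, which can then be used to slice every row on that page. It will mis-split when a cell contains a wide run of spaces or when a column is empty on the whole page, so the results still need checking by eye.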