We're building a tracker for the EU legislative process. There is an XML markup for legislative documents (Akoma Ntoso), and we need to transform the PDFs that the EU publishes into it to allow, for example, user annotation (and a good HTML representation in general). We've built on this South African project: https://github.com/longhotsummer/slaw