by earth_walker
1653 days ago
I started doing this for a niche area: US and European regulations and guidance documents for Good Laboratory Practice, and later for Canadian cannabis regulations. Basically, I created a standard XML schema for regulations and parsed them into XML [1]. This allowed for, e.g., presenting tables of contents and section folding, pulling definitions out and linking them into their own search engine, etc. [2]

I thought that I could easily write a parser for each jurisdiction's format, and then get predicate rules and related regulations for free. I was wrong: a) there are many jurisdictions and sub-groups all doing their own thing; and b) most don't have any standard document formatting or tagging, let alone a defined structure. Even in the most structured formats (like the US eCFR's XML) the focus is on display rather than content. In the worst cases, whoever wrote up the Word document simply chose how to number and format chapters, sections, etc.

There were so many special cases that it was a huge amount of work to add or update each document, and I ended up doing a lot of categorization and fixing by hand.

[1] I know people on HN hate XML, but I did my research and had specific reasons for choosing it at the time, including human readability, nested sections, being able to easily publish and validate a schema, etc.

[2] See ReadtheRegs.com. You can browse the definitions page without an account.
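To make the idea above concrete, here is a rough sketch of what a content-focused regulation format could enable. It is not the actual ReadtheRegs schema: the element and attribute names (`regulation`, `section`, `num`, `definition`, `term`) and the sample text are all made up for illustration. It only shows that once sections nest and definitions are tagged, a table of contents and a definitions index fall out of a few lines of Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema and invented sample text, for illustration only.
SAMPLE = """\
<regulation jurisdiction="XX" title="Example Regulation">
  <section id="s1" num="1">
    <title>General Provisions</title>
    <section id="s1.1" num="1.1">
      <title>Scope</title>
      <p>Illustrative text only.</p>
    </section>
    <section id="s1.2" num="1.2">
      <title>Definitions</title>
      <definition term="test article">Any substance under study.</definition>
      <definition term="study director">The person responsible for the study.</definition>
    </section>
  </section>
  <section id="s2" num="2">
    <title>Facilities</title>
  </section>
</regulation>
"""


def table_of_contents(parent, depth=0):
    """Return (depth, number, title) for each nested <section>, in document order."""
    entries = []
    for section in parent.findall("section"):  # direct children only
        title = section.findtext("title", default="").strip()
        entries.append((depth, section.get("num"), title))
        entries.extend(table_of_contents(section, depth + 1))
    return entries


def definitions(root):
    """Map each defined term to (id of the enclosing section, definition text)."""
    index = {}
    for section in root.iter("section"):  # all sections at any depth
        for d in section.findall("definition"):
            index[d.get("term")] = (section.get("id"), (d.text or "").strip())
    return index


root = ET.fromstring(SAMPLE)
toc = table_of_contents(root)
defs = definitions(root)

# Print an indented table of contents.
for depth, num, title in toc:
    print("  " * depth + f"{num} {title}")
```

The hard part described above is not this step but getting real documents into such a structure in the first place, since each source numbers and formats its sections differently.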
[1] Table of contents for XML files: https://www.gesetze-im-internet.de/gii-toc.xml