by triggercut
968 days ago
Yes, this. I've been trying to find a general way to automatically semantically chunk various legislation for a while now, partly so as to diff various versions/amendments, but also to graph connections to other referenced legislation. Most of the time I end up having to just take half an hour to manually regex and format plain text.
A particular case I have is where there is a draft bill put out for industry/community consultation. Quickly diffing the releases is the goal, but for now it usually relies on one (preferably two) subject matter experts reading the whole thing top to bottom to build an understanding. I don't think these would be available via the means you've secured. They are usually hosted on the relevant government entity's website as PDFs.
One last question/comment: have you considered adding some additional reference info like the federal list of entities?[1]
[1] https://www.finance.gov.au/government/managing-commonwealth-...
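To make the chunk-then-diff idea concrete, here is a minimal sketch using only the Python standard library. It assumes, purely for illustration, that every section starts on a line of the form `Section <number> <title>`; real legislation varies by jurisdiction, so the `HEADING` pattern and the function names `split_sections` and `diff_sections` are hypothetical and would need adapting.

```python
import difflib
import re

# Assumed heading format: "Section 12A Definitions" at the start of a line.
HEADING = re.compile(r"^Section (\d+[A-Z]?)\b.*$", re.MULTILINE)


def split_sections(text: str) -> dict[str, str]:
    """Map each section number to its text, heading line included."""
    matches = list(HEADING.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1)] = text[match.start():end].strip()
    return sections


def diff_sections(old: str, new: str) -> dict[str, list[str]]:
    """Return a unified diff for each section that was added, removed or changed."""
    before, after = split_sections(old), split_sections(new)
    changes = {}
    for key in sorted(before.keys() | after.keys()):
        a, b = before.get(key, ""), after.get(key, "")
        if a != b:
            changes[key] = list(
                difflib.unified_diff(
                    a.splitlines(),
                    b.splitlines(),
                    fromfile=f"old/{key}",
                    tofile=f"new/{key}",
                    lineterm="",
                )
            )
    return changes
```

Diffing per section rather than over the whole file keeps an inserted section from showing up as noise in every section after it, at the cost of depending on the heading pattern being right.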
|
It's possible that they're in my database. I have included the as-made version of all bills on the Federal Register of Legislation. However, if they haven't had a first reading yet, then they're probably not there.
For processing PDFs, I recommend using `pdfplumber`, which is what I used to build the Corpus. Happy to discuss further if you'd like.
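As a rough starting point, a minimal sketch of pulling text out of a PDF with `pdfplumber` might look like the following. This is not the pipeline used to build the Corpus; the helper names `normalise` and `extract_pdf_text` are hypothetical, and the cleanup rules (rejoining words hyphenated across line breaks, collapsing whitespace) are assumptions that may not suit every document.

```python
import re


def normalise(text: str) -> str:
    """Tidy extracted text: rejoin hyphenated line breaks, collapse whitespace."""
    # Assumption: a hyphen at the end of a line always marks a split word.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    # Keep at most one blank line between blocks of text.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


def extract_pdf_text(path: str) -> str:
    """Extract and tidy the text of every page in the PDF at `path`."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        # extract_text() can yield nothing for image-only pages.
        pages = [page.extract_text() or "" for page in pdf.pages]
    return normalise("\n".join(pages))
```

Scanned PDFs without a text layer will come back empty with this approach and would need OCR first.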
> One last question/comment, have you considered adding some additional reference info like the federal list of entities?
Do you mean adding additional metadata? At the moment, I've kept the number of metadata attributes as low as possible. Every attribute added equates to more work to keep it standardised across all the jurisdictions and document types. My plan is to slowly add more attributes as I have time. I'd really like to associate a date with documents, but even that is a hurdle. I have to decide what date should be the date of a document (is it the time it was issued, the time it was published, the time it came into force, the time the latest version was issued, etc.?), and what happens when a document doesn't have a date. Should I extract it from its citation? How do I preserve time zone information? And so on.
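One way to frame the date question, as a sketch only: store each candidate date separately, require time zone information on all of them, and defer the choice of "the" date to a configurable precedence order. The `DocumentDates` class, its field names and the default order below are all hypothetical and are not part of the Corpus.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class DocumentDates:
    issued: Optional[datetime] = None
    published: Optional[datetime] = None
    in_force: Optional[datetime] = None
    latest_version: Optional[datetime] = None

    def __post_init__(self):
        # Reject naive datetimes so the time zone is never silently lost.
        for field in fields(self):
            value = getattr(self, field.name)
            if value is not None and value.utcoffset() is None:
                raise ValueError(f"{field.name} must be timezone-aware")

    def primary(self, order=("in_force", "issued", "published", "latest_version")):
        """Return (name, value) for the first date present in `order`, else None."""
        for name in order:
            value = getattr(self, name)
            if value is not None:
                return name, value
        return None
```

A document with no known dates simply returns `None` from `primary()`, which leaves the question of falling back to the citation as a separate step.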