by benn0 962 days ago
Fantastic work, and I really appreciate the write-up. It's quite timely for me: I'm from a tech background and have just started studying Australian law, and I was thinking about doing exactly this, so you are years ahead of me :).

Just one note: the link in your GitHub README to https://umarbutler.com/open-australian-legal-corpus doesn't seem to go anywhere.

For someone interested in using the data (and helping out with bugs/issues), where would you suggest starting?

1 comment

> Just one note: the link in your GitHub README to https://umarbutler.com/open-australian-legal-corpus doesn't seem to go anywhere.

Thanks for the heads up! I've fixed that now.

> For someone interested in using the data (and helping out with bugs/issues), where would you suggest starting?

I think the best place to start is by downloading the Corpus (visit https://huggingface.co/datasets/umarbutler/open-australian-l... , then click "Files and versions" and then "corpus.jsonl"). You can then use my Python library orjsonl to parse the dataset (you'd run `corpus = orjsonl.load('corpus.jsonl')`). At that point, there are any number of applications you could use the dataset for. You could pretrain a model like BERT or ELECTRA and share it on HuggingFace, or you could connect the dataset to GPT and do RAG over it, among other things.
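
If you'd like to peek at the file before installing anything, here is a rough sketch that reads it with only Python's standard `json` module rather than orjsonl. It assumes nothing about the Corpus beyond one JSON object per line; the helper names `stream_jsonl` and `summarize_keys` are made up for illustration and are not part of orjsonl. Streaming line by line avoids holding the whole file in memory, which may matter for a large download.

```python
import json
from collections import Counter
from pathlib import Path


def stream_jsonl(path):
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)


def summarize_keys(records):
    """Count how often each top-level key appears across the records."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts


# Example usage, assuming corpus.jsonl is in the working directory:
# print(summarize_keys(stream_jsonl("corpus.jsonl")))
```

Running `summarize_keys` over the stream is a quick way to see which fields the records actually carry before deciding what to build on top of them.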