| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ubutler 970 days ago

> Just one note - the link in your Github readme to https://umarbutler.com/open-australian-legal-corpus doesn't seem to go anywhere.

Thanks for the heads up! I've fixed that now.

> For someone interested in using the data (and help out with bugs/issues), where would you suggest starting?

I think the best place to start is by downloading the Corpus (visit https://huggingface.co/datasets/umarbutler/open-australian-l... , and then click "Files and versions" and then "corpus.jsonl"). You can then use my Python library orjsonl to parse the dataset (you'd run, `corpus = orjsonl.load('corpus.jsonl')`). At that point, there's any number of applications you could use the dataset for. You could pretrain a model like BERT, ELECTRA, etc... and share it on HuggingFace. You could connect the dataset to GPT and do RAG over it. Etc...