Training a Model to Extract Sections from Legal Documents
1 points by Philosophia 735 days ago
Hi folks - I’m looking to train a model that can review legal documents and extract specific sections from them. Here are the main challenges I’m facing:

* Varied Document Length: These filings can range from a few pages to hundreds of pages.
* Inconsistent Headers: The section headers aren’t consistent. For example, the same section might be titled “Claim,” “Defendant’s Claim,” “Defendant’s Argument,” or “Main Argument.” The tool needs to identify the section based on the content itself, not just the header.
* Identifying End Points: The model needs to know where a section ends, either at the next section header or when unrelated details begin (sometimes right after the paragraphs we want). It should be able to figure out the end point based on the context of the following paragraphs.

I know I might not be able to fully automate this process, but I’m looking for a way to get as close as possible without needing a lot of manual input. I need to handle ~1000 documents, so efficiency is key.

From what I understand, I have a couple of options:

* Fine-tuning BERT for tasks like Named Entity Recognition to pinpoint the sections.
* Using a Llama 3-like model that can handle longer contexts and work well with few-shot or zero-shot learning.

Any advice or guidance would be greatly appreciated! I’ve been going crazy trying to solve this, so any help would be a lifesaver.
1 comments

There's probably not much advice here because you seem to already have a handle on it.

Go slow and clumsy with your process before fine-tuning your approach. Embrace some manual labour at first to build yourself a high-quality validation set of docs against which you can compare the performance of your approaches. (But be wary of overfitting to even your validation set as you tweak your prompts.)
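
To make that concrete, here's a rough sketch of how I'd score an approach against the hand-labelled docs. Everything in it is an assumption on my part rather than a fixed recipe: it supposes each document is split into paragraphs and the target section is recorded as a paragraph range, and the names `Span`, `span_iou`, `evaluate` and `split_ids` are made up for illustration. The split helper sets some labelled docs aside so you only look at them occasionally, which is one way to notice when prompt tweaks are just fitting the docs you iterate on.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A section as paragraph indices: start inclusive, end exclusive (assumed format)."""
    start: int
    end: int


def span_iou(pred: Span, gold: Span) -> float:
    """Intersection-over-union of two paragraph ranges (0.0 disjoint, 1.0 identical)."""
    overlap = max(0, min(pred.end, gold.end) - max(pred.start, gold.start))
    union = (pred.end - pred.start) + (gold.end - gold.start) - overlap
    return overlap / union if union else 0.0


def evaluate(preds: dict[str, Span], golds: dict[str, Span]) -> dict[str, float]:
    """Mean IoU and exact-boundary rate over the hand-labelled docs."""
    if not golds:
        return {"mean_iou": 0.0, "exact_match": 0.0}
    total_iou = 0.0
    exact = 0
    for doc_id, gold in golds.items():
        pred = preds.get(doc_id)
        if pred is None:
            continue  # a missing prediction scores zero on both metrics
        total_iou += span_iou(pred, gold)
        exact += pred == gold
    n = len(golds)
    return {"mean_iou": total_iou / n, "exact_match": exact / n}


def split_ids(doc_ids, holdout_fraction=0.3, seed=0):
    """Deterministically split labelled doc IDs into (tuning, held-out) lists."""
    ids = sorted(doc_ids)
    random.Random(seed).shuffle(ids)
    cut = len(ids) - int(len(ids) * holdout_fraction)
    return ids[:cut], ids[cut:]


if __name__ == "__main__":
    golds = {"a": Span(3, 8), "b": Span(10, 12)}
    preds = {"a": Span(3, 8), "b": Span(11, 14)}
    print(evaluate(preds, golds))
```

Mean IoU tells you roughly how much of the right text you're getting, while the exact-match rate is the stricter check on whether the end points land where they should.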

Thanks @ac2u. Appreciate the insight.