Testing AI systems is challenging. We ran into this problem while collaborating on an AI system that revises sales contracts. Some of the questions we had to answer were:

- What makes the advice output by an AI assistant "good legal advice"?
- How do we break down the output of our system into discrete steps that we can test?
- How do we map each of those discrete steps onto our definition of "good legal advice" to make it measurable?

To answer them, we had to come up with a process: breaking down the AI legal advisor into testable components, transforming open-ended legal and usability questions into measurable quantities, and finally writing unit tests with Vitest and Poyro (https://github.com/poyro/poyro), a Vitest plugin we built, to find where the system did not align with expectations.

The steps we followed to come up with the tests also apply to non-legal AI apps. The linked essay (https://docs.poyro.dev/essays/unit-testing-a-legal-ai-app) provides runnable code examples for these tests that you can play with.

We hope you find this interesting. We found it insightful and fun to walk through a concrete use case end to end.
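
As a rough illustration of the process described above (not the code from the linked essay), here is a minimal TypeScript sketch of what "turning an open-ended question into something measurable" could look like. Everything in it is hypothetical: the `ClauseRevision` shape, the three criteria, and the `failedCriteria` helper are assumptions made for illustration, and it deliberately uses no Vitest or Poyro API. In a real suite each criterion would more likely be its own Vitest test, and subjective criteria would need a model-based judgment rather than the simple deterministic checks shown here.

```typescript
// Hypothetical shape for one discrete step of the system: a single suggested
// revision to a contract clause. The field names are assumptions.
interface ClauseRevision {
  originalClause: string;
  revisedClause: string;
  rationale: string;
}

// A criterion turns one facet of "good legal advice" into a yes/no check.
interface Criterion {
  name: string;
  check: (revision: ClauseRevision) => boolean;
}

// Illustrative criteria only; a real list would come from legal and
// usability requirements, not from this sketch.
const criteria: Criterion[] = [
  {
    name: "explains the change",
    check: (r) => r.rationale.trim().length > 0,
  },
  {
    name: "actually changes the clause",
    check: (r) => r.revisedClause.trim() !== r.originalClause.trim(),
  },
  {
    name: "keeps the day count stated in the original",
    check: (r) => {
      // Look for a term such as "30 days" in the original clause.
      const term = r.originalClause.match(/\b\d+\s+days\b/i);
      // If the original has no such term, there is nothing to preserve.
      if (term === null) return true;
      return r.revisedClause.toLowerCase().includes(term[0].toLowerCase());
    },
  },
];

// Returns the names of every criterion the revision fails, so a test can
// report exactly where the output did not meet expectations.
function failedCriteria(
  revision: ClauseRevision,
  checks: Criterion[],
): string[] {
  return checks.filter((c) => !c.check(revision)).map((c) => c.name);
}

// Usage: a made-up revision that silently changes the payment window.
const example: ClauseRevision = {
  originalClause: "Payment is due within 30 days.",
  revisedClause: "Payment is due within 90 days.",
  rationale: "",
};
console.log(failedCriteria(example, criteria));
```

Returning the list of failed criteria, rather than a single pass/fail flag, keeps each facet of the definition separately visible, which is the point of breaking the question down in the first place.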