by itrummer
247 days ago
We use mocking to replace actual LLM calls when testing the correctness of the ThalamusDB code. For performance benchmarking, we ran quite a few experiments measuring time, costs (fees for LLM calls), and result accuracy. Accuracy is the hardest of the three to evaluate, since we need to compare the ThalamusDB results to the ground truth. Often, we used data sets from Kaggle that come with manual labels. For example, with camera trap pictures labeled with the animal species, we can get the ground truth for test queries that count the number of pictures showing specific animals.
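To make the two ideas above concrete, here is a minimal sketch, not the actual ThalamusDB test code. It assumes a hypothetical `count_matching` helper that takes the LLM as a plain callable, so a `unittest.mock.Mock` can stand in for it, and a tiny made-up label set in place of a real Kaggle data set. All names, file names, and labels here are illustrative assumptions.

```python
from unittest.mock import Mock


def count_matching(image_paths, question, ask_llm):
    """Count the images for which the LLM answers 'yes' (hypothetical helper)."""
    return sum(1 for path in image_paths if ask_llm(path, question) == "yes")


def relative_error(result, truth):
    """Relative deviation of a count query result from the ground truth."""
    return abs(result - truth) / truth


# Made-up manual labels, standing in for a labeled camera trap data set.
labels = {"img1.jpg": "deer", "img2.jpg": "fox", "img3.jpg": "deer"}

# Correctness test: the mock replaces the real LLM call and answers from
# the labels, so the test is deterministic and incurs no fees.
mock_llm = Mock(
    side_effect=lambda path, question: "yes" if labels[path] in question else "no"
)
result = count_matching(list(labels), "Does the picture show a deer?", mock_llm)

# Ground truth for the count query, derived directly from the manual labels.
truth = sum(1 for species in labels.values() if species == "deer")
```

With the mock, `result` matches `truth` exactly and `mock_llm.call_count` shows how many LLM calls the query would have issued. In an accuracy benchmark, the mock would be swapped for the real model and `relative_error(result, truth)` reported instead.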
|
A problem I have with LLMs and the way they are marketed is that they are being treated as, and offered as if they were, toys.
You’ve given a few tantalizing details, but what I would really admire is a link to full details about exactly what you did to collect sufficient evidence that this system can be trusted and in what ways it can be trusted.