| You just have to double check the results whenever you tell Claude to extract lists and data. 99.9% of it will be correct, but sometimes 1 or 2 records are off. This kind of error is especially hard to notice because you're so impressed that Claude managed to do the extract task at all -- plus the results look wholly plausible upon eyeballing -- that you wouldn't expect anything to be wrong at all. But LLMs can get things ever slightly wrong when it comes to long lists/tables. I've been bitten by this before. Trust but verify. (edit: if the answer is "machine verifiable", one approach to ask an LLM to write a Python validator which it can execute internally. ChatGPT can execute code. I believe Sonnet 3.5 can too, but I haven't tried.) |
It's one thing to ask an LLM when George Washington was born, and have it return "May 20, 2020." It's another thing to ask it, and have it matter-of-factly hallucinate "February 20, 1733." At first glance, that... sounds right, right? President's Day is in February, and has something to do with his birthday? And that year seems to check out? Good enough!
But it's not right. And it's the confidence and bravado with which LLMs report these "facts" that's terrifying. It just misstates information, calculations, and detail work, because the stochastic model compelled it to, and there wasn't sufficient checks in place to confirm or validate the information.
Trust but verify is one of those things that's so paradoxical and cyclical: if I have to confirm every fact ChatGPT gives me with... what I hope is a higher source of truth like Wikipedia, before it's overrun with LLM outputs... then why don't I just start there? If I have to build a validator in Python to verify the output then... why not just start there?
We're going to see some major issues crop up from this sort of insidious error, but the hard part about off-by-ones is that they're remarkably difficult to detect, and so what will happen is data will slowly corrupt and take us further and further off course, and we won't notice until it's too late. We should be so lucky that all of LLMs' garbage outputs look like glue on pizza recommendations, but the reality is, it'll be a slow, seeping poisoning of the well, and when this inaccurate output starts sneaking into parts of our lives that really matter... we're probably well and truly fucked.