Checking is different from finding, though. Source checking just means "verify that this information is actually present in that document", which is much harder to hallucinate.
"Follow each link in this document. Read each link's contents against the contents in this document. Create a report: for each link list a working hyperlink, whether it exists, what claim it supports, whether it supports or fails to support it, and why"
If it returns a report claiming everything checks out? That's promising, but human verification is still important. You've got a list of hyperlinks and a list of claims, so you can middle-click each link, Ctrl-F till you find the point, and close the tab when you do.
If you find any discrepancies? Your initial prompt was malformed and/or you picked the wrong LLM, the wrong human, or possibly all three. Either way, the results are built on quicksand; you'll need to start over.
If no sources are provided? Well now: "If there ain't no sources it never happened."
Compare double-entry bookkeeping: it all needs to add up. If you're 1 cent off, something is broken. Likewise, if a single reference is off, it has polluted the context. (This works for human-generated and hybrid documents too. Polluted reasoning is polluted reasoning. The process is what counts.)
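A minimal Python sketch of that all-or-nothing rule, for the folks who think better in code. The names here (`LinkCheck`, `audit`) and the example.com URLs are made up for illustration; the rows are whatever you or the model filled in for each link.

```python
from dataclasses import dataclass


@dataclass
class LinkCheck:
    """One row of the report (hypothetical structure, not any tool's format)."""
    url: str
    claim: str
    exists: bool      # did the link resolve to a real page?
    supports: bool    # does the page actually back up the claim?


def audit(rows):
    """Return (ok, problems). ok is True only if every single row passes.

    Like double-entry bookkeeping: one bad row fails the whole report,
    and an empty report ("no sources") fails too.
    """
    if not rows:
        return False, ["no sources provided"]
    problems = []
    for row in rows:
        if not row.exists:
            problems.append(f"{row.url}: link does not exist")
        elif not row.supports:
            problems.append(f"{row.url}: does not support: {row.claim}")
    return not problems, problems


rows = [
    LinkCheck("https://example.com/a", "claim A", exists=True, supports=True),
    LinkCheck("https://example.com/b", "claim B", exists=True, supports=False),
]
ok, problems = audit(rows)
print(ok, problems)
```

The point of returning a single boolean is that there is no "mostly fine" outcome: either everything adds up or you start over.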
Gemini: The article focuses on the environmental and human labor costs of scaling Artificial Intelligence, specifically focusing on water usage, electricity, and "ghost work."
Which is hilarious, since the article doesn't even mention the words "water" or "electricity." Gemini remains unfazed, reporting on links that are not in the article (some don't exist at all) before making the final ruling: "The Tech Trenches document is highly accurate in its citations."
Now, I know. Had I used Claude Code with relevant skills, it would have done better. But would it be good?
Right, so https://techtrenches.dev/p/the-human-cost-of-10x-how-ai-is-p... is actually a Substack; Gemini is blocked from accessing it, so it bounces off and hallucinates instead. OK, that's an actual bug: a failed fetch should not lead to the model hallucinating. IMO the correct response would have been to fail loudly, which would have been a verification signal of its own.
PS: See also: https://news.ycombinator.com/item?id=48087485 ... I'm starting to think of it as "English is a new scripting language". Clearly the downside is that certain "runtime environments" are not compatible. %-/
https://techtrenches.dev/p/the-human-cost-of-10x-how-ai-is-p... "Follow each link in this document. Read each link's contents against the contents in this document. Create a report: for each link list a working hyperlink, whether it exists, what claim it supports, whether it supports or fails to support it, and why. If unable to fetch the initial document, Stop and report failure."
And now it errors out on gemini.google.com. This is like early-days Unix scripting: I didn't add the equivalent of "set -euo pipefail", and I didn't catch it because most systems already include something like it in their ".bashrc" (system prompt or weights) anyway.
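For comparison, here is roughly what "fail loudly" looks like if you write the fetch step yourself. This is only a sketch: `fetch_or_fail`, `FetchError` and the `fetcher` argument are hypothetical names, and `fetcher` stands in for whatever actually retrieves the page (a browser tool, an HTTP client, a model's fetch tool).

```python
class FetchError(RuntimeError):
    """Raised when the document cannot be retrieved (hypothetical name)."""


def fetch_or_fail(url, fetcher):
    """Return the page text, or raise. Never return a guess.

    fetcher is any callable taking a URL and returning text (assumption:
    it raises or returns None/empty when blocked).
    """
    try:
        text = fetcher(url)
    except Exception as exc:
        raise FetchError(f"could not fetch {url}: {exc}") from exc
    if not text or not text.strip():
        raise FetchError(f"empty response from {url}")
    return text


def blocked(url):
    # Stand-in for a site that refuses the request.
    raise PermissionError("403 Forbidden")


try:
    fetch_or_fail("https://example.com/article", blocked)
except FetchError as err:
    print("STOP:", err)
```

Same idea as the extra sentence in the prompt above: the failure itself is the signal, so nothing downstream gets to run on made-up input.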
This is so frustrating. I'm sorry. It's like the 1980s 8-bit era again: some systems actually work, others are terrible, and I didn't realize it could be like this for some folks. You could come away with the conclusion that this whole "computer" thing is just a fad that'll never amount to anything. (Meanwhile, the program works perfectly on my own machine, right over here of course %-) )
Claude web finds a number of small discrepancies in the sources, which I manually cross-checked; they seem consistent with a human mixing things up slightly.
+ I also tested in Gemini 3 Flash Preview, which generates an actual table (twice). It doesn't flag any discrepancies, which is consistent with it being a weaker model. But the URLs and claims are listed and line up, so you've got your verification table to work with. (It's a semantic formatting task, so that part would be hard to mess up.)
+ Summary of results: https://claude.ai/public/artifacts/10a42111-a0ee-42f3-b6d2-a... All of the models extracted the URLs into a table just fine, and that part at least is a lot easier than writing a Perl script used to be in the '90s ;-) . The first part is the important bit, so that you as a human can "check your fucking sources". The models handle the second part with varying success: each finds some discrepancies, but none finds all of them. That makes sense: this is a fairly polished piece, and ideally it shouldn't have discrepancies to begin with.
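If you'd rather not trust a model even for the extraction part, the table skeleton is a few lines of Python these days. A rough sketch, with made-up names (`extract_urls`, `checklist`) and a deliberately simple regex that will miss exotic URLs; the claim and verdict columns are left blank for the human.

```python
import re

# Simplistic pattern: stops at whitespace, angle brackets, parens, brackets.
URL_RE = re.compile(r"https?://[^\s<>()\[\]]+")


def extract_urls(text):
    """Return the unique URLs in text, in order of first appearance."""
    seen = []
    for match in URL_RE.findall(text):
        url = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        if url not in seen:
            seen.append(url)
    return seen


def checklist(text):
    """Build a plain-text verification table with empty columns to fill in."""
    lines = ["URL | exists? | claim | supported?"]
    for url in extract_urls(text):
        lines.append(f"{url} | | |")
    return "\n".join(lines)


sample = (
    "See https://example.com/a, and (https://example.org/b). "
    "Again https://example.com/a."
)
print(checklist(sample))
```

That gets you the list to middle-click through; judging whether each page supports its claim is still the part that needs a reader.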
So: it worked as a smoke check just fine in the above. Doing more than a quick smoke check obviously requires a somewhat more involved procedure.