| HN Mirror


by Kim_Bruning 34 days ago

A quick smoke check takes just a few minutes.

"Follow each link in this document. Read each link's contents against the contents in this document. Create a report: for each link list a working hyperlink, whether it exists, what claim it supports, whether it supports or fails to support it, and why"

If it returns a report claiming everything is correct? That's promising, but human verification is still important. You've got a list of hyperlinks and a list of claims, so you can middle-click each one, Ctrl-F until you find the point, and close the tab when you do.
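For the manual pass, the list of hyperlinks can also be pulled out without an LLM. A minimal sketch, assuming plain-text input; `extract_links` and `checklist` are hypothetical names, and the regex is a rough approximation rather than a full URL grammar:

```python
import re

# Assumption: a rough pattern that is good enough for a smoke check.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")

def extract_links(text: str) -> list[str]:
    """Return the unique URLs in `text`, in order of first appearance."""
    seen = set()
    links = []
    for match in URL_PATTERN.finditer(text):
        # Drop sentence punctuation that got glued onto the end of the URL.
        url = match.group(0).rstrip(".,;:!?")
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links

def checklist(text: str) -> list[dict]:
    """One row per link; the columns the prompt asks for start out blank."""
    return [
        {"url": url, "exists": None, "claim": None, "supports": None, "why": None}
        for url in extract_links(text)
    ]
```

The blank columns get filled in by hand (or compared against the model's report) while you click through the tabs.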

If you find any discrepancies? Your initial prompt was malformed and/or you picked the wrong LLM, the wrong human, or possibly all three. Either way, the results are built on quicksand; you'll need to start over.

If no sources are provided? Well now: "If there ain't no sources it never happened."

Compare double-entry bookkeeping: it all needs to add up. If you're 1 cent off, something is broken. Likewise, if a single reference is off, it has polluted the context. (This works for human-generated and hybrid documents too. Polluted reasoning is polluted reasoning. The process is what counts.)
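The all-or-nothing rule can be sketched in a few lines. This is only an illustration; `LinkCheck` and `ledger_balances` are hypothetical names, and the two boolean fields stand in for the "exists" and "supports" columns of the report:

```python
from dataclasses import dataclass

@dataclass
class LinkCheck:
    url: str
    exists: bool          # did the link resolve to a real page?
    supports_claim: bool  # does the page back the claim it is cited for?

def ledger_balances(checks: list[LinkCheck]) -> bool:
    """True only if there is at least one source and every one checks out."""
    if not checks:
        return False  # no sources at all: "it never happened"
    # A single bad reference fails the whole document, like being 1 cent off.
    return all(c.exists and c.supports_claim for c in checks)
```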


flail 34 days ago

A quick smoke test, then. Gemini 3, Thinking Mode. The article: https://techtrenches.dev/p/the-human-cost-of-10x-how-ai-is-p... The prompt: literally what you suggested.

Gemini: The article focuses on the environmental and human labor costs of scaling Artificial Intelligence, specifically focusing on water usage, electricity, and "ghost work."

Which is hilarious, since the article doesn't even mention the words "water" or "electricity." Gemini remains unfazed, reporting on links that are not in the article (some don't exist at all) to make the final ruling: "The Tech Trenches document is highly accurate in its citations."

Now, I know. Had I used Claude Code with relevant skills, it would have done better. But would it be good?


Kim_Bruning 34 days ago

Ah! I finally got your result somewhat replicated! It's https://gemini.google.com, when you use the free model.

* https://gemini.google.com/share/6bd33176b27c

Right, so https://techtrenches.dev/p/the-human-cost-of-10x-how-ai-is-p... is actually a Substack; Gemini is blocked from accessing it, so it bounces off and hallucinates instead. OK, that's an actual bug: a failed fetch should not lead the model to start hallucinating. IMO the correct response would have been to fail loudly, which would have been a verification signal of its own.

PS: See also: https://news.ycombinator.com/item?id=48087485 ... I'm starting to think of it as "English is a new scripting language". Clearly the downside is that certain "runtime environments" are not compatible. %-/


Kim_Bruning 34 days ago

https://techtrenches.dev/p/the-human-cost-of-10x-how-ai-is-p... "Follow each link in this document. Read each link's contents against the contents in this document. Create a report: for each link list a working hyperlink, whether it exists, what claim it supports, whether it supports or fails to support it, and why. If unable to fetch the initial document, Stop and report failure."

And now it errors out on gemini.google.com. This is like early-days Unix scripting: I didn't add the equivalent of "#!/bin/bash -euo pipefail", and I didn't catch it because most systems already include something like it in their ".bashrc" (system prompt or weights) anyway.
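In script form, the same "stop and report failure" guard might look like the sketch below. The names `FetchFailed` and `load_document` are hypothetical, and the fetcher is passed in as a parameter, so nothing here assumes how any particular tool actually retrieves pages:

```python
from typing import Callable

class FetchFailed(RuntimeError):
    """Raised when the initial document cannot be retrieved."""

def load_document(url: str, fetch: Callable[[str], str]) -> str:
    """Return the document text, or raise instead of carrying on with nothing."""
    try:
        text = fetch(url)
    except Exception as exc:
        # Fail loudly: the caller gets an error, not a made-up article.
        raise FetchFailed(f"could not fetch {url}: {exc}") from exc
    if not text.strip():
        # An empty body is treated as a failure too (an assumption of this sketch).
        raise FetchFailed(f"fetched {url} but got an empty body")
    return text
```

A real fetcher could be something like `lambda u: urllib.request.urlopen(u, timeout=10).read().decode("utf-8", "replace")`; a blocked or missing page then surfaces as `FetchFailed` rather than as a confident report.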

This is so frustrating. I'm sorry. It's like the 1980s 8-bit era again: some systems actually work, others are terrible, and I didn't realize it can be like this for some folks. You could come away with the conclusion that this whole "computer" thing is just a fad that'll never amount to anything. (Meanwhile, the program works perfectly on my own machine, right over here, of course %-) )


Kim_Bruning 34 days ago

> Now, I know. Had I used Claude Code with relevant skills, it would have done better. But would it be good?

Wait. Why do I suddenly suspect you were on to me this whole time?

Very well. Here's a skill that does the thing; you tell me: https://vps.kimbruning.nl/link-verifier.skill

While building, I realized I could actually make the whole thing a lot better, and really dig into sources. But... it's a start.

+ Output on your URL. Ugly, but it works: https://claude.ai/public/artifacts/d465a07b-378c-4089-b885-6...


simianwords 34 days ago

Gemini is famously bad at these things. Try using ChatGPT.


Kim_Bruning 34 days ago

Interesting! Where did you apply it? Can you show your output in more detail?

It's more like a small script, and it's supposed to extract URLs and generate a table.

Here's my result in Claude Web for comparison:

https://claude.ai/public/artifacts/d76936f2-c97b-4bff-9205-2...

Claude Web finds a number of small discrepancies in the sources, which I manually cross-checked; they seem consistent with a human mixing things up slightly.

+ I also tested in Gemini 3 Flash Preview, which generates an actual table (twice). It doesn't flag any discrepancies, which is consistent with it being a weaker model. But the URLs and claims are listed and line up, so you've got your verification table to work with. (It's a semantic formatting task, so that part would be hard to mess up.)

+ Gemini 3.1 Pro yields a fairly aggressive report: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

+ ChatGPT free (specific model not listed) needed 2 tries and didn't properly follow the prompt even then. I guess I got what I paid for. I also had to download the output: https://vps.kimbruning.nl/productivity/Ai%20Productivity%20A... (pdf), https://vps.kimbruning.nl/productivity/ai_productivity_artic... (md)

+ Kimi K2.6 instant: https://www.kimi.com/share/19e2cc40-d012-89bf-8000-00006267f...

+ Summary of results: https://claude.ai/public/artifacts/10a42111-a0ee-42f3-b6d2-a... All of the models extracted the URLs into a table just fine, and that part at least is a lot easier than writing a Perl script used to be in the '90s ;-). The first part is the important bit, so that you as a human can "check your fucking sources". The second part the models handle variously; each does find discrepancies. None of them finds all of them, but that makes sense: this is a fairly polished piece, and ideally it shouldn't have discrepancies to begin with.

So: it worked as a smoke check just fine in the above. Doing more than a quick smoke check obviously requires a somewhat more involved procedure.


throw310822 34 days ago

I would love to do it at scale on many online publications, and publish the results. That would teach 'em.
