Hacker News
by jasonthorsness 331 days ago
I think their vending machine project might need to succeed before you should trust Claude for investment advice:

https://www.anthropic.com/research/project-vend-1

Fun aside, finance and code can both depend critically on small details. Does finance have the same checks (linting, compiling, tests) that can catch problems in AI-generated code? I know Snowflake takes great pains to show whether queries generating reports are "validated" by humans or made up by AI. I think lots of people have these concerns.

5 comments

I disagree. Claude may fail at running a vending machine business, but I have used it to read 10-K reports and found it to be really good. There is a wealth of information in public filings that is legally required to be accurate but is often obfuscated in footnotes. I had an accounting professor that used to say the secret was reading (and understanding) the footnotes.

That’s a huge pain in the neck if you want to compare companies, worse if they are in different regulatory regimes. That’s the kind of thing I have found LLMs to be really good for.

For example, UnitedHealth buried in its financials that it hit its numbers by exiting equity positions.

It then _didn’t_ include a similar transaction (losing $7bn by exiting Brazil).

This was stuck in footnotes that many people who follow the company didn’t pick up.

https://archive.ph/fNX3b

How would someone using an LLM to explore the reports find such a thing?
This is why it’s important to follow the studies comparing LLMs’ performance in “needle-in-a-haystack” style tasks. They tend to be pretty good at finding the one thing wrong in a large corpus of text, though it depends on the LLM, the flavor (Sonnet, Opus, 8B, 27B, etc) and the size of the corpus, and there are occasional performance cliffs.
Did you go and look at the correctness of the information?

Because I have seen Claude, as recently as a week ago, completely inventing and citing whole nonexistent paragraphs from the documentation of some software I know well. Only because of that was I able to notice...

All models hallucinate. The likelihood of hallucinations is, however, strongly influenced by the way you prompt and construct your context.

But even if a human went through the documents by hand and tried to make the analysis, they're still likely to make mistakes. That's why we usually define the scientific method as making falsifiable claims, which you then try to disprove in order to make sure they're correct.

And if you can't do that, then you're always walking on thin ice, whatever tool or methodology you choose to use for the analysis.
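One cheap falsifiable check of the kind described above is to require the model to quote the passage it relies on, then confirm each quote really appears in the source. This is only a sketch: the filing text and quotes are made-up samples, and `normalize` and `unsupported_quotes` are hypothetical helper names. It catches invented passages, not wrong interpretations of real ones.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wrapping does not cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()

def unsupported_quotes(quotes: list[str], source: str) -> list[str]:
    """Return the quotes that do not appear verbatim in the source text."""
    haystack = normalize(source)
    return [q for q in quotes if normalize(q) not in haystack]

# Made-up sample data, for illustration only.
filing = """Note 7. The Company recorded a gain on the sale
of equity securities during the quarter."""
quotes = [
    "recorded a gain on the sale of equity securities",
    "management expects margins to double next year",
]

# Quotes the model attributed to the filing that are not actually in it.
missing = unsupported_quotes(quotes, filing)
print(missing)
```

Anything left in `missing` is a claim to go read by hand before trusting it.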

> hallucinations are however strongly influenced by the way you prompt and construct your context.

Show me the research supporting this argument. So far, RAG and similar approaches are what limit hallucinations.

Are you seriously unaware of what RAG is while still speaking with authority on the topic?

It's automatically retrieving information and adding it to the context. It's, in spirit, a convenience function so you don't have to manually provide it in the prompt. It's just a lot harder to pull off well automatically, but the fundamental practice is "just" context optimization.
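A toy sketch of that idea, assuming nothing about any real RAG library: `retrieve` and `build_prompt` are hypothetical names, and the word-overlap scoring stands in for the embedding search a real system would more likely use. The point is only that the "retrieval" step ends with text being pasted into the prompt.

```python
def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Rank passages by how many question words they share (toy scoring)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question: str, passages: list[str]) -> str:
    """Paste the top-ranked passages into the prompt ahead of the question."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Made-up passages, for illustration only.
passages = [
    "Revenue grew 8% year over year.",
    "The company sold equity positions in the quarter.",
    "Headquarters moved to a new building.",
]
prompt = build_prompt("Did the company sell equity positions?", passages)
print(prompt)
```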

You're essentially saying "but that's not driving!!!!" after someone goes by in an EV, because it ain't an ICE.

Not the same: "RAG vs. Long-context LLMs" - https://www.superannotate.com/blog/rag-vs-long-context-llms
> I had an accounting professor that used to say the secret was reading (and understanding) the footnotes.

He must have passed this secret knowledge on, as they all say it now...

It's mostly good, but one mistake can burn you severely.
A good bit of old advice is to read the notes first.
Would anyone pay for an LLM that can parse 10-K reports hallucination-free?

I was exploring this idea recently; maybe I should ship it.

Grok 4 SuperHeavy can almost certainly do this out of the box?
I haven't tried SuperHeavy, but why would it? All transformer-based LLMs are pretty prone to hallucinations, even with RAG... It can be pretty good, I guess.

Any articles to learn more about it?

That part about Claude suddenly going all in on being a human wearing a blazer and red tie and then getting paranoid about the employees was actually rather terrifying. I got strong "allegedly self-driving car suddenly steering directly into a barrier" vibes at that point.
Claude 3.7 orders titanium cubes.

Claude 4 orders Melaniacoin ETF.

Financial modeling does have formatting norms, e.g. different coloring for links, calculations, assumptions and inputs.

However, one of the major ways people know their model is correct is by comparing the final metrics against publicly available ones and, if they are out of sync, going through the file to figure out why they didn't calculate correctly.
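That reconciliation step is easy to automate as a first pass. A minimal sketch, with made-up figures and a hypothetical `out_of_sync` helper; the 1% tolerance is an arbitrary assumption, not an industry rule:

```python
import math

def out_of_sync(
    model: dict[str, float],
    published: dict[str, float],
    rel_tol: float = 0.01,
) -> dict[str, tuple[float, float]]:
    """Return metrics whose model value differs from the published value by more than rel_tol."""
    return {
        name: (model[name], published[name])
        for name in model.keys() & published.keys()
        if not math.isclose(model[name], published[name], rel_tol=rel_tol)
    }

# Made-up figures, for illustration only.
model = {"revenue": 1000.0, "net_income": 120.0, "eps": 2.40}
published = {"revenue": 1002.0, "net_income": 95.0, "eps": 2.40}

# Maps each mismatched metric to (model value, published value).
mismatches = out_of_sync(model, published)
print(mismatches)
```

This only tells you which metric to go trace through the file; finding the broken formula is still manual.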

Personally, I think this is going to be the same boon/disaster that Excel has been.

These tools are not being used for investment advice in the sense that you might go seek out an advisor. They're used for first-pass drafts of potential investments. Think deep research where the target is a company and the output is an investment thesis. There are a lot of rubbish companies out there looking for funding, so any sort of automation to filter the volume of info down helps.

>Does finance have the same checks

Nope. The closest is the double-entry system, and that only prevents the most egregious stuff. It's the equivalent of "you must close brackets" in code... it's a constraint, but the contents can still be hot garbage. For investment ideas there are literally zero guardrails; in fact, quite the opposite, as this demonstrates:

https://www.reddit.com/r/ChatGPT/comments/1k920cg/new_chatgp...
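To make the double-entry comparison concrete, here is a minimal sketch of what that check does and does not catch. The journal lines are made up, amounts are integer cents, and `unbalanced_entries` is a hypothetical helper name:

```python
from collections import defaultdict

def unbalanced_entries(lines):
    """Return ids of journal entries whose debits and credits differ.

    lines: iterable of (entry_id, account, debit, credit), amounts in integer cents.
    """
    totals = defaultdict(int)
    for entry_id, _account, debit, credit in lines:
        totals[entry_id] += debit - credit
    return sorted(entry_id for entry_id, diff in totals.items() if diff != 0)

# Made-up journal, for illustration only.
journal = [
    ("J1", "Cash", 50_000, 0),
    ("J1", "Revenue", 0, 50_000),           # balanced and plausible
    ("J2", "Office supplies", 900_000, 0),
    ("J2", "Cash", 0, 900_000),             # balanced, but the amount could still be wrong
    ("J3", "Cash", 10_000, 0),
    ("J3", "Revenue", 0, 1_000),            # does not balance
]

bad = unbalanced_entries(journal)
print(bad)
```

J3 is flagged, but J2 passes even if the amount is off by an order of magnitude, which is the "brackets are closed but the contents can still be garbage" situation.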