Hacker News new | ask | show | jobs
by simonw 500 days ago
I'm excited about this one - they seem to be directly targeting the "best model to run on a decent laptop" category, hence the comparison with Llama 3.3 70B and Qwen 2.5 32B.

I'm running it on a M2 64GB MacBook Pro now via Ollama and it's fast and appears to be very capable. This downloads 14GB of model weights:

  ollama run mistral-small:24b
Then using my https://llm.datasette.io/ tool (so I can log my prompts to SQLite):

  llm install llm-ollama
  llm -m mistral-small:24b "say hi"
More notes here: https://simonwillison.net/2025/Jan/30/mistral-small-3/
8 comments

The API pricing is notable too: they dropped the prices by half from the old Mistral Small - that one was $0.20/million tokens of input, $0.60/million for output.

The new Mistral Small 3 API model is $0.10/$0.30.

For comparison, GPT-4o-mini is $0.15/$0.60.

Competition will likely be cheaper. (For context, Deepinfra runs larger 32B models at $0.07/$0.16)
I make very heavy use of structured output (to convert unstructured data into something processable, eg for process mining on customer service mailboxes)

Is it any good for this, if you tested it?

I'm looking for something that hits the sweet spot of runs locally & follows prescribed output structure, but I've been quite underwhelmed so far

I thought structured output is a solved problem now. I've had consistent results with ollama structured outputs [1] by passing Zod schema with the request. Works even with very small models. What are the challenges you're facing?

[1] https://ollama.com/blog/structured-outputs

Structured output is solved, it is structuring data that's not, because that is an unbounded problem. There is no limit to how messy your data may be, and no limit to the accuracy and efficiency you may require.

I have used such models to structure human-generated data into sth a script can then read and process, getting important aspects in this data (eg what time the human reported doing X thing, how long, with whom etc) into like a csv file with columns eg timestamps and whatever variables I am interested in.

For anyone who thinks it isn't "solved", outlines debunked the paper which claims that "structured generation harms creativity":

https://blog.dottxt.co/say-what-you-mean.html

I get decent JSON from it quite well with the "assistant: {" trick. I'm not sure how well trained it is to do JSON. The template on ollama has tools calls so I assume they made sure JSON works: https://ollama.com/library/mistral-small:24b/blobs/6db27cd4e...
And for anyone looking to dig deeper, check out "grammar-based sampling."
What’s the “assistant: {" trick? You just end your prompt with that?
Mistral supports prefixes: https://docs.mistral.ai/guides/prefix/
That’s cool. However it only shows a few odd. I’d imagine the model needs to explicitly support this (be trained with it). None are about json… do you use that trick with json?
I use it to get JSON pretty often. See also: https://twitter.com/simonw/status/1885091289554968975
The only model that I've found to be useful in processing customer emails is o1-preview. The rest of the models work as well, but don't get all the minutia of the emails.

My scenario is pretty specific though and is all about determining intent (e.g. what does the customer want) and mapping it onto my internal structures.

The model is very slow, but definitely worth it.

It does decently well actually. You can test function-calling using Langroid. There are several example scripts you could try from the repo, e.g.

    uv run examples/basic/tool-extract-short-example.py --model ollama/mistral-small
sample output: https://gist.github.com/pchalasani/662d7f13dbe690d6e2bfef01c...

Langroid has a ToolMessage mechanism that lets you specify a tool/fn-call using Pydantic, which is then transpiled into system message instructions.

See function calling being called out here

https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-...

I've found phi4 to be very good for this.
What local models are you currently using and what issues are you facing?
Maybe I'm an outlier but I don't see much value in running tiny local models vs. using a more powerful desktop in my house to host a larger and far more usable model. I run Open WebUI and connect it to my own llama.cpp/koboldcpp that runs a 4-bit 70B model, and can connect to it anywhere easily with Tailscale. For questions that even 70B can't handle I have Open WebUI hit OpenRouter and can choose between all the flagship models.

Every time I've tried a tiny model it's been too questionable to trust.

Have you tried Gemma 27b? I’ve been using it with llamafile and it’s pretty incredible. I think the winds are changing a bit and small models are becoming much more capable. Worth giving some of the smaller ones a shot if it’s been a while. I can run Gemma 27b on my 32gb MacBook Pro and it’s pretty capable with code too.
Question for people who spent more time with these small models. What's a current best small model to extract information from a large number of pdfs? I have multiple collection of research articles. I want two tasks 1) Extract info from pdfs 2) classify papers based the content of the paper.

Or point me to right direction

Hey this is something we know a lot about. I'd say Qwen 2.5 32B would be the best here.

We've found GPT-4o/Claude 3.5 to benchmark at around 85% accuracy on document extraction. With Qwen 72B at around 70%. Smaller models will go down from there.

But it really depends on the complexity of the documents, and how much information you're looking to pull out. Is it something easy like document_title or hard like array_of_all_citations.

Most of them are experimental studies. So it would be text extraction of something like title, authors, species of the study, sample size etc. And classify based on the content of the pdfs.

I tried the GPT-4o, it's good but it'll cost a lot if I want to process all the documents.

1. You can get a 50% discount via batching.

2. Give a few Sonnet or 4o input/output examples to haiku, 4o-mini, or any other smaller model. Giving good examples to smaller models can bring the output quality closer to (or on par with) the better model.

Given you have 64GB RAM, you could run mistral-small:24b-instruct-2501-q8_0
Do you know how many tokens per second you are getting? I have a similar laptop that I can test on later but if you have that info handy let me know!
M2 max with 64GB: 14 tokens/s running `ollama run mistral-small:24b --verbose`
Hey Simon - In your experience, what's the best "small" model for function/tool calling? Of the ones I've tested they seem to return the function call even when it's not needed, which requires all kinds of meta prompting & templating to weed out. Have you found a model that more or less just gets it right?
I'm afraid I don't have a great answer for that - I haven't spent nearly enough time with function calling in these models.

I'm hoping to add function calling to my LLM library soon which will make me much better equipped to experiment here.

Cool, thanks for the reply. Looking forward to following along with your progress.
I don't understand the joke.
It's hardly a joke at all. Even the very best models tend to be awful at writing jokes.

I find the addition of an explanation at the end (never a sign of a good joke) amusing at the meta-level:

  Why did the badger bring a puffin to the party?

  Because he heard puffins make great party 'Puffins'!

  (That's a play on the word "puffins" and the phrase "party people.")
A man walks up to an llm and asks it to tell him a joke about a puffin and a badger.

The llm replies with a joke that is barely a joke.

The man says "another."

The llm gives another unfunny response.

"Another!"

Followed by another similarly lacking response.

"Another!"

With exasperation, the llm replies "stop badgering me!"

Except it won't, because that's not a high likelihood output. ;)

Now that your comment is in their training data, it's a high likelihood output!
And yet LLMs tend to in fact be very funny, just very very rarely on purpose.
Apparently "party puffin" is a company that sells cheap party supplies and decorations. That's all that I can think of.