| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by simonw 547 days ago

I'm excited about this one - they seem to be directly targeting the "best model to run on a decent laptop" category, hence the comparison with Llama 3.3 70B and Qwen 2.5 32B.

I'm running it on a M2 64GB MacBook Pro now via Ollama and it's fast and appears to be very capable. This downloads 14GB of model weights:

  ollama run mistral-small:24b

Then using my https://llm.datasette.io/ tool (so I can log my prompts to SQLite):

  llm install llm-ollama
  llm -m mistral-small:24b "say hi"

More notes here: https://simonwillison.net/2025/Jan/30/mistral-small-3/

8 comments

simonw 547 days ago

The API pricing is notable too: they dropped the prices by half from the old Mistral Small - that one was $0.20/million tokens of input, $0.60/million for output.

The new Mistral Small 3 API model is $0.10/$0.30.

For comparison, GPT-4o-mini is $0.15/$0.60.

85392_school 547 days ago

Competition will likely be cheaper. (For context, Deepinfra runs larger 32B models at $0.07/$0.16)

isoprophlex 547 days ago

I make very heavy use of structured output (to convert unstructured data into something processable, eg for process mining on customer service mailboxes)

Is it any good for this, if you tested it?

I'm looking for something that hits the sweet spot of runs locally & follows prescribed output structure, but I've been quite underwhelmed so far

enkrs 547 days ago

I thought structured output is a solved problem now. I've had consistent results with ollama structured outputs [1] by passing Zod schema with the request. Works even with very small models. What are the challenges you're facing?

[1] https://ollama.com/blog/structured-outputs

freehorse 547 days ago

Structured output is solved, it is structuring data that's not, because that is an unbounded problem. There is no limit to how messy your data may be, and no limit to the accuracy and efficiency you may require.

I have used such models to structure human-generated data into sth a script can then read and process, getting important aspects in this data (eg what time the human reported doing X thing, how long, with whom etc) into like a csv file with columns eg timestamps and whatever variables I am interested in.

Der_Einzige 547 days ago

For anyone who thinks it isn't "solved", outlines debunked the paper which claims that "structured generation harms creativity":

https://blog.dottxt.co/say-what-you-mean.html

the_mitsuhiko 547 days ago

I get decent JSON from it quite well with the "assistant: {" trick. I'm not sure how well trained it is to do JSON. The template on ollama has tools calls so I assume they made sure JSON works: https://ollama.com/library/mistral-small:24b/blobs/6db27cd4e...

a_wild_dandan 547 days ago

And for anyone looking to dig deeper, check out "grammar-based sampling."

azinman2 547 days ago

What’s the “assistant: {" trick? You just end your prompt with that?

simonw 547 days ago

Mistral supports prefixes: https://docs.mistral.ai/guides/prefix/

azinman2 547 days ago

That’s cool. However it only shows a few odd. I’d imagine the model needs to explicitly support this (be trained with it). None are about json… do you use that trick with json?

simonw 546 days ago

I use it to get JSON pretty often. See also: https://twitter.com/simonw/status/1885091289554968975

starik36 547 days ago

The only model that I've found to be useful in processing customer emails is o1-preview. The rest of the models work as well, but don't get all the minutia of the emails.

My scenario is pretty specific though and is all about determining intent (e.g. what does the customer want) and mapping it onto my internal structures.

The model is very slow, but definitely worth it.

d4rkp4ttern 546 days ago

It does decently well actually. You can test function-calling using Langroid. There are several example scripts you could try from the repo, e.g.

    uv run examples/basic/tool-extract-short-example.py --model ollama/mistral-small

sample output: https://gist.github.com/pchalasani/662d7f13dbe690d6e2bfef01c...

Langroid has a ToolMessage mechanism that lets you specify a tool/fn-call using Pydantic, which is then transpiled into system message instructions.

mohsen1 547 days ago

See function calling being called out here

https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-...

mercer 547 days ago

I've found phi4 to be very good for this.

rkwz 547 days ago

What local models are you currently using and what issues are you facing?

halyconWays 547 days ago

Maybe I'm an outlier but I don't see much value in running tiny local models vs. using a more powerful desktop in my house to host a larger and far more usable model. I run Open WebUI and connect it to my own llama.cpp/koboldcpp that runs a 4-bit 70B model, and can connect to it anywhere easily with Tailscale. For questions that even 70B can't handle I have Open WebUI hit OpenRouter and can choose between all the flagship models.

Every time I've tried a tiny model it's been too questionable to trust.

kamranjon 547 days ago

Have you tried Gemma 27b? I’ve been using it with llamafile and it’s pretty incredible. I think the winds are changing a bit and small models are becoming much more capable. Worth giving some of the smaller ones a shot if it’s been a while. I can run Gemma 27b on my 32gb MacBook Pro and it’s pretty capable with code too.

pks016 547 days ago

Question for people who spent more time with these small models. What's a current best small model to extract information from a large number of pdfs? I have multiple collection of research articles. I want two tasks 1) Extract info from pdfs 2) classify papers based the content of the paper.

Or point me to right direction

themanmaran 547 days ago

Hey this is something we know a lot about. I'd say Qwen 2.5 32B would be the best here.

We've found GPT-4o/Claude 3.5 to benchmark at around 85% accuracy on document extraction. With Qwen 72B at around 70%. Smaller models will go down from there.

But it really depends on the complexity of the documents, and how much information you're looking to pull out. Is it something easy like document_title or hard like array_of_all_citations.

pks016 547 days ago

Most of them are experimental studies. So it would be text extraction of something like title, authors, species of the study, sample size etc. And classify based on the content of the pdfs.

I tried the GPT-4o, it's good but it'll cost a lot if I want to process all the documents.

SparkyMcUnicorn 547 days ago

1. You can get a 50% discount via batching.

2. Give a few Sonnet or 4o input/output examples to haiku, 4o-mini, or any other smaller model. Giving good examples to smaller models can bring the output quality closer to (or on par with) the better model.

rahimnathwani 547 days ago

Given you have 64GB RAM, you could run mistral-small:24b-instruct-2501-q8_0

jhickok 547 days ago

Do you know how many tokens per second you are getting? I have a similar laptop that I can test on later but if you have that info handy let me know!

snickell 547 days ago

M2 max with 64GB: 14 tokens/s running `ollama run mistral-small:24b --verbose`

prettyblocks 547 days ago

Hey Simon - In your experience, what's the best "small" model for function/tool calling? Of the ones I've tested they seem to return the function call even when it's not needed, which requires all kinds of meta prompting & templating to weed out. Have you found a model that more or less just gets it right?

simonw 547 days ago

I'm afraid I don't have a great answer for that - I haven't spent nearly enough time with function calling in these models.

I'm hoping to add function calling to my LLM library soon which will make me much better equipped to experiment here.

prettyblocks 547 days ago

Cool, thanks for the reply. Looking forward to following along with your progress.

jonas21 547 days ago

I don't understand the joke.

simonw 547 days ago

It's hardly a joke at all. Even the very best models tend to be awful at writing jokes.

I find the addition of an explanation at the end (never a sign of a good joke) amusing at the meta-level:

  Why did the badger bring a puffin to the party?

  Because he heard puffins make great party 'Puffins'!

  (That's a play on the word "puffins" and the phrase "party people.")

dgacmu 547 days ago

A man walks up to an llm and asks it to tell him a joke about a puffin and a badger.

The llm replies with a joke that is barely a joke.

The man says "another."

The llm gives another unfunny response.

"Another!"

Followed by another similarly lacking response.

"Another!"

With exasperation, the llm replies "stop badgering me!"

Except it won't, because that's not a high likelihood output. ;)

airstrike 547 days ago

Now that your comment is in their training data, it's a high likelihood output!

becquerel 547 days ago

And yet LLMs tend to in fact be very funny, just very very rarely on purpose.

emmelaich 547 days ago

Apparently "party puffin" is a company that sells cheap party supplies and decorations. That's all that I can think of.