| HN Mirror

by sdrinf 99 days ago

Just want to echo the recommendation for qwen3.5:9b. This is a smol, thinking, agentic tool-using, text-image multimodal creature, with very good internal chains of thought. CoT can sometimes be excessive, but it leads to a very stable decision-making process, even across very large contexts, something we haven't seen from models of this size before.

What's also new here is the VRAM-context size trade-off: for 25% of its attention network, they use the regular KV cache for global coherency, but for 75% they use a new KV cache with linear(!!!!) memory-token-context size expansion! That means, e.g., ~100K tokens -> 1.5 GB VRAM use, so for the first time you can do extremely long conversations / document processing with, e.g., a 3060.
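For a rough feel of why caching fewer layers matters, here is a back-of-the-envelope sketch of regular KV-cache size. It assumes the 25% figure means a share of layers, and every shape value (layer count, KV heads, head width, fp16 storage) is a hypothetical placeholder rather than the real qwen3.5:9b configuration. It only models the regular KV cache; whatever the other 75% costs is left out, so it does not try to reproduce the 1.5 GB number above.

```python
def kv_cache_bytes(tokens, num_layers, kv_heads, head_dim,
                   bytes_per_value=2, full_attention_fraction=1.0):
    """Estimate regular KV-cache size in bytes.

    All shape parameters are hypothetical placeholders, not the real
    qwen3.5:9b configuration.
    """
    # Only this share of layers keeps keys and values for every token.
    cached_layers = round(num_layers * full_attention_fraction)
    # Each cached layer stores one key and one value vector per KV head per token.
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_value
    return tokens * cached_layers * per_token_per_layer


# Hypothetical shapes: 32 layers, 4 KV heads of width 256, 2-byte (fp16) values.
shape = dict(num_layers=32, kv_heads=4, head_dim=256)
full = kv_cache_bytes(100_000, **shape)
hybrid = kv_cache_bytes(100_000, full_attention_fraction=0.25, **shape)

print(f"all layers cached:    {full / 2**30:.2f} GiB")
print(f"25% of layers cached: {hybrid / 2**30:.2f} GiB")
```

Under these made-up shapes, the regular cache for a quarter of the layers is a quarter of the size; the real savings depend on what the remaining layers store.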

Strong, strong recommend.

6 comments

steve_adams_86 99 days ago

I've been building a harness for qwen3.5:9b lately (to better understand how to create agentic tools/have fun) and I'm not going to use it instead of Opus 4.6 for my day job but it's remarkably useful for small tasks. And more than snappy enough on my equipment. It's a fun model to experiment with. I was previously using an old model from Meta and the contrast in capability is pretty crazy.

I like the idea of finding practical uses for it, but so far haven't managed to be creative enough. I'm so accustomed to using these things for programming.

tempoponet 99 days ago

What kind of small tasks do you find it's good at? My non-coding use of agents has been related to server admin, and my local-llm use-case is for 24/7 tasks that would be cost-prohibitive. So my best guess for this would be monitoring logs, security cameras, and general home automation tasks.

steve_adams_86 99 days ago

That's about it. The harness is still pretty rudimentary so I'm sure the system could be more capable, and that might reveal more interesting opportunities. I don't really know.

So far I've got it orchestrating a few instances to dig through logs, local emails, git repositories, and github to figure out what I've been doing and what I need to do. Opus is waayyy better at it, but Qwen does a good enough job to actually be useful.

I tried having it parse orders in emails and create a CSV of expenses, and that went pretty badly. I'm not sure why. The CSV was invalid and full of bunk entries by the end, almost every time. It missed a lot of expenses. It would parse out only 5 or 6 items of 7, for example. Opus and Sonnet do spectacular jobs on tasks like this, and do cool things like create lists of emails with orders then systematically ensure each line item within each email is accounted for, even without prompting to do so. It's an entirely different category of performance.
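The "ensure each line item is accounted for" step can also be done outside the model. A minimal sketch, assuming a hypothetical three-column schema (`email_id`, `item`, `amount`) and assuming you already have an expected item count per email from a separate pass; none of these names come from my actual harness:

```python
import csv
import io
from collections import Counter

EXPECTED_COLUMNS = ["email_id", "item", "amount"]  # hypothetical schema


def check_expenses(csv_text, expected_counts):
    """Return a list of problems found in an extracted expenses CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]

    problems = []
    seen = Counter()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        try:
            float(row["amount"])  # a missing field is None and raises TypeError
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: bad amount {row['amount']!r}")
            continue
        seen[row["email_id"]] += 1

    # Compare valid rows per email against the expected item counts.
    for email_id, expected in expected_counts.items():
        if seen[email_id] != expected:
            problems.append(
                f"{email_id}: expected {expected} items, got {seen[email_id]}"
            )
    return problems


sample = (
    "email_id,item,amount\n"
    "msg-1,cable,9.99\n"
    "msg-1,adapter,abc\n"
    "msg-2,book,15.00\n"
)
problems = check_expenses(sample, {"msg-1": 2, "msg-2": 1})
for problem in problems:
    print(problem)
```

A non-empty result could be fed back to the model as a retry prompt, which is roughly the bookkeeping the bigger models seem to do on their own.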

Automation is something I'd like to dabble in next, but all I can think of it being useful for is mapping commands (probably from voice) to tool calls, and the reality is I'd rather tap a button on my phone. My family might like being able to use voice commands, though. Otherwise, having it parse logs to determine how to act based on thresholds or something would also be far better implemented with simple algorithms. It's hard to find truly useful and clear fits for LLMs.
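To make the "simple algorithms" point concrete, here is roughly what a threshold check on logs looks like without any model involved. The log format (`<timestamp> <LEVEL> <message>`), the threshold, and the function name are assumptions for illustration:

```python
import re
from collections import Counter

# Hypothetical log format: "<timestamp> <LEVEL> <message>"
LINE_RE = re.compile(r"^\S+ (?P<level>[A-Z]+) (?P<message>.*)$")


def decide_action(log_lines, error_threshold=3):
    """Return "alert" when ERROR lines reach the threshold, else "ok"."""
    levels = Counter()
    for line in log_lines:
        match = LINE_RE.match(line)
        if match:  # lines that do not fit the format are ignored
            levels[match.group("level")] += 1
    return "alert" if levels["ERROR"] >= error_threshold else "ok"


lines = [
    "2024-01-01T00:00:00 INFO service started",
    "2024-01-01T00:00:05 ERROR disk read failed",
    "2024-01-01T00:00:06 ERROR disk read failed",
    "2024-01-01T00:00:07 ERROR disk read failed",
]
print(decide_action(lines))
```

It gives the same answer every time for the same input, which is hard to promise with an LLM in the loop.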

novok 99 days ago

Oh man you just gave me an idea to use something like qwen 3.5 to categorize a lot of emails. You can keep the context small, do it per email and just churn through a lot of crap.

The 0.8B can do this pretty well.

Actually pg's original "A plan for spam" explains how to do this with a Bayesian classifier.
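For anyone curious what that looks like, here is a minimal sketch of a Bayesian text classifier. It is a plain multinomial naive Bayes with add-one smoothing, not the exact probability-combining scheme from the essay, and the training strings and labels are made up:

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = {}        # label -> Counter of token counts
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()           # every token seen during training

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.vocab.update(tokens)
        self.doc_counts[label] += 1

    def classify(self, text):
        """Return the most likely label, or None if nothing was trained."""
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus log likelihood of each token, add-one smoothed.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes()
model.train("win cash prize now", "spam")
model.train("cheap pills win big", "spam")
model.train("meeting notes attached", "ham")
model.train("lunch at noon tomorrow", "ham")

print(model.classify("win a prize"))
print(model.classify("meeting at noon"))
```

Training and classifying are both just counting and adding logs, so it runs per email with no context window to worry about.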

steve_adams_86 98 days ago

I've been learning to apply these lately and it has been pretty eye opening. Combined with Fourier analysis (for example) you can do what seems kind of like magic, in my opinion. But it has been possible since long before LLMs showed up.

Totally different categories and different use cases, but the more I learn about LLMs the more I discover there's a powerful, deterministic, well-established statistical model or two to do the same thing.

Really, LLMs are kind of like convenient, wildly inefficient proxies for useful processes. But I'm not convinced they should often end up as permanent fixtures of logical pipelines. Unless you're making a chat bot, I guess.

alexpotato 98 days ago

I was just chatting with a co-worker who wanted to run an LLM locally to classify a bunch of text. He was worried about spending too many tokens though.

I asked him why he didn't just have the LLM build him a classifier based on a Python ML library instead.

The LLMs are great but you can also build supporting tools so that:

- you use fewer tokens

- it's deterministic

- you as the human can also use the tools

- it's faster b/c the LLM isn't "shamboozling" every time you need to do the same task.

nunodonato 99 days ago

you can use 4B for that, it's quite good

threecheese 99 days ago

You can really see the limitations of qwen3.5:9b in reasoning traces; it’s fascinating. When a question “goes bad”, sometimes the thinking tokens are WILD - it’s like watching Poirot after a head injury.

Example: “what is the air speed velocity of a swallow?” - qwen knew it was a Monty Python gag, but couldn’t and didn’t figure out which one.

scottmf 99 days ago

As a person who also knows there's a connection between that phrase and Monty Python and not much more information beyond that, I'm not sure how to feel.

cassianoleal 99 days ago

African or European?

plank 99 days ago

My favourite colour is blue. Oh, no, it is...

8note 98 days ago

could that be some of the RL trying to get it to not regurgitate?

the gag is giving in detail which one

threecheese 98 days ago

https://gist.github.com/mikewaters/7ebfbc73eb8624f917c5b4167...

It thinks like its memory is broken and it’s unaware of it; over 100 lines like this:

    - Wait, no, that's not right either.
    - Let's recall the specific line. It goes like this:
        - Knight A: "How can you have a swallow?"
        - Knight B: "It is the air speed velocity of a swallow."
        - Actually, the most common citation is from the movie where they ask an expert on swallows? No.

kingo55 99 days ago

How does it compare in quality with larger models in the same series? E.g. 122b?

buzzin_ 99 days ago

The chart on this link compares all qwen3.5 models down to 0.8B.

https://www.reddit.com/r/LocalLLaMA/comments/1ro7xve/qwen35_...

ggsp 99 days ago

How much difference are you seeing between standard and Q4 versions in terms of degradation, and is it constant across tasks or more noticeable in some vs others?

rnewme 99 days ago

Less than expected, search for Unsloth's recent benchmark

RRRA 98 days ago

I'd be curious to see people give their opinion on embedded models for less tech focused needs, say what's that bug killing spray chemistry like or what is the history of this or that...

I'd also be curious to see if people have started doing censorship analysis of various models, like Qwen deferring to government documents on Tiananmen Square while Llama straight up answers the question.

boppo1 98 days ago

Is qwen 3.5 any good for chatting? I use chatgpt for 'light therapy' (basically sounding out confusing social situations my friends don't want to walk me through) and it's honestly been amazing. But I would rather not give all that to openai.