| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by theanonymousone 607 days ago
	May I ask if anyone has successfully used 1B and 3B models in production and if yes, in what use cases? I seem to be failing even in seemingly simpler tasks such as word translation or zero-shot classification. For example, they seem to not care about instructions to only write a response and no explanation, thus making it impossible to use them in a pipeline :/

11 comments

com2kid 606 days ago

3B models are perfectly capable, I've had great luck with Phi 3.5.

> For example, they seem to not care about instructions to only write a response and no explanation

You need to use tools to force the model to adhere to a schema. Or you can learn to parse out the part of the response you want, both work.

You'll also need to make good use of robust examples in your initial prompt, and give lots of examples of how you want the output to look. (Yes this quickly burns up the limited context length!)

Finally, embrace the fact that these models are tuned for chat, so the more conversational you make the back and forth the less you are stretching the models abilities.

I wrote a very small blog post at https://meanderingthoughts.hashnode.dev/unlock-the-full-pote... explaining some of this.

teleforce 606 days ago

I wonder if CUE can help the situation in similar fashion to the DSL methods that you've described in your blog post [1]. After all CUE fundamentals are based on feature structure from the deterministic approach of NLP unlike LLM that's stochastic NLP [2],[3]. Perhaps deterministic and non-deterministic approaches is the potent combination that can effectively help reduce much of the footprint to get to the same results and being energy efficient in the process.

[1] Cue – A language for defining, generating, and validating data:

https://news.ycombinator.com/item?id=20847943

[2] Feature structure:

https://en.m.wikipedia.org/wiki/Feature_structure

[3] The Logic of CUE:

https://cuelang.org/docs/concept/the-logic-of-cue/

com2kid 606 days ago

On my LinkedIn post about this topic someone actually replied with a superior method of steering LLM output compared to anything else I've ever heard of, so I've decided that until I find time to implement their method, I'm not going to worry about things.

tl;dr you put into the prompt all the JSON up until what you want the LLM to say, and you set the stop token to the end token of the current JSON item (so ',' or '}' ']', whatever) and you then your code fills out the rest of the JSON syntax up until another LLM generated value is needed.

I hope that makes sense.

It is super cool, and I am pretty sure there is a way to make a generator that takes in an arbitrary JSON schema and builds a state machine to do the above.

The performance should be super fast on locally hosted models that are using context caching.

Eh I should write this up as a blog post, hope someone else implements it, and if not, just do it myself.

shawnz 606 days ago

There are many solutions for constrained/structured generation with LLMs these days, here is a blog post my employer published about this a while back: https://monadical.com/posts/how-to-make-llms-speak-your-lang...

I'm partial to Outlines lately, but they all have various upsides and downsides.

OpenAI even natively added support for this on their platform recently: https://openai.com/index/introducing-structured-outputs-in-t...

hedgehog 606 days ago

This is a really good post. I did find one error, Instructor works well with at least one other back end (Ollama).

Outlines looks quite interesting but I wasn't able to get it to work reliably.

zackangelo 606 days ago

With mixlayer, because the round trip time to the model is so short, you can alternate between appending known tokens of the JSON output and values you want the model to generate. I think this works better than constraining the sampling in a lot of cases.

We haven’t built a state machine over JSON schema that uses this approach yet but it’s on the way.

com2kid 606 days ago

> With mixlayer, because the round trip time to the model is so short, you can alternate between appending known tokens of the JSON output and values you want the model to generate. I think this works better than constraining the sampling in a lot of cases.

Wow, that is a much more succinct way of describing it!

> We haven’t built a state machine over JSON schema that uses this approach yet but it’s on the way.

Really this should just be a simple library in JS and Python. Schema goes in, state machine pops out.

Complications will be around optional fields, I'm not sure offhand how to solve that!

zackangelo 606 days ago

I'd love it if you checked out what we've been working on.

It's still in early stages, but might be usable for something you're trying to build. Here's an example (this buffers the entire JSON object, but you can also gen as you go): https://docs.mixlayer.com/examples/json-output

wswope 607 days ago

I’ve only toyed with them a bit, and had a similar experience - but did find I got better output by forcing them to adhere to a fixed grammar: https://github.com/ggerganov/llama.cpp/tree/master/grammars

For context, I was playing with a script to bulk download podcasts, transcribe with whisper, pass the transcription to llama.cpp to ID ads, then slice the ads out with ffmpeg. I started with the generic json_array example grammar, then iteratively tweaked it.

beoberha 606 days ago

For me, it was almost random if I would get a little spiel at the beginning of my response - even on the unquantized 8b instruct. Since ollama doesn’t support grammars, I was trying to get it to work where I had a prompt that summarized an article and extracted and classified certain information that I requested. Then I had another prompt that would digest the summary and spit out a structured JSON output. It was much better than trying to do it in one prompt, but still far too random even with temperature at 0. Sometimes the first prompt misclassified things. Sometimes the second prompt would include a “here’s your structured output”.

And Claude did everything perfectly ;)

BoorishBears 606 days ago

Why not preprompt with ```json {

jkukul 606 days ago

Yes, you can pre-fill the assistant's response with "```json {" or even "{" and that should increase the likelihood of getting a proper JSON in the response, but it's still not guaranteed. It's not nearly reliable enough for a production use case, even on a bigger (8B) model.

I could recommend using ollama or VLLm inference servers. They support a `response_format="json"` parameter (by implementing grammars on top of the base model). It makes it reliable for a production use, but in my experience the quality of the response decreases slightly when a grammar is applied.

BoorishBears 606 days ago

Grammars are best but if you read their comment they're apparently using ollama in a situation that doesn't support them.

scriptsmith 606 days ago

Yes, I've used the v3.2 3B-Instruct model in a Slack app. Specifically using vLLM, with a template: https://github.com/vllm-project/vllm/blob/main/examples/tool...

Works as expected if you provide a few system prompts with context.

accrual 607 days ago

Not in production, but I've used a 3B model to test a local LLM application I'm working on. I needed a full end-to-end request/response and it's a lot faster asking a 3B model than an 8B model. I could setup a test harness and replay the responses... but this was a lot simpler.

jdthedisciple 606 days ago

If for testing then why not just mock the whole thing for ultimate performance ... ?

nkozyra 606 days ago

Probably faster to use off the shelf model with llama.cpp than to mock it

itsTyrion 597 days ago

I've tried using 3B outside of production. Asked it to be the character needed, like 30 words and use German. Instructions were consistently ignored, sometimes sentences devolved into Gibberish or English was mixed in halfway through. Don't even want to know how lobotomized 1B is.

JohnHammersley 606 days ago

> For example, they seem to not care about instructions to only write a response and no explanation, thus making it impossible to use them in a pipeline

I was doing some local tidying up of recording transcripts, using a fairly long system prompt, and I saw the same behaviour you mention if the transcript I was passing in was too long -- batching it up to make sure to be under the max length prevented this.

Might not be what's happening in your case, but I mention it because it wasn't immediately obvious to me when I first saw the behaviour.

ipsum2 606 days ago

You can't expect a 1B model to perform as well as 7B or chatGPT, probably the best use case is speculative decoding or to use to fine tune for a specific use case.

theanonymousone 606 days ago

What is "speculative decoding"?

regularfry 606 days ago

Speculative decoding is using a small model to quickly generate a sequence that every so often you pass through a larger model to check and correct. It can be much faster than just using the larger model, with tolerably close accuracy.

qeternity 606 days ago

> with tolerably close accuracy.

No, speculative decoding has exactly the same accuracy as the target model. It is mathematically identical to greedy decoding.

kgc 605 days ago

Is there a reference for this? I was wondering the same thing.

qeternity 605 days ago

Read the original whitepaper or go look at how any framework implements it.

You will see that tokens not predicted by greedy sampling of the target model are rejected. Ergo, they are mathematically identical.

nikolayasdf123 606 days ago

+1 1B and 3B models perform so poorly, it is bellow any acceptance for us. and we have fairly simple natural language understanding.

blinkingled 606 days ago

Just tried asking Llama 3.2:3b to write a YAML file with Kubernetes Deployment definition. It spit the yaml out but along with a ton of explanations. But when I followed up with below it did what I want it to do.

>>> Remove the explanation parts and only leave yaml in place from above response. apiVersion: apps/v1 kind: Deployment metadata: name: my-deployment spec: replicas: 3 ...

Alternatively this worked as well >>> Write a YAML file with kubernetes deployment object in it. Response should only contain the yaml file, no explanations. ... ions. ```yml apiVersion: apps/v1 kind: Deployment metadata: name: example-deployment spec: replicas: 3 selector: matchLabels: app: example-app template: metadata: labels: app: example-app spec: containers: - name: example-container image: nginx:latest ports: - containerPort: 80 ```

bloomingkales 606 days ago

Qwen2.5 3b is very very good.