Hacker News new | ask | show | jobs
by theanonymousone 607 days ago
May I ask if anyone has successfully used 1B and 3B models in production and if yes, in what use cases? I seem to be failing even in seemingly simpler tasks such as word translation or zero-shot classification. For example, they seem to not care about instructions to only write a response and no explanation, thus making it impossible to use them in a pipeline :/
11 comments

3B models are perfectly capable, I've had great luck with Phi 3.5.

> For example, they seem to not care about instructions to only write a response and no explanation

You need to use tools to force the model to adhere to a schema. Or you can learn to parse out the part of the response you want, both work.

You'll also need to make good use of robust examples in your initial prompt, and give lots of examples of how you want the output to look. (Yes this quickly burns up the limited context length!)

Finally, embrace the fact that these models are tuned for chat, so the more conversational you make the back and forth the less you are stretching the models abilities.

I wrote a very small blog post at https://meanderingthoughts.hashnode.dev/unlock-the-full-pote... explaining some of this.

I wonder if CUE can help the situation in similar fashion to the DSL methods that you've described in your blog post [1]. After all CUE fundamentals are based on feature structure from the deterministic approach of NLP unlike LLM that's stochastic NLP [2],[3]. Perhaps deterministic and non-deterministic approaches is the potent combination that can effectively help reduce much of the footprint to get to the same results and being energy efficient in the process.

[1] Cue – A language for defining, generating, and validating data:

https://news.ycombinator.com/item?id=20847943

[2] Feature structure:

https://en.m.wikipedia.org/wiki/Feature_structure

[3] The Logic of CUE:

https://cuelang.org/docs/concept/the-logic-of-cue/

On my LinkedIn post about this topic someone actually replied with a superior method of steering LLM output compared to anything else I've ever heard of, so I've decided that until I find time to implement their method, I'm not going to worry about things.

tl;dr you put into the prompt all the JSON up until what you want the LLM to say, and you set the stop token to the end token of the current JSON item (so ',' or '}' ']', whatever) and you then your code fills out the rest of the JSON syntax up until another LLM generated value is needed.

I hope that makes sense.

It is super cool, and I am pretty sure there is a way to make a generator that takes in an arbitrary JSON schema and builds a state machine to do the above.

The performance should be super fast on locally hosted models that are using context caching.

Eh I should write this up as a blog post, hope someone else implements it, and if not, just do it myself.

There are many solutions for constrained/structured generation with LLMs these days, here is a blog post my employer published about this a while back: https://monadical.com/posts/how-to-make-llms-speak-your-lang...

I'm partial to Outlines lately, but they all have various upsides and downsides.

OpenAI even natively added support for this on their platform recently: https://openai.com/index/introducing-structured-outputs-in-t...

This is a really good post. I did find one error, Instructor works well with at least one other back end (Ollama).

Outlines looks quite interesting but I wasn't able to get it to work reliably.

With mixlayer, because the round trip time to the model is so short, you can alternate between appending known tokens of the JSON output and values you want the model to generate. I think this works better than constraining the sampling in a lot of cases.

We haven’t built a state machine over JSON schema that uses this approach yet but it’s on the way.

> With mixlayer, because the round trip time to the model is so short, you can alternate between appending known tokens of the JSON output and values you want the model to generate. I think this works better than constraining the sampling in a lot of cases.

Wow, that is a much more succinct way of describing it!

> We haven’t built a state machine over JSON schema that uses this approach yet but it’s on the way.

Really this should just be a simple library in JS and Python. Schema goes in, state machine pops out.

Complications will be around optional fields, I'm not sure offhand how to solve that!

I'd love it if you checked out what we've been working on.

It's still in early stages, but might be usable for something you're trying to build. Here's an example (this buffers the entire JSON object, but you can also gen as you go): https://docs.mixlayer.com/examples/json-output

I’ve only toyed with them a bit, and had a similar experience - but did find I got better output by forcing them to adhere to a fixed grammar: https://github.com/ggerganov/llama.cpp/tree/master/grammars

For context, I was playing with a script to bulk download podcasts, transcribe with whisper, pass the transcription to llama.cpp to ID ads, then slice the ads out with ffmpeg. I started with the generic json_array example grammar, then iteratively tweaked it.

For me, it was almost random if I would get a little spiel at the beginning of my response - even on the unquantized 8b instruct. Since ollama doesn’t support grammars, I was trying to get it to work where I had a prompt that summarized an article and extracted and classified certain information that I requested. Then I had another prompt that would digest the summary and spit out a structured JSON output. It was much better than trying to do it in one prompt, but still far too random even with temperature at 0. Sometimes the first prompt misclassified things. Sometimes the second prompt would include a “here’s your structured output”.

And Claude did everything perfectly ;)

Why not preprompt with ```json {
Yes, you can pre-fill the assistant's response with "```json {" or even "{" and that should increase the likelihood of getting a proper JSON in the response, but it's still not guaranteed. It's not nearly reliable enough for a production use case, even on a bigger (8B) model.

I could recommend using ollama or VLLm inference servers. They support a `response_format="json"` parameter (by implementing grammars on top of the base model). It makes it reliable for a production use, but in my experience the quality of the response decreases slightly when a grammar is applied.

Grammars are best but if you read their comment they're apparently using ollama in a situation that doesn't support them.
Yes, I've used the v3.2 3B-Instruct model in a Slack app. Specifically using vLLM, with a template: https://github.com/vllm-project/vllm/blob/main/examples/tool...

Works as expected if you provide a few system prompts with context.

Not in production, but I've used a 3B model to test a local LLM application I'm working on. I needed a full end-to-end request/response and it's a lot faster asking a 3B model than an 8B model. I could setup a test harness and replay the responses... but this was a lot simpler.
If for testing then why not just mock the whole thing for ultimate performance ... ?
Probably faster to use off the shelf model with llama.cpp than to mock it
I've tried using 3B outside of production. Asked it to be the character needed, like 30 words and use German. Instructions were consistently ignored, sometimes sentences devolved into Gibberish or English was mixed in halfway through. Don't even want to know how lobotomized 1B is.
> For example, they seem to not care about instructions to only write a response and no explanation, thus making it impossible to use them in a pipeline

I was doing some local tidying up of recording transcripts, using a fairly long system prompt, and I saw the same behaviour you mention if the transcript I was passing in was too long -- batching it up to make sure to be under the max length prevented this.

Might not be what's happening in your case, but I mention it because it wasn't immediately obvious to me when I first saw the behaviour.

You can't expect a 1B model to perform as well as 7B or chatGPT, probably the best use case is speculative decoding or to use to fine tune for a specific use case.
What is "speculative decoding"?
Speculative decoding is using a small model to quickly generate a sequence that every so often you pass through a larger model to check and correct. It can be much faster than just using the larger model, with tolerably close accuracy.
> with tolerably close accuracy.

No, speculative decoding has exactly the same accuracy as the target model. It is mathematically identical to greedy decoding.

Is there a reference for this? I was wondering the same thing.
Read the original whitepaper or go look at how any framework implements it.

You will see that tokens not predicted by greedy sampling of the target model are rejected. Ergo, they are mathematically identical.

+1 1B and 3B models perform so poorly, it is bellow any acceptance for us. and we have fairly simple natural language understanding.
Just tried asking Llama 3.2:3b to write a YAML file with Kubernetes Deployment definition. It spit the yaml out but along with a ton of explanations. But when I followed up with below it did what I want it to do.

>>> Remove the explanation parts and only leave yaml in place from above response. apiVersion: apps/v1 kind: Deployment metadata: name: my-deployment spec: replicas: 3 ...

Alternatively this worked as well >>> Write a YAML file with kubernetes deployment object in it. Response should only contain the yaml file, no explanations. ... ions. ```yml apiVersion: apps/v1 kind: Deployment metadata: name: example-deployment spec: replicas: 3 selector: matchLabels: app: example-app template: metadata: labels: app: example-app spec: containers: - name: example-container image: nginx:latest ports: - containerPort: 80 ```

Qwen2.5 3b is very very good.