Hacker News new | ask | show | jobs
by d--b 842 days ago
IMO, small LLMs are not good enough yet. I understand that people prefer running stuff on-device for privacy and cost reason, but a model that makes mistakes all the time is not worth the tradeoff.
6 comments

This one is particularly bad. I tested the Tinyllama web chat [1] and got this result:

>> What is the population density of Nebraska?

> Yes, I can provide you with the population density of Nebraska:

> As of the 2020 United States Census, the population density in Nebraska was 107.9 people per square mile (41.9/km²).

>> What is the population of Nebraska?

> Yes, I can provide you with the population of Nebraska as of the 2020 United States Census:

> As of the 2020 United States Census, the population of Nebraska was 1,835,673.

>> What is the land area of Nebraska?

> Yes, I can provide you with the land area of Nebraska as of the 2020 United States Census:

> As of the 2020 United States Census, the land area of Nebraska was 11,761 square miles (30,590 square kilometers).

>> What is 1,835,673 / 11,761?

> Yes, 1,835,673 / 11,761 is approximately 15.46.

Per Wikipedia, the 2020 Population of Nebraska is 1,961,504 (close, but not quite), land area is 76,796 sq. mi. (way off), density is about 25/sq. mi. (way off).

And according to the standard rules of arithmetic, 1835673 / 11761 = 156.08, making this almost (but not quite) one order of magnitude off, and not even the erroneous answer of 15.46 is consistent with the other erroneous figure it gave for the population density of Nebraska (107.9).

[1]: https://huggingface.co/spaces/TinyLlama/tinyllama-chat

This is anecdata but "good enough" is relative. I've finetuned TinyLlama with the same dataset and technique as Llama2 7B for on-device purposes (not for cost or privacy but for physical hardware that have to run offline and with low power consumption) and it produces higher task alignment in 1/4 the inference time. As a general purpose model it isn't great but small models have their place in the ecosystem.
Care to elaborate on the finetune? It's surprisingly very hard to come across a useful finetuning examples.
Sure, very generally we're doing PEFT starting with insights from examples very much like this one [0] and have gradually built our own tooling and customized the approach a lot as the underlying Huggingface libraries have progressed even in the last 6 months.

I will say that one of the most important parts of the process that I've found is in the prompt structuring, the use of special tokens based on how the base models were trained and customizing the tokenizer where necessary. That work in particular is not covered adequately by the examples I was able to find when I started, in my opinion.

[0] https://medium.com/@kshitiz.sahay26/fine-tuning-llama-2-for-...

Some Mistral 7B fine tunes are borderline usable, but yes it's still very marginal.
yes, i can see that, they are fun to play with though, many of the responses are interesting, and yes they will get more powerful fast, so swapping for another model will be possible and soon i will support this
we'll see how things play out after the new paper with 1.6 bits and no performance loss. This would mean being able to fit on device much bigger models
> IMO, small LLMs are not good enough yet

Sure, but the only way they're going to get there is by people iterating on them while they're still crap