Hacker News
by andy99 902 days ago
I've been using one of the earlier checkpoints for benchmarking a Llama implementation. Completely anecdotally I feel at least as good or better about this one than the earlier openllama 3B. I wouldn't use either of them for RAG or anything requiring more power, just to say that it's competitive as a smaller model, whatever you use those for, and easy to run on CPU at FP16 (meaning without serious quantization).

Also, I should promote the code I wrote for running this. It runs models in ggml format. The one I made available is an older checkpoint, though it's easy to convert the newer one. It's in Fortran, but it should be easy to get gfortran if you don't have it installed.

https://github.com/rbitr/llm.f90/tree/optimize16/purefortran

Here is another inference implementation in Python (only dependency is PyTorch).

https://github.com/99991/SimpleTinyLlama

The new checkpoints did not seem much better and they changed the chat format for some reason, so I did not port the new checkpoints yet. Perhaps I'll get to it this weekend.

Man I didn't recognize your username but once you said Fortran I recognized you immediately. You are an inspiration, an example of true software engineering vis a vis what in day to day becomes what you can hire for.

Edit: you have some rare knowledge, I'm curious if you have any thoughts on small models good enough for RAG. Mistral 7B is in my testing but it's laughably slow, and 7B is just too much for mobile: both iOS and Android get crashy (4 tkns/s on Pixel Fold, similar on iOS). Similar problems on web from a good-enough 2-year-old i7.
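For a rough sense of why 7B strains a phone, here is back-of-the-envelope arithmetic for weight memory only. It is a sketch under assumptions: `weight_gib` is a made-up helper, the parameter counts are the nominal round numbers from the model names, and it ignores the KV cache, activations, and any per-block overhead that real quantization formats add.

```python
def weight_gib(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

# Nominal sizes only (assumption): real checkpoints differ slightly.
for name, n_params in [("3B", 3e9), ("7B", 7e9), ("13B", 13e9)]:
    fp16 = weight_gib(n_params, 16)
    q4 = weight_gib(n_params, 4)
    print(f"{name}: fp16 ~ {fp16:.1f} GiB, 4-bit ~ {q4:.1f} GiB")
```

Actual usage is higher than these figures because the runtime also has to hold the context and intermediate buffers.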

I'd try Phi-2 but I want to charge for my app and the non-commercial usage license bars that. (all these hours building ain't free! And I can't responsibly give search away, scraping locally is too risky for the user, and the free search API I know of has laudable goals, but ultimately, is "trust me bro" as far as privacy goes)

I'm starting to think we might not get an open, RAG-capable model sub 7B without a concerted open source effort. Stability's distracted and spread thin, MS is all in on AI PCs(tm), and it's too commercially valuable for the big boys to give away.

So much changed in a day. What a field!
What use cases would you say it is good enough for?
That's the billion dollar question. These are all research models, the point was to see what happens when you keep training a smaller model.

My best guess (and if I had a concrete answer I'd be out building it) is that, absent a breakthrough, smaller models will be mostly for downstream tasks, like classifiers, that aren't generative. Or fine-tuned into specialized generative models that only know one domain. I don't know how well this works for real use cases, but certainly much smaller models can generate Shakespeare-like text, for example. I don't actually know why you'd do that, though.

>I wouldn't use either of them for RAG

What's RAG?

Since the models have a limited context size, you pre-process a bunch of data that might be related to the task (documentation, say) and generate a semantic vector for each piece. Then when you ask a question, you look up just the few pieces that are semantically most similar and load them into the context along with the question. Then the LLM can generate a new answer with the most relevant pieces of data.
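A minimal sketch of that flow, in case it helps. Big caveat: `embed` here is a toy bag-of-words counter standing in for a real semantic embedding model, and `retrieve`, `build_prompt`, and the sample documents are all made up for illustration. The final prompt would go to whatever LLM you use; that call is left out.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for a semantic embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(question, index, k=2):
    """Return the k indexed passages most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, passages):
    """Load the retrieved passages into the context along with the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Pre-processing step: one vector per piece of (hypothetical) documentation.
docs = [
    "To reset your password, open Settings and choose Reset Password.",
    "Invoices are emailed on the first day of each month.",
    "The mobile app supports offline mode for saved articles.",
]
index = [(d, embed(d)) for d in docs]

# Query step: look up the closest piece and build the prompt for the LLM.
question = "How do I reset my password?"
passages = retrieve(question, index, k=1)
prompt = build_prompt(question, passages)
print(prompt)
```

With a real embedding model the lookup matches on meaning rather than shared words, but the shape of the pipeline is the same: index once, retrieve per question, then generate.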
Retrieval augmented generation, basically giving it some text passage and asking questions about the text.
If you want more on RAG with a concrete example: https://neuml.hashnode.dev/build-rag-pipelines-with-txtai
What is good for RAG?
The smallest model your users agree meets their needs. It really depends.

The retrieval part is way more important.

I've used the original 13B instruction-tuned llama2, quantized, and found it gives coherent answers about the context provided, i.e. the bottleneck was mostly getting good context.

When I played with long context models (like 16k tokens, and this was a few months ago, maybe they improved) they sucked.

>The retrieval part is way more important.

I don't agree with this - at Intercom we've put a lot of work into our Fin chatbot, which uses a RAG architecture, and we're still using GPT-4 for the generation part.

GPT-4 is a really powerful and expensive model but we find we need this power to 1) reduce hallucinations acceptably, and 2) keep the quality of inferences made using the retrieved text high.

Now, our bot is answering customer support questions unsupervised - maybe it'd be different for a human in the loop system - but at least in our case, we feel we need a very powerful generation model to reduce errors, even after having benchmarked this thoroughly.

We've also done work on the retrieval end of things, including a customised model, but found the generation side is where we need the most capable models.

That's interesting, thanks. My experience is with technical documentation Q&A, returning summaries and relevant passages. My takeaway was that the summary is basically as good as the passages. I do think overall response quality is very subjective and really depends on how it's being used, so whatever users do best with wins the day.