| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by brucethemoose2 929 days ago

In other llm news, Mistral/Yi finetunes trained with a new (still undocumented) technique called "neural alignment" are blasting other models in the HF leaderboard. The 7B is "beating" most 70Bs. The 34B in testing seems... Very good:

https://huggingface.co/fblgit/una-xaberius-34b-v1beta

https://huggingface.co/fblgit/una-cybertron-7b-v2-bf16

I mention this because it could theoretically be applied to Mistral Moe. If the uplift is the same as regular Mistral 7B, and Mistral Moe is good, the end result is a scary good model.

This might be an inflection point where desktop-runnable OSS is really breathing down GPT-4's neck.

9 comments

eurekin 929 days ago

I just played with 7b version. It really feels different than anything I tried before. It could explain a docker compose file. It generated a simple vue application component.

I asked around a bit about the example and it was strangely coherent and focused across the whole conversation. It was really well detecting, where I'm starting a new thread (without clearing a context) or referring to things before.

It caught me off guard as well with this:

> me: What does following mean [content of the docker compose]

> cybertron-7b: In the provided YAML configuration, "following" refers to specifying dependencies

I've never seen any model using my exact wording in quotes in conversation like that.

mark_l_watson 929 days ago

How did you run it? Are there model files in Ollama format? Are you running on NVidia or Apple Silicon?

EDIT: just saw this “ Megatron (1, 2, and 3) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.”

brucethemoose2 929 days ago

My recommendation is:

- Exui with exl2 files on good GPUs.

- Koboldcpp with gguf files for small GPUs and Apple silicon.

There are many reasons, but in a nutshell they are the fastest and most VRAM efficient.

I can fit 34Bs with about 75K context on a single 24GB 3090 before the quality drop from quantization really starts to get dramatic.

mark_l_watson 929 days ago

Thanks! I will check out Koboldcpp.

eurekin 929 days ago

In the textgeneration web ui on NVidia gpu

whimsicalism 928 days ago

your edit is entirely unrelated to this topic

brucethemoose2 929 days ago

Yeah, the Yi version is quite something too.

nikvdp 929 days ago

This piqued my interest so I made an ollama modelfile of it for the smallest variant (from TheBloke's GGUF [1] version). It does indeed seem impressively gpt4-ish for such a small model! Feels more coherent than openhermes2.5-mistral which was my previous goto local llm.

If you have ollama installed you can try it out with `ollama run nollama/una-cybertron-7b-v2`.

[1]: https://huggingface.co/TheBloke/una-cybertron-7B-v2-GGUF

fblgit 929 days ago

Correct. UNA can align the MoE at multiple layers, experts, nearly any part of the neural network I would say. Xaberius 34B v1 "BETA".. is the king, and its just that.. the beta. I'll be focusing on the Mixtral, its a christmas gift.. modular in that way, thanks for the lab @mistral!

brucethemoose2 929 days ago

Do a Yi 200K version as well! That would make my Christmas, as Mistral Moe is only maybe 32K.

inciampati 926 days ago

Do you have any docs describing the method?

stavros 929 days ago

Aren't LLM benchmarks at best irrelevant, at worst lying, at this point?

sbierwagen 929 days ago

If you don't like machine evaluations, you can take a look at the lmsys chatbot arena. You give a prompt, two chatbots answer anonymously, and you pick which answer is better: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

On the human ratings, three different 7B LLMs (Two different Openchat models and a Mistral fine tune) beat a version of GPT-3.5.

(The top 9 chatbots are GPT and Claude versions. Tenth place is a 70B model. While it's great that there's so much interest in 7B models, and it's incredible that people are pushing them so far, I selfishly wish more effort would go into 13B models... since those are the biggest that my macbook can run.)

behnamoh 929 days ago

I think the current approach — train 7b models and then do MoE on them — is the future. It’ll still be only runnable on high end customer devices. As for 13b + MoE, I don’t think any customer device could handle that in the next couple years.

sbierwagen 929 days ago

My years-old M1 macbook with 16GB of ram runs them just fine. Several Geforce 40-series cards have at least 16GB of vram. Macbook pros go up to 128GB of ram and the mac studio goes up to 192GB. Running regular CPU inference on lots of system ram is cheap-ish and not intolerably slow.

These aren't totally common configurations, but they're not totally out of reach like buying an H100 for personal use.

behnamoh 929 days ago

1. I wouldn't consider Mac Studio ($7,000) a customer product.

2. Yes, and my MBP M1 Pro can run quantized 34b models. My point was that when you do MoE, memory requirements suddenly become too challenging. A 7b Q8 is roughly 7GB (7b parameters × 8 bits each). But 8x of that would be 56GB, and all of that must be in memory to run.

jart 928 days ago

Why? $7k for a Mac Studio isn't much if we consider the original IBM PC adjusted for inflation cost $6k.

stolsvik 929 days ago

I have no formal credentials to say this, but intuitively I feel this is obviously wrong. You couldn’t have taken 50 rats brains and “mixed” them and expected the result to produce new science.

For some uninteresting regurgitation, sure. But size - width and depth - seems like an important piece for ability to extract deep understanding of the universe.

Also, MoE, as I understand it, will inherently not be able to glean insight into, nor reason about, and certainly not be able to come up with novel understanding, for cross-expert areas.

I believe size matters, a lot.

brucethemoose2 928 days ago

The MOE models are essentially trained as a single model. Its not 7 independent models, individually (AFAIK) they are all totally useless without each other.

Its just that each bit picks up different "parts" of the training more strongly, which can be selectively picked at runtime. This is actually kinda analogous to animals, which dont fire every single neuron so frequently like monolithic models do.

The tradeoff, at equivalent quality, is essentially increased VRAM usage for faster, more splittable inference and training, though the exact balance of this tradeoff is an excellent question.

whimsicalism 928 days ago

Pretty sure openchat-3.5 is a mistral fine tune as well.

The trick is not this neural alignment - it is training on many, many more tokens than Chinchilla recommends.

puttycat 929 days ago

I wonder how it will rank on benchmarks which are password-protected to prevent test contamination, for example: https://github.com/taucompling/bliss

brucethemoose2 929 days ago

Yes, absolutely. I was just preaching this.

But its not totally irrelevant. They are still a datapoint to consider with some performance correlation. YMMV, but these models actually seem to be quite good for the size in my initial testing.

typon 929 days ago

Yes. The only thing that is relevant is a hidden benchmark that's never released and run by a trusted third party.

nabakin 929 days ago

More or less. The automated benchmarks themselves can be useful when you weed out the models which are overfitting to them.

Although, anyone claiming a 7b LLM is better than a well trained 70b LLM like Llama 2 70b chat for the general case, doesn't know what they are talking about.

In the future will it be possible? Absolutely, but today we have no architecture or training methodology which would allow it to be possible.

You can rank models yourself with a private automated benchmark which models don't have a chance to overfit to or with a good human evaluation study.

Edit: also, I guess OP is talking about Mistral finetunes (ones overfitting to the benchmarks) beating out 70b models on the leaderboard because Mistral 7b is lower than Llama 2 70b chat.

airgapstopgap 929 days ago

> today we have no architecture or training methodology which would allow it to be possible.

We clearly see that Mistral-7B is in some important, representative respects (eg coding) superior to Falcon-180B, and superior across the board to stuff like OPT-175B or Bloom-175B.

"Well trained" is relative. Models are, overwhelmingly, functions of their data, not just scale and architecture. Better data allows for yet-unknown performance jumps, and data curation techniques are a closely-guarded secret. I have no doubt that a 7B beating our best 60-70Bs is possible already, eg using something like Phi methods for data and more powerful architectures like some variation of universal transformer.

nabakin 929 days ago

I mean, I 100% agree size is not everything. You can have a model which is massive but not trained well so it actually performs worse than a smaller, better/more efficiently trained model. That's why we use Llama 2 70b over Falcon-180b, OPT-175b, and Bloom-175b.

I don't know how Mistral performs on codegen specifically, but models which are finetuned for a specific use case can definitely punch above their weight class. As I stated, I'm just talking about the general case.

But so far we don't know of a 7b model (there could be a private one we don't know about) which is able to beat a modern 70b model such as Llama 2 70b. Could one have been created which is able to do that but we simply don't know about it? Yes. Could we apply Phi's technique to 7b models and be able to reach Llama 2 70b levels of performance? Maybe, but I'll believe it when we have a 7b model based on it and a human evaluation study to confirm. It's been months now since the Phi study came out and I haven't heard about any new 7b model being built on it. If it really was such a breakthrough to allow 10x parameter reduction and 100x dataset reduction, it would be dumb for these companies to not pursue it.

brucethemoose2 929 days ago

I'm not saying its better than 70B, just that its very strong from what others are saying.

Actually I am testing the 34B myself (not the 7B), and it seems good.

fblgit 929 days ago

UNA: Uniform Neural Alignment. Haven't u noticed yet? Each model that I uniform, behaves like a pre-trained.. and you likely can fine-tune it again without damaging it.

If you chatted with them, you know .. that strange sensation, you know what is it.. Intelligence. Xaberius-34B is the highest performer of the board, and is NOT contaminated.

valine 929 days ago

How much data do you need for UNA? Is a typical fine tuning dataset needed or can you get away with less than that?

brucethemoose2 929 days ago

In addition to what was said, if its anything like DPO you don't need a lot of data, just a good set. For instance, DPO requires "good" and "bad" responses for each given prompt.

fblgit 929 days ago

doesn't require much data, in a 7B can take a couple hours ~

nabakin 929 days ago

> I'm not saying its better than 70B, just that its very strong from what others are saying.

Gotcha

> Actually I am testing the 34B myself (not the 7B), and it seems good.

I've heard good things about it

elcomet 929 days ago

We can only compare specific training procedures though.

With a 7b and a 70b trained the same way, the 70b should always be better

nabakin 929 days ago

Makes sense

mistrial9 929 days ago

quick to assert authoritative opinion - yet the one word "better" belies the message ? Certainly there is are more dimensions worth including in the rating?

nabakin 929 days ago

Certainly, there may be aspects of a particular 7b model which could beat another particular 70b model and greater detail into different pros and cons of different models are worth considering but people are trying to rank models and if we're ranking (calling one "better" than another), we might as well do it as accurately as we can since it can be so subjective.

I see too many misleading "NEW 7B MODEL BEATS GPT-4" posts. People test those models a couple of times, come back to the comments section, declare it true, and onlookers know no better than to believe it and in my opinion has led to many people claiming 7b models have gotten as good as Llama 2 70b or GPT-4 when it is not the case when you account for the overfit being exhibited by these models and actually put them to the test via human evaluation.

screye 929 days ago

Yeah, and Mistral doesn't particularly care about lobotomizing the model with 'safety-training'. So it can achieve much better performance per-parameter than anthropic/google/OpenAI while being more steerable as well.

behnamoh 929 days ago

until Mistral gets too big for lawyers to ignore.

_boffin_ 929 days ago

Interesting. One thing i noticed is that Mistral has a `max_position_embeddings` of ~32k while these have it at 4096.

Any thoughts on that?

brucethemoose2 929 days ago

Is complicated.

The 7B model (cybertron) is trained on Mistral. Mistral is technically a 32K model, but it uses a sliding window beyond 32K, and for all practical purposes in current implementations it behaves like an 8K model.

The 34B model is based on Yi 34B, which is inexplicably marked as a 4K model in the config but actually works out to 32K if you literally just edit that line. Yi also has a 200K base model... and I have no idea why they didn't just train on that. You don't need to finetune at long context to preserve its long context ability.

ComputerGuru 929 days ago

Did you mean "but it uses a sliding window beyond" *8K*? Because I don't understand how the sentence would work otherwise.

brucethemoose2 929 days ago

Yeah exactly, sorry.

whimsicalism 928 days ago

DPO is pretty good as well.

I think that the '7b beating 70b' is mostly due to the fact that Mistral is likely trained on considerably more tokens than Chinchilla optimal. So is llama-70b, but not to the same degree.

3abiton 929 days ago

HF leaderboards are rarely reflective of real world performance especially in small variations, but nonetheless, this is promising. What are the HW requirements for this latest Mistral7B?

eyegor 929 days ago

Any 7b can run well (~50 tok/s) on an 8gb gpu if you tune the context size. 13b can sometimes run well but typically you'll end up with a tiny context window or slow inference. For cpu, I wouldn't recommend going above 1.3b unless you don't mind waiting around.

aarushsah 929 days ago

How so? I'm only getting 12 t/s using Mistral in LM Studio.

eyegor 929 days ago

The lazy way is to use text-generation-webui, use an exllamav2 conversion of your model, and turn down context length until it fits (and tick the 8 bit cache option). If you go over your vram it will cut your speed substantially. Like 60/s down to 15/s for an extra 500 context length over what fits. Similar idea applies to any other backends, but you need to shove all the layers into vram if you want decent tok/s. To give you a starting point, typically for 7b models I have to use 4k-6k context length and I use 4-6 bit quantizations for an 8gb gpu. So start at 4 bit, 4k context and adjust up as you can.

You can find most popular models converted for you on huggingface.co if you add exl2 to your search and start with the 4 bit quantized version. Don't bother going above 6 bits even if you have spare vram, practically it doesn't offer much benefit.

For reference I max out around 60 tok/s at 4bit, 50 tok/s at 5bit, 40 at 6bit for some random 7b parameter model on a rtx 2070.

aarushsah 928 days ago

Just tried it - doesn't seem to be working. In fact, I'm getting 1.4 t/s with a Quadro P4000 (8 GB) running a 7B at 3 bits per weight. Are you changing anything other than the 8 bit cache and context?

For reference, I'm getting 10 t/s with a Q5_K_M Mistral GGUF model.

brucethemoose2 929 days ago

> What are the HW requirements for this latest Mistral7B

Pretty much anything with ~6-8GB of memory that's not super old.

It will run on my 6GB laptop RTX 2060 extremely quickly. It will run on my IGP or Phone with MLC-LLM. It will run fast on a laptop with a small GPU, with the rest offloaded to CPU.

Small, CPU only servers are kinda the only questionable thing. It runs, just not very fast, especially with long prompts (which are particularly hard for CPUs). There's also not a lot of support for AI ASICs.

swyx 929 days ago

what is neural alignment? who came up with it?

brucethemoose2 929 days ago

@fblgit apparently, from earlier in this thread.