Hacker News
Alfred-40B, an OSS RLHF version of Falcon40B (lighton.ai)
80 points by nuitblanche 1052 days ago
7 comments

Any performance benchmarks compared to other LLMs? Also, any improvement in inference speed over the original Falcon model?

We dropped most of our focus on Falcon 40B after Llama 2 70B came out; on both tokens per second and quality of results, it's not even close.

I'm assuming this will score similarly to the original 40B model, in which case Llama 2 70B would outperform it. Llama 2 70B instruct has an average score of 72.3 on the Open LLM Leaderboard.

Falcon-40B scores 63.4, or 61.5 for the non-instruction-tuned version.
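For reference, here is the gap between those leaderboard averages. This is only the arithmetic on the numbers quoted above, not a new measurement.

```python
# Open LLM Leaderboard averages as quoted in this thread (not re-measured here).
llama2_70b_instruct = 72.3
falcon_40b_scores = {
    "Falcon-40B": 63.4,
    "Falcon-40B (non-instruction-tuned)": 61.5,
}

for name, score in falcon_40b_scores.items():
    gap = llama2_70b_instruct - score
    print(f"Llama 2 70B instruct leads {name} by {gap:.1f} points")
```

It prints gaps of 8.9 and 10.8 points respectively.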

Did you fine-tune your own Llama 2? Llama2-chat is awfully gimped.
Will every comment here be a question? Will someone break the trend?
It's a good observation: there are so many unknowns about all these models. Every day a new wizard_uncensored_rhlf_alpaca_tuned_best_one_use_this_13B_4.6-bit_rqm.pth gets released, and it's almost impossible to know their relative merits and which ones are worth paying attention to.
How true. Every time I browse a model listing, there are four-word descriptions with little sense of versioning, provenance, hardware requirements, or any reason why I would choose one over another.
What kind of hardware do I need to run this sufficiently well? I.e., if I want 10 tokens/s, what specs am I looking at?
This particular model has 83.66 GB of weights, so you'll need at least 2x Nvidia 80 GB A100s unless you're loading it in 8-bit mode.
That said, there are ggml/GPTQ and other optimization techniques.
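A back-of-the-envelope check of the A100 count: weight memory scales linearly with bits per weight. This sketch assumes (the thread doesn't say) that the 83.66 GB checkpoint stores every weight at 16 bits, and it counts weights only, so treat the GPU numbers as lower bounds.

```python
import math

# Figure quoted in the thread; assumed (not stated) to be 16 bits per weight (bf16/fp16).
CHECKPOINT_GB = 83.66
CHECKPOINT_BITS = 16
A100_GB = 80

def weight_size_gb(bits_per_weight: float) -> float:
    """Approximate weight memory if every parameter were stored at bits_per_weight."""
    return CHECKPOINT_GB * bits_per_weight / CHECKPOINT_BITS

def min_a100s(bits_per_weight: float) -> int:
    """Lower bound on 80 GB A100s: counts weights only, ignoring the KV cache,
    activations, CUDA context, and any quantization metadata."""
    return math.ceil(weight_size_gb(bits_per_weight) / A100_GB)

# Implied parameter count (in billions) under the 16-bit assumption: 2 bytes per parameter.
implied_params_billion = CHECKPOINT_GB / 2

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_size_gb(bits):.1f} GB of weights, "
          f"at least {min_a100s(bits)} x 80 GB A100")
```

At 16 bits the weights alone exceed one 80 GB card, while 8-bit weights (~42 GB) fit on one with room left for the KV cache, which matches the comment above.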
Pretty much anything with 32GB (?) total RAM+VRAM:

https://github.com/cmp-nct/ggllm.cpp

But it's going to be slow without at least a small Nvidia GPU (a 2060?). CPUs are really slow at prompt ingestion, and streaming can't hide that.
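One way to read the "total RAM+VRAM" figure: a quantized model can be split so that as many layers as fit go to the GPU and the rest stay in system RAM. Below is a rough planning sketch. `plan_offload` is a hypothetical helper, not part of ggllm.cpp, and it assumes a 16-bit checkpoint, 60 equally sized layers for Falcon-40B, and a 6 GB card. None of those are stated in the thread, and embeddings, KV cache, and runtime overhead are ignored.

```python
# Hypothetical helper for planning a llama.cpp-style CPU/GPU split of a quantized model.
# Assumptions (not from the thread): the 83.66 GB checkpoint is 16 bits per weight,
# Falcon-40B has 60 equally sized transformer layers, and embeddings, KV cache,
# and runtime overhead are ignored.
CHECKPOINT_GB = 83.66
N_LAYERS = 60

def plan_offload(bits_per_weight: float, vram_gb: float, ram_gb: float):
    """Return (layers on GPU, GB of weights left in system RAM, whether that fits)."""
    total_gb = CHECKPOINT_GB * bits_per_weight / 16
    per_layer_gb = total_gb / N_LAYERS
    # Put as many whole layers as fit into VRAM; the rest stay in system RAM.
    gpu_layers = min(N_LAYERS, int(vram_gb // per_layer_gb))
    ram_needed_gb = (N_LAYERS - gpu_layers) * per_layer_gb
    return gpu_layers, ram_needed_gb, ram_needed_gb <= ram_gb

# Example: 4-bit weights, a 6 GB card (assumed), 26 GB of RAM (32 GB total).
gpu_layers, ram_needed, fits = plan_offload(4, vram_gb=6, ram_gb=26)
print(f"{gpu_layers} layers on GPU, ~{ram_needed:.1f} GB in RAM, fits: {fits}")
```

Layers left in system RAM are computed on the CPU, which is why the comment expects things to be slow without at least a small GPU.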

Doesn't this new version of Falcon need to be converted to ggml first?
I believe the architecture is the same; it's just a fine-tune, so there's nothing special to be done for this version. That said, ggml doesn't support Falcon, but I saw today that there's a fork that claims to, though I haven't tried it.
That link above is the fork ^

It uses the ggml library, just like llama.cpp does, and is indeed a fork of llama.cpp's implementation of ggml.

Right, I'm being stupid; I didn't realize that's the fork I saw earlier today. Have you tried it? IIRC the documentation mentioned a 2-bit quantization of the 40B model performing well. I've been using a 5-bit 7B Llama 2, which I'm generally happy with (because it can run on a pretty crappy machine), but I'm interested to see the differences.
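To put the 2-bit 40B versus 5-bit 7B comparison in perspective, here is a back-of-the-envelope size estimate. It uses a simplified model loosely based on ggml's legacy block formats (one 16-bit scale per 32 weights), not the actual file layout, and the parameter counts are approximations: ~41.8B is what 83.66 GB at 16 bits per weight implies, and ~6.7B is the commonly cited size of Llama 2 7B.

```python
# Rough memory footprint of block-quantized weights.
# Simplified model loosely based on ggml's legacy formats (e.g. q4_0, q5_0): each block
# of 32 weights also stores one 16-bit scale, adding 0.5 bits per weight. Real formats,
# especially the k-quants used for 2-bit, lay things out differently.
BLOCK_SIZE = 32
SCALE_BITS = 16

def quantized_gb(n_params: float, bits: int) -> float:
    """Approximate size in GB (1e9 bytes) at `bits` per weight plus per-block scale overhead."""
    effective_bits = bits + SCALE_BITS / BLOCK_SIZE
    return n_params * effective_bits / 8 / 1e9

falcon_40b_params = 41.83e9  # implied by 83.66 GB at 16 bits per weight (assumption)
llama2_7b_params = 6.74e9    # approximate parameter count (assumption)

print(f"Falcon-40B at 2-bit: ~{quantized_gb(falcon_40b_params, 2):.1f} GB")
print(f"Llama 2 7B at 5-bit: ~{quantized_gb(llama2_7b_params, 5):.1f} GB")
```

Under these assumptions, even at 2 bits the 40B model needs nearly three times the memory of the 5-bit 7B model, so it is unlikely to fit on the same low-end machine.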
Igor, is LightOn still working on optical computing? Or have you fully pivoted to GenAI without optics?
Will the team be releasing the momentum-internal dataset?
Anyone keeping track of all these LLM releases?
Kind of. There's https://github.com/Hannibal046/Awesome-LLM and you can follow the subreddit too. They're not amazing though, so I'm in the process of making my own catalogue.
So it's open source? Where is the link to the weights?

The article was mildly annoying because of the many underlined sentences and words that looked like links.

The weights are literally linked in the article. The license is Apache 2.0.

https://huggingface.co/lightonai/alfred-40b-0723/tree/main

The fact that this empty SEO blogspam bizarrely underlines loads of content (i.e. just like links) kind of obscures the single link to Hugging Face.
thanks!