Hacker News
Open-Llama: Complete training pipeline for building large language models (github.com)
141 points by bayes-song 1139 days ago
4 comments

Check out this model trained using the Open-Llama project at http://home.ustc.edu.cn/~sl9292 . This model is trained primarily on English and Chinese, but also has capabilities in other languages like Japanese and Korean. Now, let's dive into Open-Llama. It's a truly open-source project for pre-training and instruct-tuning AI models. One of the key features of this project is its support for a wide range of model sizes, from 7B to 65B parameters. What sets Open-Llama apart is the incorporation of performance acceleration via xformers from Llama, enabling 95% of the original Llama speed on the 65B models. In fact, for the 7B models, Open-Llama's performance surpasses the original Llama. By providing full access to the codebase, we believe that Open-Llama will contribute greatly to the advancement of open-source AI technologies. We invite developers and researchers to join us on this exciting journey!
Namespace collisions are inevitable, especially w/ how fast-moving the LLM space is right now. Just wanted to point out that besides this "Open-Llama" project (which looks really interesting and is well documented in the GitHub repo), there is also another group training "OpenLLaMA": https://github.com/openlm-research/open_llama (which looks like an effort by two Berkeley PhD students, https://www.haoliu.site/ and http://young-geng.xyz/, to reproduce LLaMA using the 1.2T token Together RedPajama dataset. They've released up to a 300B-token checkpoint so far.)

Feedback for /u/bayes-song - it'd be great to have more info on the model card on HF. Right now the parameter count is unclear, as is the # of total tokens you're planning on training on and how many you've trained on so far. An Evaluation section (maybe using lm-evaluation-harness) might be good as well?

To add to that, I believe the title of this submission ("Open-Llama: A “real” open-source project to train LLM not just checkpoints") is a reference to the project you link, since they did not (to my knowledge) release the training code or detailed instructions to reproduce their experiment precisely, only checkpoints.
Thank you for your suggestion. This will indeed be more intuitive, and I will add the relevant results as soon as possible.
Feels like it's still a wait-and-see area as the space shakes out. It would be great to be able to run our own models for applications in the near future, but the amount of hardware needed to deliver service to a significant audience is pretty crazy. Right now I don't see any way but to re-bill the cost with a markup to end customers, unless you have a giant pile of VC money that you can light on fire.
I found running the model on rented hardware much more expensive than ChatGPT. Might work ok for local sunk cost hardware for those who game and don’t crypto mine.
WebGPU would be a way to shift that cost back to each device. RedPajama 3B could become useful for some tasks and run quite fast on most hardware available. Then as users have better computers, they can get access to better models?
You'd have to have each user download a 3GB payload first. For comparison, that's a good few hours of netflix at 1080p.
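For a rough sense of where a figure like 3GB comes from, here is a back-of-the-envelope sketch. It assumes download size is roughly parameter count times bytes per parameter, and ignores tokenizer files, metadata, and compression. The precision table and the function name are my own illustration, not anything from RedPajama. Under that assumption, 3GB for a 3B model corresponds to about one byte per parameter (8-bit weights).

```python
# Rough download-size estimate for a model's weights.
# Assumption: size ~= parameter count x bytes per parameter
# (ignores tokenizer files, metadata, and any compression).

# Hypothetical precision table, for illustration only.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_size_gb(num_params: float, precision: str) -> float:
    """Approximate weight size in GB, taking 1 GB = 1e9 bytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        size = weights_size_gb(3e9, precision)
        print(f"3B params @ {precision}: {size:.1f} GB")
```

So a more aggressive quantization would shrink the payload, at whatever quality cost comes with it.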
Browsers and operating systems will eventually include them.

But for now, you would need a good privacy reason to go this route.

Sorry for the newbie question. What's the advantage of retraining a model versus using an already trained model through an API (OpenAI)? I understand the economic principle of "make or buy", but is there anything else?
This is a good question. Personally, I think most companies are better off using OpenAI's API directly, which is currently the best in terms of ease of use and general-domain ability. I think only a few groups need to train models themselves: 1. Researchers at the cutting edge who have sufficient funds. 2. Large companies that make a very large number of calls or have strict confidentiality requirements. 3. Applications in very specific fields, such as chemical molecules, which may require more domain-specific pretraining.
The problem is that summarizing a 100k-token text costs $2 using Davinci.
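The arithmetic behind that number, as a minimal sketch. It assumes Davinci is billed at $0.02 per 1,000 tokens (my assumption, the comment doesn't state the rate) and counts only the input text, not the tokens in the generated summary. The constant and function names are made up for illustration.

```python
# Back-of-the-envelope API cost for processing a long text.
# Assumption: a flat rate of $0.02 per 1,000 tokens; the rate is not
# stated in the thread and only input tokens are counted here.
PRICE_PER_1K_TOKENS_USD = 0.02


def api_cost_usd(num_tokens: int,
                 price_per_1k: float = PRICE_PER_1K_TOKENS_USD) -> float:
    """Cost in USD of sending num_tokens through the API at a flat rate."""
    return num_tokens / 1000 * price_per_1k


if __name__ == "__main__":
    print(f"100k tokens: ${api_cost_usd(100_000):.2f}")
```

At that rate the cost scales linearly, so summarizing many long documents adds up quickly.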