Hacker News
by ftxbro 1180 days ago
His estimate is that you could train a LLaMA-7B scale model for around $82,432 and then fine-tune it for a total of less than $85K. But when I saw the fine tuned LLaMA-like models they were worse in my opinion even than GPT-3. They were like GPT-2.5 or like that. Not nearly as good as ChatGPT 3.5 and certainly not ChatGPT-beating. Of course, far enough in the future you could certainly run one in the browser for $85K or much less, like even $1 if you go far enough into the future.
3 comments

Yeah, you're right. I wrote this a couple of weeks ago at the height of LLaMA hype, but with further experience I don't think the GPT-3 comparisons hold weight.

My biggest problem: I haven't managed to get a great summarization out of a LLaMA derivative that runs on my laptop yet. Maybe I haven't tried the right model or the right prompt yet though, but that feels essential to me for a bunch of different applications.

I still think a LLaMA/Alpaca fine-tuned for the ReAct pattern that can execute additional tools would be a VERY interesting thing to explore.

[ ReAct: https://til.simonwillison.net/llms/python-react-pattern ]
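For anyone who hasn't seen the ReAct pattern, here is a minimal sketch of the loop. It is not the code from the linked post: `fake_model`, the `calculate` tool, and the `Action:` / `Observation:` / `Answer:` line format are all assumptions chosen for illustration, and the fake model would need to be swapped for a real LLaMA/Alpaca call.

```python
import re

def calculate(expression: str) -> str:
    """Hypothetical tool: evaluate simple arithmetic only."""
    # Allow nothing but digits, whitespace and basic operators before eval().
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expression):
        return "error: unsupported expression"
    try:
        return str(eval(expression))  # acceptable for a demo, not for production
    except Exception as exc:
        return f"error: {exc}"

TOOLS = {"calculate": calculate}
ACTION_RE = re.compile(r"^Action: (\w+): (.*)$", re.MULTILINE)

def fake_model(transcript: str) -> str:
    """Stand-in for a real LLM call (assumption: replace with your model)."""
    if "Observation:" not in transcript:
        return "Thought: I should compute this.\nAction: calculate: 4 * 7 + 2"
    observation = transcript.rsplit("Observation: ", 1)[1].strip()
    return f"Answer: {observation}"

def react(question: str, model=fake_model, max_turns: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_turns):
        reply = model(transcript)
        transcript += "\n" + reply
        match = ACTION_RE.search(reply)
        if match is None:
            # No tool requested, so treat the reply as the final answer.
            return reply.removeprefix("Answer: ").strip()
        name, argument = match.groups()
        tool = TOOLS.get(name)
        result = tool(argument) if tool else f"error: unknown tool {name}"
        # Feed the tool result back so the next model call can use it.
        transcript += f"\nObservation: {result}"
    return "error: gave up after max_turns"

print(react("What is 4 * 7 + 2?"))
```

The part that a fine-tune would have to get right is emitting the `Action:` line in exactly the format the regex expects, which is the same structured-output problem discussed elsewhere in this thread.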

my biggest problem with these models is that they cannot reliably produce structured data.

even davinci can be used as part of a chain, because you can direct it to structure and unstructure data, and then extract the single components and build them into tasks. cohere, llama et al are currently struggling to produce these results reliably, even if you can chat with them, and frankly it's not about the chat

example from a stack overflow that split the questions before sending it down chain for answering all points individually:

This is a customer question:

I'm a beginner RoR programmer who's planning to deploy my app using Heroku. Word from my other advisor friends says that Heroku is really easy, good to use. The only problem is that I still have no idea what Heroku does...

I've looked at their website and in a nutshell, what Heroku does is help with scaling but... why does that even matter? How does Heroku help with:

    Speed - My research implied that deploying AWS on the US East Coast would be the fastest if I am targeting a US/Asia-based audience.

    Security - How secure are they?

    Scaling - How does it actually work?

    Cost efficiency - There's something like a dyno that makes it easy to scale.

    How do they fare against their competitors? For example, Engine Yard and bluebox?
Please use layman English terms to explain... I'm a beginner programmer.

Extract the scenario from the question including a summary of every detail, list every question, in JSON:

{ "scenario": "A beginner RoR programmer is planning to deploy their app using Heroku and is seeking advice about deploying it.", "questions": [ "What does Heroku do?", "How does deploying AWS on the US East Coast help with speed?", "How secure is Heroku?", "How does scaling with Heroku work?", "What is a dyno and why is it cost efficient?", "How does Heroku compare to its competitors, such as Engine Yard and Bluebox?" ] }
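To make the "part of a chain" point concrete, here is a rough sketch of what the next link could do with a reply like the one above: parse it, reject anything that isn't the expected shape, and fan the questions out one at a time. The function name `parse_extraction` and the shortened `reply` string are my own; only the `scenario` / `questions` keys come from the example.

```python
import json

def parse_extraction(raw: str) -> dict:
    """Parse a model reply and check it has the expected shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    scenario = data.get("scenario")
    questions = data.get("questions")
    if not isinstance(scenario, str):
        raise ValueError("'scenario' must be a string")
    if not isinstance(questions, list) or not all(
        isinstance(q, str) for q in questions
    ):
        raise ValueError("'questions' must be a list of strings")
    return data

# Shortened version of the reply shown above.
reply = (
    '{ "scenario": "A beginner RoR programmer is planning to deploy their app '
    'using Heroku and is seeking advice about deploying it.", '
    '"questions": [ "What does Heroku do?", "How secure is Heroku?" ] }'
)

extracted = parse_extraction(reply)
for question in extracted["questions"]:
    # Each question would go down the chain with the shared scenario attached.
    print(f"{extracted['scenario']} -> {question}")
```

A model that wraps the JSON in chatty text or drops a closing bracket fails at `json.loads`, which is why reliability matters more here than chat quality.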

Last weekend I built some tooling that you can integrate with huggingface transformers to force a given model to _only_ output content that validates against a JSON schema [1].

The challenge is that for it to work cost effectively you need to be able to append what is basically a final network layer to the model that is algorithmically designed, and until OpenAI exposes the full logits and/or some way to modify them on the fly you're going to be stuck with open source models. I've run things against GPT-2 mostly, but it's on my list to try LLaMA.

[1] "Structural Alignment: Modifying Transformers (like GPT) to Follow a JSON Schema" @ https://github.com/newhouseb/clownfish
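For readers wondering what "modifying the logits" looks like in general, here is a toy sketch of the idea. It is not clownfish's code or API. The vocabulary, the made-up scores in `fake_logits`, and the use of a fixed list of allowed outputs in place of a real JSON schema check are all assumptions to keep it self-contained.

```python
import math

VOCAB = ["yes", "no", "maybe", "t", "rue", "f", "alse", "<eos>"]
ALLOWED_OUTPUTS = ["true", "false"]  # stand-in for "validates against the schema"

def fake_logits(prefix: str) -> list[float]:
    """Stand-in for a model forward pass; the scores are invented."""
    scores = {"yes": 5.0, "no": 4.0, "maybe": 3.0, "t": 1.0, "rue": 0.5,
              "f": 2.0, "alse": 0.4, "<eos>": 0.1}
    return [scores[token] for token in VOCAB]

def is_allowed(prefix: str, token: str) -> bool:
    """A token is allowed if the output can still end up valid after it."""
    if token == "<eos>":
        return prefix in ALLOWED_OUTPUTS
    candidate = prefix + token
    return any(output.startswith(candidate) for output in ALLOWED_OUTPUTS)

def constrained_decode(max_steps: int = 10) -> str:
    prefix = ""
    for _ in range(max_steps):
        logits = fake_logits(prefix)
        # Mask every token that would make the output invalid.
        masked = [score if is_allowed(prefix, token) else -math.inf
                  for token, score in zip(VOCAB, logits)]
        if max(masked) == -math.inf:
            raise RuntimeError("no valid continuation")
        best = VOCAB[masked.index(max(masked))]  # greedy pick among valid tokens
        if best == "<eos>":
            return prefix
        prefix += best
    raise RuntimeError("no complete output within max_steps")

print(constrained_decode())
```

Left alone, this fake model would greedily emit "yes"; with the mask it can only ever produce "true" or "false". The mask needs the full logit vector at every step, which is why an API that only returns sampled text can't support it.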

This feels solvable to me. I wonder if you could use fine tuning against LLaMA to teach it to do this better?

GPT-3 etc can only do this because they had a LOT of code included in their training sets.

The LLaMA paper says Github was 4.5% of the training corpus, so maybe it does have that stuff baked in and just needs extra tuning or different prompts to tap into that knowledge.

I have done it through stages: the first stage emits natural language in the format of "context: ... and question: ....", and then the second stage collects it as json, but then the wait time doubles.
Have you tried bigger models? Llama-65B can indeed compete with GPT-3 according to various benchmarks. The next thing would be to get the fine-tuning as good as OpenAI's.
I wonder how accurate those benchmarks are in terms of actual problem solving capability. I think there's a major line at which an LLM becomes actually useful, where it feels like you are speaking to something intelligent that can help you in terms of productivity etc.
They aren't at all. They are synthetic benchmarks that carry little resemblance to real world experience.
Yeah, the constant barrage of "THIS IS AS GOOD AS CHATGPT AND IS PRIVATE" screeds from LLaMA-based marketing projects is getting ridiculous. They're not even remotely close to the same quality. And why would they be?

I want the best LLMs to be open source too, but I'm not delusional enough to make insane claims like the hundreds of GitHub forks out there.

> I want the best LLMs to be open source too

How do you do this without being incredibly wealthy?

You (1) are a company who (2) understands the business domain and has an appropriate business plan.

Sadly the reality of funding today makes it unlikely that these two will both be simultaneously satisfied. The problem is that history will look back on the necessary business plan and deem it a failure even if it generates a company that does a billion dollars plus in annual revenue.

This is actually not unique to large language models but most innovation around computers. The basic problem is that if you build a force-multiplier (spreadsheets, personal computing, large-language models all come to mind) then what will make it succeed is its versatility: people want a hammer that can be used for smashing all manner of things, not just your company's particular brand of matching nails. And most people will only pick up that hammer once per week or once per month, only like 1% of the economy if that will be totally revolutionized, "we use this force-multiplier every day, it is now indispensable, we can't imagine life without it," and it's never predictable what that sector will be -- it's going to be like "oh, who ever dreamed that the killer application for LLMs would be them replacing AutoCAD at mechanical contractors" or some shit.

In those strange eons, to wildly succeed, one must give up on anticipating all usages of the software, one must cease controlling it and set it free. "Well where's the profit in that?" -- it is that this company was one of the first players in the overall market, they got an early chance to stake out as much territory as possible. But the market exploded way larger than they could handle and then everybody looks back on them and says "wow, what a failure, they only captured 1% of that market, they could have been so much more successful." Yeah, they captured 1% of a $100B market, some failure, right?

But what actually happens is that companies see the potential, investors get dollar signs in their eyes, everyone starts to lock down and control these, "you may use large language models but only in the ways that we say, through the interfaces which we provide," and then the only thing that you can use it for is to get generic conversational advice about your hemorrhoids, so after 5-10 years the bubble of excitement fizzles out. Nobody ever dreams to apply it to AutoCAD or whatever, and the world remains unchanged.

History is littered with great software that died because no-one used it because the business model was terrible. Capturing $1B of value is better than 0, and everyone understands this. And who cares what history thinks anyway?

OpenAI has spent a lot of money to get their result. It's safe to assume it will take a lot of money to get a similar result, and then to share it (although I assume bit torrent will be good enough). Once people are running their models, they can innovate to their hearts content. It's not clear how or why they'd give money back to the enabling technology. So how does money flow back to the innovators in proportion to the value produced, if not a SaaS?

If those are all that's required, why don't you start a company with a business plan written so it satisfies your criteria? Then you can lead the way with OSS LLMs.
what stage of capitalism is this
Yes a rugged individual would have to be incredibly wealthy to do it!

But maybe the governments will make one and maintain it with taxes as an infrastructure service, like roads, giving everyone expanded powers of cognition, memory, and expertise, and raising the consciousnesses of humanity to new heights. Probably in USA it wouldn't happen if we judge ourselves only in zero sum relation to others - helping everyone would be a wash and only waste our money!

The US spends more on its citizens than almost any other country, and more on helping other countries than any other country.

The problem with making something nationalised or a utility is you'd better have made sure there's no innovation needed or risk required. Once that's all settled, then maybe consider it.

Some governments probably already do and use it against so-called terrorists or enemies of the people…
crowd source to pay for the gpu rentals.
A company that wants to sell you the hardware that LLMs run on might do this. NVIDIA? Apple?
Pooling resources a la SETI@home would be an interesting option I would love to see.
My understanding is that can work for model inference but not for model training.

https://github.com/bigscience-workshop/petals is a project that does this kind of thing for running inference - I tried it out in Google Colab and it seemed to work pretty well.

Model training is much harder though, because it requires a HUGE amount of high bandwidth data exchange between the machines doing the training - way more than is feasible to send over anything other than a local network connection.
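A back-of-envelope sketch of why the bandwidth matters. The numbers are my assumptions, not measurements: a 7B-parameter model, 2 bytes per value, one full copy of the gradients exchanged per training step, a 100 Mbit/s home connection versus a 100 Gbit/s link inside a cluster. Real systems use smarter schemes than this naive model, so treat it as an order-of-magnitude illustration only.

```python
def sync_seconds(n_params: float, bytes_per_param: int,
                 link_bits_per_sec: float) -> float:
    """Time to ship one full copy of the gradients over a single link."""
    payload_bits = n_params * bytes_per_param * 8
    return payload_bits / link_bits_per_sec

N_PARAMS = 7e9        # assumption: LLaMA-7B scale
BYTES_PER_PARAM = 2   # assumption: 16-bit values

home = sync_seconds(N_PARAMS, BYTES_PER_PARAM, 100e6)     # 100 Mbit/s
cluster = sync_seconds(N_PARAMS, BYTES_PER_PARAM, 100e9)  # 100 Gbit/s

print(f"home link:    {home:,.0f} s per step")
print(f"cluster link: {cluster:.2f} s per step")
```

Under these assumptions a single step costs about 19 minutes of pure transfer on the home link and about a second on the cluster link, before any computation happens.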

And a lot of expensive data scientists.
This is the type of task where if you'd want to pool resources, then it would be more efficient to pool dollars and buy compute power rather than pool compute power - I'd assume that even if you treat the decentralized hardware as free, just the extra electricity cost of using it is more expensive than just renting a centralized server which can do it efficiently.
SETI@home (and similar projects) fall into the domain of embarrassingly parallelizable ( https://en.wikipedia.org/wiki/Embarrassingly_parallel ).

My own experience with this was a distributed ray tracer where the server sent the full model to the machines and then each machine would ask for one scan line to do, report back, and then ask for another scan line and repeated.

There was no interaction between the machines - what was on one scan line didn't need any coordination with what was on another scan line.

Likewise, with SETI@home, the server could give you a chunk of data and you could analyze that chunk - the contents of another chunk of data didn't change the analysis being done on this one.

Furthermore, these can be done asynchronously and then assembled when everything is done. Only the very final product / analysis / artifact needs all of the data and nothing other than the end process is waiting on any sub process.
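A small sketch of that scan-line pattern, using threads on one machine in place of networked workers. `render_scanline` is a hypothetical placeholder for the actual ray tracing; the point is only that each line depends on its own index and nothing else.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 8, 4

def render_scanline(y: int) -> list[int]:
    """Placeholder per-line work: depends only on y, never on other lines."""
    return [(x * y) % 256 for x in range(WIDTH)]

def render_image(workers: int = 4) -> list[list[int]]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, whatever order they finish in,
        # so assembling the final image needs no coordination between workers.
        return list(pool.map(render_scanline, range(HEIGHT)))

image = render_image()
print(len(image), "scan lines rendered")
```

Running it with one worker or four gives the same image, which is what makes this kind of job easy to hand out to volunteers.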

For doing gradient descent ( https://www.3blue1brown.com/lessons/gradient-descent ), as I understand it, each iteration is dependent on the previous one.
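A minimal sketch of that dependency, on a one-dimensional toy problem of my own choosing (minimizing (w - 3)^2). The learning rate and step count are arbitrary.

```python
def gradient_descent(grad, w0: float, lr: float, steps: int) -> float:
    w = w0
    for _ in range(steps):
        # Step t+1 cannot start until step t has produced its w.
        w = w - lr * grad(w)
    return w

# Gradient of (w - 3)^2 is 2 * (w - 3); the minimum is at w = 3.
w_final = gradient_descent(lambda w: 2 * (w - 3), w0=0.0, lr=0.1, steps=100)
print(w_final)
```

Unlike the scan lines, the loop iterations cannot be handed out to different machines: the parallelism has to come from inside each `grad(w)` evaluation.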

Doing 13,002 dimensional (for the example of a 784 -> 16 -> 16 -> 10 neuron net digit recognizer in the 3b1b page) matrix math is the parallel part... but if you get into the billions of parameters it gets much larger. Matrix multiplication has difficulty across a network. For example - http://www.lac.inpe.br/~stephan/CAP-372/Fox_example.pdf and http://www.cs.csi.cuny.edu/~gu/teaching/courses/csc76010/sli...

> We are now ready for the second stage. In this stage, we broadcast the next column (mod n) of A across the processes and shift-up (mod n) the B values.

That use of "broadcast" - the matrix multiplication is limited by the speed of the slowest node and it needs to send all the data from the previous calculation to all the nodes making it difficult to use across a network that experiences latency.
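As a sanity check on the 13,002 figure for the 784 -> 16 -> 16 -> 10 network mentioned above: assuming fully connected layers with one bias per neuron, the count is just weights plus biases for each pair of adjacent layers. The function name is mine.

```python
def count_parameters(layer_sizes: list[int]) -> int:
    """Weights (n_in * n_out) plus biases (n_out) for each dense layer."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

total = count_parameters([784, 16, 16, 10])
print(total)  # 12,560 + 272 + 170 = 13,002
```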

When doing ML training, they make use of TB/sec of bandwidth... and the high end extremes are in PB/sec ( https://www.cerebras.net/product-chip/ ) ... and I'm sitting here watching Steam download.

The inefficiencies of the network, slow computers, and the amount of data transfer needed to perform the next calculation make network distributed machine learning "not a good choice" at this time.

The crazy thing to me is that this means we're approaching being able to have a huge chunk of human knowledge just sitting there locally on your machine. I asked ChatGPT 4 about my old professor and it was able to write a few paragraphs on her including some very specific details. It's like you can fit most of the value of a search engine AND the retrieved pages into a quite small hardware footprint.
it can't be factual though, otherwise you'll have found compression with infinite ratio. I think the next step is a model that can say "idk" rather than coming up with bullshit