by danielmarkbruce 851 days ago
There is a decent (<50%, >20%) chance that frontier foundation models are less oligopoly-like than it seems. The reason is that there are so many levers to pull and so much low-hanging fruit.

For example:
* Read the Bloomberg GPT paper - they create their own tokenizer. For specialized domains (finance, law, medicine, etc.) the vocabulary is very different and there is likely a lot to do here, where individual tokens really need to map to specific concepts, and having a concept captured in several tokens makes it too hard to learn on limited domain data.
* Data - so many ways to do data differently - more/less, cleaner, "better" on some dimension.
* Read the recent papers on different decoding strategies - there seems to be a lot to do here.
* Model architecture (SSM etc.). If you speak to people who aren't even researchers, they have 10 ideas around architecture and some of them are decent-sounding ideas - lots of low-hanging fruit.
* System architecture - i.e., we're likely to see more and more "models" served via API which are actually systems of several model calls, and there is a lot to do here.
* Hardware, lower precision, etc. are likely to make training much cheaper.
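To make the tokenizer point concrete, here is a toy sketch (not how Bloomberg GPT or any real tokenizer works). It uses a simple greedy longest-match segmenter, and both vocabularies are made up for illustration. The idea is just that a general vocabulary fragments a legal term into several pieces, while a domain vocabulary can map it to one token.

```python
def tokenize(text, vocab):
    """Greedy longest-match segmentation; unknown text falls back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Hypothetical vocabularies, invented for this example.
GENERAL_VOCAB = {"in", "dem", "ni", "fication", "cap", " "}
DOMAIN_VOCAB = GENERAL_VOCAB | {"indemnification", "indemnification cap"}

term = "indemnification cap"
general_tokens = tokenize(term, GENERAL_VOCAB)
domain_tokens = tokenize(term, DOMAIN_VOCAB)

print(len(general_tokens), general_tokens)
print(len(domain_tokens), domain_tokens)
```

With the general vocabulary the model has to learn that six pieces jointly mean one concept; with the domain vocabulary the concept is a single token.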

It's reasonably likely (again, guessing <50%, >20%) that this large set of levers becomes a source of constant leap-frogging for years and years. Or, at least, the levers become choices/trade-offs rather than strictly "better".

I agree this is a potential outcome. One big question is generalizability versus niche models. For example, is the best legal model a frontier model + a giant context window + RAG? Or is it a niche model trained or fine-tuned for law?

Right now, at least, people seem to decouple some measures of how smart the model is from its knowledge base, and for now the really big models seem smartest. So part of the question is how insightful / synthesis-centric the model needs to be versus effectively doing regressions....

Frontier model + RAG is good when you need cross-discipline abilities and general knowledge; niche models are best when the domain is somewhat self-contained (for instance, if you wanted a model that is amazing at role-playing certain types of characters).

The future is model graphs with networked mixtures of experts, where models know about other models and can call them as part of recursive prompts, with some sort of online training to tune the weights of the model graph.
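A rough sketch of what such a model graph could look like, under heavy assumptions: every name here (`Node`, `ModelGraph`, `feedback`) is hypothetical, the "models" are plain functions standing in for real model calls, routing just follows the highest-weight edge, and the "online training" is a simple move-toward-reward update on edge weights.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Node:
    name: str
    answer: Callable[[str], str]  # stand-in for a real model call
    edges: Dict[str, float] = field(default_factory=dict)  # neighbor name -> weight


class ModelGraph:
    def __init__(self, lr=0.1):
        self.nodes = {}
        self.lr = lr  # step size for the edge-weight update

    def add(self, node):
        self.nodes[node.name] = node

    def call(self, name, prompt, depth=0, max_depth=2):
        """Answer a prompt, first delegating to the highest-weight neighbor if any."""
        node = self.nodes[name]
        if depth < max_depth and node.edges:
            best = max(node.edges, key=node.edges.get)
            # Recursive prompt: the neighbor's answer is appended to this node's prompt.
            prompt = prompt + " | " + self.call(best, prompt, depth + 1, max_depth)
        return node.answer(prompt)

    def feedback(self, src, dst, reward):
        """Online update: nudge the src -> dst edge weight toward the observed reward."""
        w = self.nodes[src].edges[dst]
        self.nodes[src].edges[dst] = w + self.lr * (reward - w)


graph = ModelGraph(lr=0.5)
graph.add(Node("general", lambda p: f"[general] {p}", {"law": 0.5, "finance": 0.4}))
graph.add(Node("law", lambda p: f"[law] {p}"))
graph.add(Node("finance", lambda p: f"[finance] {p}"))

first = graph.call("general", "q")            # routed through "law"
graph.feedback("general", "law", reward=0.0)  # "law" edge drops from 0.5 to 0.25
second = graph.call("general", "q")           # now routed through "finance"
```

The depth limit is what keeps recursive calls from looping forever if the graph has cycles.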

> The future is model graphs with networked mixtures of experts, where models know about other models and can call them as part of recursive prompts, with some sort of online training to tune the weights of the model graph.

What's the difference between that and combining all of the models into a single model? Aren't you just introducing limitations in communication and training between different parts of that über-model, limitations that may as well be encoded into the single model if they're useful? Are you just partitioning for training performance? Which is a big deal, of course, but it just seems like guessing the right partitioning and communication limitations is not going to be straightforward compared to the usual stupid "throw it all in one big pile and let it work itself out" approach.

The limitation is the amount of model you can fit on your hardware. Also, information about one domain can sometimes introduce biases in another that are very hard to fix, so training on one domain only can produce much better results.

Yup, it's unclear. The current ~consensus is "general purpose frontier model + very sophisticated RAG/system architecture", for legal as an example. I'm building something here using this idea and think it's 50/50 (at best) that I'm on the right path. It's quite easy to build something that produces very clever-sounding but often wrong insights into various legal agreements (M&A docs, for example). When looking at the tokenization, the training data, decode, and architecture (lots of guesses) of the big models, there are a lot of things where the knobs seem turned slightly incorrectly for the domain.

Some of the domains are so large that a specialized model might seem niche but the value prop is potentially astronomical.