by aspenmartin 24 days ago
Well, that historical content and code still exists, right? Are you just saying “what if we’re in a world of walled gardens now, where OSS dies because people don’t want their work stolen”? In that case: these companies will still get data, and they don’t need OSS anymore. It’s already web-crawled or licensed or commissioned; they pay people to generate novel traces when they need them, or at the very least sets of prompts and tests for verification. Then the synthetic data that passes verification gets added to the training set.

This is super hilarious :-)))

Do you think creating the orders of magnitude of content the internet produced organically, and which LLM creators are stealing, is cheap? If they actually have to pay for content creation while competing with content creators on the, you know, content creation front via LLM generation, the entire business model of LLMs collapses.

You can't have the mountains of data needed for LLMs in the decades to come, if your LLMs put the writers and artists out of work.

It’s literally how these models are trained today. They of course use open source data, but that’s no longer the most important source; what matters most is high quality prompts, verifiable tests, and a lot of inference compute. They also have massive flywheels from users, from which they can mine good data or, at the very least, good prompts, which can be just as important.
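
To make the "prompts plus verifiable tests" idea concrete, here is a minimal sketch of filtering generated code by running tests against it. Everything in it is hypothetical: the names `passes_all` and `keep_verified`, the hand-written `candidates` standing in for model outputs, and the toy `add` task are assumptions for illustration, not how any particular lab does it.

```python
from typing import Callable, Dict, List

# A test receives the namespace produced by running a candidate
# and returns True if the candidate behaves correctly.
Test = Callable[[Dict[str, object]], bool]


def passes_all(source: str, tests: List[Test]) -> bool:
    """Run one candidate and report whether it passes every test."""
    namespace: Dict[str, object] = {}
    try:
        # A real pipeline would sandbox this; exec is only for the sketch.
        exec(source, namespace)
        return all(test(namespace) for test in tests)
    except Exception:
        # Syntax errors, missing names, and crashes all count as failures.
        return False


def keep_verified(candidates: List[str], tests: List[Test]) -> List[str]:
    """Keep only the candidates that pass verification."""
    return [src for src in candidates if passes_all(src, tests)]


# Hypothetical stand-ins for model-generated answers to one prompt.
candidates = [
    "def add(a, b):\n    return a + b\n",  # correct
    "def add(a, b):\n    return a - b\n",  # wrong result
    "def add(a, b)\n    return a + b\n",   # does not parse
]

tests: List[Test] = [
    lambda ns: ns["add"](2, 3) == 5,
    lambda ns: ns["add"](-1, 1) == 0,
]

verified = keep_verified(candidates, tests)
print(len(verified), "of", len(candidates), "candidates kept")
```

Only the surviving candidates would go into the training set, which is why good prompts and good tests can matter as much as raw scraped data.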
And everything we know about these companies points to unsustainability, before we even get to the very high impact content lawsuits, which haven't even been settled. Let alone the many data sets being pulled out of public view and moved to anti-LLM licenses (with explicit licensing for training).

We will see how this shakes out in the coming years, as Anthropic, OpenAI & co file for IPO or run out of private funding. Grok is already on the ropes as seen from the SpaceX IPO.

You think this train is going to stop because of a lawsuit? And again, if all data was officially off limits for these companies, it wouldn't matter. They have code traces from their users, which are arguably much better; they can license code (you'd be surprised to know that these companies are not just stealing everyone's data, they are paying for it); and they can create data by paying people to do it.

And yes, we will see how it shakes out. Anthropic or OpenAI may collapse just as Netscape did, but I hope your implication is not "AI in general will be extinguished like the blockchain" or something.

I've read probably hundreds of historical books at this point and the only thing most historians agree on is:

Nothing was set in stone. The way many historical things happened the way they did was due to accident, sheer chance.

> And yes, we will see how it shakes out. Anthropic or OpenAI may collapse just as Netscape did, but I hope your implication is not "AI in general will be extinguished like the blockchain" or something.

I think the current LLM economy will collapse, leaving behind a few survivors. There will be widespread adoption of cheap OSS LLMs and of more limited, economically viable functionality provided by people with deep pockets like Google. As LLM economics start making more sense, LLMs will be everywhere, once the hardware becomes cheaper and more available.

Regarding lawsuits, do you think Disney & co will take this lying down? The freaking DMCA - an American law - is enforced <<internationally>>. It will take a long time but LLMs will be domesticated.

> Nothing was set in stone. The way many historical things happened the way they did was due to accident, sheer chance.

I agree with you on a technical level, and even in a non-cynical "humanity really can rally and change things that seem insurmountable" sense, but you have read way more history than I have. All I know is that there is such a frantic geopolitical aspect to this, and such a staggering amount of funding and addressable market, that I see zero path to winding anything down. Unlike blockchain, this is powered by both business _and_ government (no government would give up control of the money supply; it's shocking to me that people believe otherwise).

> I think the current LLM economy will collapse, leaving behind a few survivors. There will be widespread adoption of cheap OSS LLMs and of more limited, economically viable functionality provided by people with deep pockets like Google. As LLM economics start making more sense, LLMs will be everywhere, once the hardware becomes cheaper and more available.

Cheap OSS LLMs are used everywhere, all the time. They are great, and with subsidies from, say, China, they could even be competitive with frontier models, but that model of the world requires this mysterious OSS development running at a big, big loss. It takes almost a billion dollars to train a frontier model. For many, many cases, you do not need frontier model performance. When I do, say, video captioning, I use small OSS VL models.

Is your theory predicated on OSS models closing some sort of performance gap to the frontier? Or on a compromise: less spend for lower performance? What, to you, doesn't make sense about LLM economics? Like, LLMs are everywhere. If you think "oh, people will just settle for slightly less performance for cheaper", that should have already played out. We've had the same dynamics at every scale: frontier performance is expensive, but then that same performance costs roughly 10x less in a year's time. But you don't see people stopping at, like, GPT-4 and not adopting the frontier models of today.

I think you're right about the value of OSS LLMs, but I don't see what would change the calculus to make frontier models somehow less important. It's like in the 90's when we were like "1 GIGABYTE OF RAM? how will that ever be necessary!?!?" and sure, you don't need 1 GB of RAM for everything! We have embedded systems. But it's not like there isn't a booming market for >> 1 GB memory modules.

> Regarding lawsuits, do you think Disney & co will take this lying down? The freaking DMCA - an American law - is enforced <<internationally>>. It will take a long time but LLMs will be domesticated.

Not saying lawsuits will be fruitless, I'm sure they will chip something off of the industry, but by now it just won't matter. We're talking about trillions in spend, multiple countries, a government that sees this as a non-negotiable game to win from a geopolitical and military standpoint, and our government knows that they can't execute this themselves despite what I imagine is their distaste for Silicon Valley tech CEOs and their grandstanding. Maybe a lawsuit kneecaps someone, which would be huge, but that doesn't matter for AI generally. Maybe a lawsuit restricts data use, that's fine, these companies have deep pockets for licensing and commissioning datasets; they have opt-in-by-default user flywheels.

That sounds like it would reduce the blazing progress of the last decades to a snail's pace, some twilight where software is just average, as it always was and always will be. The idea that people will keep doing the opposite of what is now incentivized doesn't convince me, basically. If just using the LLM gets you ahead in a time of severe pressure, then most people will do that, and by the time anyone realizes they kinda need a FEW people to actually be able to reason about something from start to finish, it might be too late.

We're not such a smart species. It's not like we've managed so far. We're just adding unsolved problems and distracting ourselves with even bigger problems. The world could have been fed and clothed by the mid 20th century, and we could have solved climate change by the 1980s (talking out of my ass here, but with confidence in my general point), but instead we now throw everything into the furnace in the hopes it will create a deus ex machina, like in that very bad Isaac Asimov story. I think we are absolutely capable of lobotomizing ourselves (as a species), like a toddler playing with an electrical socket and shocking itself. I don't say this to be snarky; I honestly think we're that unserious and ignorant about what we do and the environment we do it in.

But I also really should look into what you answered about LLMs learning from themselves. I've heard it mentioned before, but I still have no real clue. I will try to rectify that. I mean, I really, really want to be wrong on this; only a monster wouldn't.

> by the time anyone realizes they kinda need a FEW people to actually be able to reason about something from start to finish, it might be too late.

I don't think it will be "too late" by any reasonable definition. All those things are learnable, and companies that really need to overcome it will. But they won't be open with their knowledge. Learning/training will be expensive, and once people acquire it, they won't share it the way open source and programming tech blogs did.