| HN Mirror



	by paradox242 639 days ago
	I don't see how this is sustainable. We have essentially eaten the seed corn. These current LLMs have been trained on an enormous corpus of mostly human-generated technical knowledge, from sources that we already know are being polluted by AI-generated slop. We also have preliminary research into how poorly these models do when trained on data generated by other LLMs. Sure, they can coast off of that initial training set for maybe 5 or more years, but where will the next giant set of unpolluted training data come from? I just don't see it, unless we get something better than LLMs which is closer to AGI, or an entire industry is created explicitly to produce curated training data to be fed to future models.

3 comments

_DeadFred_ 639 days ago

These tools also require the developer class that they are intended to replace to continue to do what they currently do (create the knowledge source to train the AI on). It's not like the AIs are going to be creating the accessible knowledge bases to train AIs on, especially for new language extensions/libraries/etc. This is a one and f'd development. It will give a one-time gain, and then companies will be shocked when it falls apart and there are no developers trained up (because they all had to switch careers) to replace them. Unless Google's expectation is that all languages/development/libraries will just be static going forward.

layer8 639 days ago

One of my concerns is that AI may actually slow innovation in software development (tooling, languages, protocols, frameworks and libraries), because the opportunity cost of adopting them will increase, if AI remains unable to be taught new knowledge quickly.

mathw 639 days ago

It also bugs me that these tools will reduce the incentive to write better frameworks and language features if all the horrible boilerplate is just written by an LLM for us rather than finding ways to design systems which don't need it.

The idea that our current languages might be as far as we get is absolutely demoralising. I don't want a tool to help me write pointless boilerplate in a bad language, I want a better language.

batty_alex 639 days ago

This is my main concern. What's the point of other tools when none of the LLMs have been trained on them and you need to deliver yesterday?

It's an insanely conservative tool

jamil7 639 days ago

You already see this if you use a language outside of Python, JS or SQL.

wahnfrieden 639 days ago

that is solved via larger contexts

layer8 639 days ago

It’s not, unless contexts get as large as comparable training materials. And you’d have to compile adequate materials. Clearly, just adding some documentation about $tool will not have the same effect as adding all the gigabytes of internet discussion and open source code regarding $tool that the model would otherwise have been trained on. This is similar to handing someone documentation and immediately asking questions about the tool, compared to asking someone who has years of experience with the tool.

Lastly, it’s also a huge waste of energy to feed the same information over and over again for each query.

wahnfrieden 638 days ago

- context of millions of tokens is frontier

- context over training is like someone referencing docs vs vaguely recalling from decayed memory

- context caching

layer8 638 days ago

You’re assuming that everything can be easily known from documentation. That’s far from the truth. A lot of what LLMs produce is informed by having been trained on large amounts of source code and large amounts of discussions where people have shared their knowledge from experience, which you can’t get from the documentation.

0points 639 days ago

Yea, I'm thinking along the same lines.

The companies valuing the expensive talent currently working at Google will be the winners.

Google and others are betting big right now, but I feel the winners might be those who watch how it unfolds first.

brainwad 639 days ago

The LLM codegen at Google isn't unsupervised. It's integrated into the IDE as both autocomplete and a prompt-based assistant, so you get a lot of feedback from a) which suggestions the human accepts and b) how they fix a suggestion when it's not perfect. So future iterations of the model won't be trained on raw LLM output, but on a mixture of human-written code and human-corrected LLM output.

As a dev, I like it. It speeds up writing easy but tedious code. It's just a slightly smarter version of the refactoring tools already common in IDEs...

kelnos 639 days ago

What about (c) the human doesn't realize the LLM-generated code is flawed, and accepts it?

monocasa 639 days ago

I mean, what happens when a human doesn't realize the human-generated code is wrong and accepts the PR, and it becomes part of the corpus of 'safe' code?

jaredsohn 639 days ago

Presumably someone will notice the bug in both of these scenarios at some point and it will no longer be treated as safe.

skydhash 639 days ago

Do you ask a junior to review your code or someone experienced in the codebase?

loki-ai 639 days ago

Maybe most of the code in the future will be very different from what we’re used to. For instance, AI image processing/computer vision algorithms are being adopted very quickly, given that the best ones are now mostly transformer networks.