by keeda 7 days ago
I get the sense a lot of the warnings about LLMs were based heavily on known risks of machine learning at the time (which those references are all examples of). That was because the data was relatively narrow (e.g. hiring data). However, the scale of data that LLMs are trained on has qualitatively changed the risk landscape.

Like, before LLMs, biases in the data were clearly impacting biases in the model outputs, and that was a real risk (e.g. recruiting models deprioritizing minority candidates). But with LLMs it's not clear that the same risks apply, either due to multiple biases in the overwhelming amounts of data canceling out, or due to RLHF, or some mix of both, or some other emergent property.

The fact that Elon had to deliberately go out and create an "anti-woke" LLM indicates that the models do have biases, but those biases are not the same ones pre-LLM ML safety researchers were concerned about... and may even be aligned with the "well-known liberal bias" that reality has.

I suspect the risks we'll see with LLMs will be very different from what this or older papers focused on.

1 comments

The scale of the data and the size of the models don't change the underlying issue: the whole construction of these models is to start with a maximum likelihood language sampler (pre-training) and then massage it into a maximum utility language sampler (post-training) with some eye towards risk management and policy compliance ("safety"). It takes work to make model output fit any particular idea of "correct", whether it's Elon's particular ideology, the US Civil Rights Act, Xi Jinping Thought, or writing clean C++. More data and weights increase the complexity of tasks that we're able to model, but they don't automatically make the output "better" on any given axis.

Right, what I meant is the underlying issue is the same, but the large amount of data, along with the number of potentially conflicting and reinforcing biases going into LLMs, makes it hard to categorize or quantify risks.

Like, previously it was pretty straightforward to hypothesize and show that "historically minorities were discriminated against in hiring, so models trained on that recruiting data will exhibit the same biases." But now those biases are intermingled with a whole lot of other biases (e.g. data / RLHF about the ill effects of discrimination), so it gets harder to reason about the models' behavior.

As an example, I don't think anyone quite predicted that these models could become suicide ideation machines.