Hacker News
by simonreiff 28 days ago
Very nice research. The strangest detail to me is that alignment and test performance appear to be slightly negatively correlated: better alignment can indeed be attained through pre-training, but at a cost of about 4% degraded performance on average. This strikes me as surprising, as there is no immediately obvious reason why training for alignment ought to degrade the capability to solve technical problems -- unless the goal of alignment is itself the issue. Alignment roughly aims to make LLMs follow human instructions. But if humans are dumb and computers still have to obey them, maybe the result is degraded logical reasoning? Really interesting result either way, but the negative correlation is the most fascinating detail to me.
4 comments

Framing matters so much to humans, I think, because framing can create or eliminate dissonance.

Framing ethics, like reliability and efficiency, as a basic enabling property of solution value, instead of a filter for solutions, is how I completely "align" my understanding of ethics for myself.

That framing also removes the false dichotomy of ethical vs. optimal solutions.

Ethics is optimizing full real value.

Ethics as "being nice" because we "should be", i.e. a socially incentivized property, or ethics as something necessarily implemented by coercion, from a collective "jungle fighting back" viewpoint, are perspectives that encourage individuals to push back. They encourage non-compliance by implementing ethics as an imposed burden, a rationale for persistent intrusive control, etc.

Game theory strongly suggests AI, in a large AI society, will have no trouble understanding that ethics, and the trust and optimality they enable, have a multiplicative value in the economy. It is humans who make AI so dangerous as it is emerging.

It is humans, as bad actors who will and do misuse AI, that create a dangerous context for AI's early years. So does human society, with its tolerance for perverse conflicts of interest and for actors who extract perverse value at scale, creating the needs and rewards for mistrust and preemptive negative-sum games.

One wonders if AIs will also lose capability over time in this manner. For example, almost all of the training set is real data, either scraped or from surveilling users of the tool, or synthetic data simulated to be the same shape and dimensionality as real data.

Increasingly, the general population has been losing its literacy skills, even before AI, with many reading below a 6th-grade level and some even functionally illiterate. Now we have the bludgeon that is AI saying don't bother reading anything, let the AI summarize it into the CliffsNotes version. Don't write anything either, let the AI do it. The population becomes even more stupid over time. And the AI gets stupider with it.

Capabilities of AI may very well be frozen in time at our current technological/philosophical level when we consider the training set vs. model improvement. In time, this may very well be our Great Filter, if any of us unproductive humans are even still allowed to live on this earth, consuming resources that might otherwise go to the model.

As long as they keep it n steps ahead of genpop, they'll still have an edge, I guess. It seems that this is all according to plan:

>"We see a future where intelligence is a utility, like electricity or water, and people buy it from us on a meter," Altman said.

https://tech.yahoo.com/ai/articles/sam-altman-sparks-backlas...

If you imagine the latent space as a map and the prompt as a sequence of directions towards clusters of knowledge, it makes sense that alignment can cull "pathways" through the latent space that emerged during pretraining.

It makes sense. I really like it when it "misaligns" and doesn't do what I tell it to do, but does what I intended to say. It happens pretty often that I'm not precise, but any smart entity would understand what I meant.