Hacker News (HN Mirror)


by andreyk 784 days ago
Exciting to see this so soon after Anthropic's "Mapping the Mind of a Large Language Model" (under 3 weeks). I find these efforts really exciting; it is still common to hear people say "we have no idea how LLMs / Deep Learning works", but that is really a gross generalization as stuff like this shows. Wonder if this was a bit rushed out in response to Anthropic's release (as well as the departure of Jan Leike from OpenAI)... the paper link doesn't even go to arXiv, and the analysis is not nearly as deep. Though who knows, might be unrelated.

thegrim33 784 days ago

From the article:

"We currently don't understand how to make sense of the neural activity within language models."

"Unlike with most human creations, we don’t really understand the inner workings of neural networks."

"The [..] networks are not well understood and cannot be easily decomposed into identifiable parts"

"[..] the neural activations inside a language model activate with unpredictable patterns, seemingly representing many concepts simultaneously"

"Learning a large number of sparse features is challenging, and past work has not been shown to scale well."

etc., etc., etc.

People say we don't (currently) know why they output what they output, because .. as the article clearly states, we don't.

TrainedMonkey 784 days ago

I read this as "we have not built up tools / math to understand neural networks as they are new and exciting" and not as "neural networks are magical and complex and not understandable because we are meddling with something we cannot control".

A good example would be planes - it took a long while to develop mathematical models that could be used to model behavior. Meanwhile practical experimentation developed decent rules of thumb for what worked / did not work.

So I don't think it's fair to say that "we don't" (know how neural networks work); rather, we don't have math / models yet that can explain/model their behavior...

gradus_ad 784 days ago

Chaotic nonlinear dynamics have been an object of mathematical research for a very long time and we have built up good mathematical tools to work with them, but in spite of that, turbulent flow and similar phenomena (brains/LLMs) remain poorly understood.

The problem is that the macro and micro dynamics of complex systems are intimately linked, making for non-stationary non-ergodic behavior that cannot be reduced to a few principles upon which we can build a model or extrapolate a body of knowledge. We simply cannot understand complex systems because they cannot be "reduced". They are what they are, unique and unprincipled in every moment (hey, like people!).

baxtr 784 days ago

Physicists would probably argue that the system might be understood but that we don’t have the model for it yet.

Many natural phenomena look chaotic at best without a model. Once you have a model things fall into place and everything starts looking orderly.

Maybe it cannot be reduced. But maybe we are just observing the peripherals without understanding the inner workings.

MrsPeaches 784 days ago

If I can speak in aphorisms,

Creation is downhill, analysis is uphill.

Profound ideas often seem simple once understood.

baxtr 784 days ago

Well put.

In other words: simplicity is a hallmark of understanding.

Sharlin 784 days ago

"We don't know how X works" literally means "we don't have models yet that can explain X's behavior".

TFA is about making a tiny bit of progress towards such models. Perhaps you should read it.

ninetyninenine 784 days ago

The analogy to airplanes is not relevant imo. Our lack of understanding behind the physics of an airplane is different from our lack of understanding of what an LLM is doing.

The lack of understanding is so profound for LLMs that we can’t even fully define the thing we don’t understand. What is intelligence? What is understanding?

Understanding the LLM would be akin to understanding the human brain. Which presents a secondary problem. Is it possible for an entity to understand itself holistically in the same way we understand physical processes with mathematical models? Unlikely imo.

I think this project is a pipe dream. At best it will yield another analogy. This is what I mean: We currently understand machine learning through the analogy of a best fit curve. This project will at best just come up with another high level perspective that offers limited understanding.

In fact, I predict that all AI technology into the far future can only be understood through heavy use of extremely high level abstractions. It’s simply not possible for a thing to truly understand itself.

HarHarVeryFunny 784 days ago

I think you have to make a distinction between transformers and neural networks in general, maybe also between training and inference.

Many/most types of neural network such as CNNs are well understood since there is a simple flow of information. e.g. In a CNN you've got a hierarchy of feature detectors (convolutional layers) with a few linear classifier layers on top. Feature detectors are just learning decision surfaces to isolate features (useful to higher layers), and at inference time the CNN is just detecting these hierarchical features, then classifying the image based on combinations of these features. Simple.
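
The CNN picture above (feature detectors feeding a linear classifier) can be sketched in a few lines. This is a toy illustration under loud assumptions: it works on 1-D signals rather than images, the filters and classifier weights are hand-picked rather than learned, and the names `conv1d`, `features`, and `classify` are made up for this example.

```python
# Toy 1-D "CNN": hand-picked filters stand in for learned feature detectors.
# All names and weights are illustrative assumptions, not a trained model.

def conv1d(signal, kernel):
    """Slide the kernel over the signal (valid cross-correlation)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(xs):
    """Keep positive responses, zero out the rest."""
    return [max(0.0, x) for x in xs]

def features(signal):
    """Two feature detectors (rising edge, falling edge), each max-pooled to one number."""
    rising = max(relu(conv1d(signal, [-1.0, 1.0])))
    falling = max(relu(conv1d(signal, [1.0, -1.0])))
    return [rising, falling]

def classify(signal):
    """Linear classifier on top of the pooled features; returns the best-scoring label."""
    weights = {"step_up": [1.0, -1.0], "step_down": [-1.0, 1.0]}
    f = features(signal)
    scores = {label: sum(w * x for w, x in zip(ws, f))
              for label, ws in weights.items()}
    return max(scores, key=scores.get)

print(classify([0, 0, 0, 1, 1, 1]))
print(classify([1, 1, 1, 0, 0, 0]))
```

The information flow is fixed by the architecture here: filters respond, pooling summarizes, a linear layer decides. Only the weights would be learned in a real network.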

Transformers seem qualitatively different in terms of complexity of operation, not least because it seems we still don't even know exactly what they are learning. Sure, they are learning to predict next word, but just like the CNN whose output classification is based on features learnt by earlier layers, the output words predicted by a transformer are based on some sort of world model/derived rules learned by earlier layers of the transformer, which we don't fully understand.

Not only don't we know exactly what transformers are learning internally (although recent interpretability work gives us a glimpse of some of the sorts of things they are learning), but also the way data moves through them is partially learnt rather than prescribed by the architecture. We have attention heads utilizing learnt lookup keys to find data at arbitrary positions in the context, and then able to copy portions of that data to other positions. Attention heads learn to coordinate to work in unison in ways not specified by the architecture, such as the "induction heads" (consecutive attention head pairs) identified by Anthropic that seem to be one of the workhorses of how transformers are working and copying data around.
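
The "lookup keys" description corresponds to scaled dot-product attention: a query at one position is compared against the keys at every position, and the resulting weights decide whose values get copied. A minimal single-head sketch follows, assuming hand-set vectors in place of learned projections; the function name `attend` and all numbers are chosen for illustration only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query position."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Output is the attention-weighted mix of the value vectors.
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return weights, out

# Three context positions; the query is built to match the key at position 1.
keys = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights, out = attend([0.0, 10.0], keys, values)
print(weights.index(max(weights)))
```

Which position wins depends on the query and keys, which in a real transformer are computed from the data with learned weights. That is the sense in which the routing is learnt rather than fixed by the architecture.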

Additionally, there are multiple types of data learnt by a transformer, from declarative knowledge ("facts") that seem to mostly be learnt by the linear layers to the language/thought rules learnt by the attention mechanism that then affect the flow of data through the model, as discussed above.

So, it's not that we don't know how neural networks work (and of course at one level they all work the same - to minimize errors), but more specifically that we don't fully know how transformer-based LLMs work since their operation is a lot more dynamic and data dependent than most other architectures, and the complexity of what they are learning far higher.

passwordoops 784 days ago

"neural networks as they are new"

Yup, ANNs have only been around since the 1950s... Brand spanking new

freilanzer 784 days ago

> we don't have math / models yet that can explain/model their behavior...

So, what you're saying is we don't know how they work yet? It's not that deep.

realPtolemy 784 days ago

Could there also be a “legal hedging” reason for why you would release a paper like this?

By reaffirming that “we don’t know how this works, nobody does” it’s easier to avoid being charged with copyright infringement from various actors/data sources that have sued them.

ben_w 784 days ago

I'd be surprised if doing so had any impact on the lawsuits, but I'm not a lawyer.

icandoit 784 days ago

If you know how it works, you can make it better, faster, cheaper.

Without the 300k starting salaries. I imagine that is a stronger incentive.

It's the users of the LLMs that want to launder responsibility behind "computer said no".

dimitrios1 784 days ago

"I'm sorry officer, I didn't know I couldn't do that"

surfingdino 784 days ago

Not holding my breath for that hallucinated cure for cancer then.

ben_w 784 days ago

LLMs aren't the only kind of AI, just one of the two current shiny kinds.

If a "cure for cancer" (cancer is not just one disease so, unfortunately, that's not even as coherent a request as we'd all like it to be) is what you're hoping for, look instead at the stuff like AlphaFold etc.: https://en.wikipedia.org/wiki/AlphaFold

I don't know how to tell where real science ends and PR bluster begins in such models, though I can say that the closest I've heard to a word against it is "sure, but we've got other things besides protein folding to solve", which is a good sign.

(I assume AlphaFold is also a mysterious black box, and that tools such as the one under discussion may help us demystify it too).

submeta 784 days ago

Scary actually. Because how can we assess the risks when we don’t know what the system is capable of doing?

ein0p 784 days ago

We know exactly what the system is capable of doing. It’s capable of outputting tokens which can then be converted into text.

reducesuffering 784 days ago

And social media manipulation is just registers and bytes, wait no, sand and electrons.

isaacremuant 784 days ago

Just because you can do something with technology doesn't mean the problem is technology itself. It's like newspapers. Printing them is technology and allows all kinds of things. If you're of the authoritarian mindset, you'll want to control it all out of some stated fear, but you can do that for everything.

friendzis 784 days ago

If you can do something with technology, that something is part of risk assessment. Authoritarianism is irrelevant, that's engineering.

ben_w 784 days ago

Which is so broad as to be unhelpful.

We also know that petroleum mixed with air may be combusted to release energy; we needed to characterise this much better in order for the motor car to be distinguishable from a fuel-air bomb.

ein0p 784 days ago

And that's exactly my point. Regulating the underlying tech is utterly pointless in this case - it's utterly harmless by itself.

ben_w 784 days ago

That's exactly wrong, we know some things that can be expressed as "a sequence of tokens" are harmful and indeed have already made them crimes.

What we need is to characterise what is possible so we can skip the AI equivalent of Union Carbide in Bhopal.

leogao 784 days ago

We were planning to release the paper around this time independent of the other events you mention.

I think it is still predominantly accurate to say that we have no idea how LLMs work. SAEs might eventually change that, but there's still a long way to go.
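
For readers unfamiliar with the term, a sparse autoencoder (SAE) re-expresses an activation vector as a small number of active "features" and then reconstructs the original vector from them. The sketch below is a generic top-k variant with hand-set, tied weights. It is an illustration under those assumptions, not the code in the released repository, and `encode`, `decode`, `W_enc`, `W_dec`, and `k` are names chosen here.

```python
def encode(x, W_enc, k):
    """Project x onto each feature direction, then keep only the k largest
    activations (clamped at zero); every other feature is set to 0."""
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W_enc]
    top = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    return [max(0.0, pre[i]) if i in top else 0.0 for i in range(len(pre))]

def decode(z, W_dec):
    """Reconstruct the activation as a weighted sum of feature directions."""
    dim = len(W_dec[0])
    return [sum(z[j] * W_dec[j][i] for j in range(len(z))) for i in range(dim)]

def mse(a, b):
    """Mean squared reconstruction error."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

# A 2-D activation and 4 candidate features (more features than dimensions).
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
W_dec = W_enc  # tied weights, purely for brevity (an assumption)
x = [3.0, -2.0]
z = encode(x, W_enc, k=2)
x_hat = decode(z, W_dec)
print(z, x_hat, mse(x, x_hat))
```

In practice the weights are trained to minimize reconstruction error on real model activations, and the interpretability question is whether the few features that fire for a given input mean anything to a human.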

joaquincabezas 784 days ago

it makes sense that the leaders are building around similar ideas in parallel, for me it's a healthy sign

jerrygenser 784 days ago

> but that is really a gross generalization as stuff like this shows.

I think this research actually still reinforces that we still have very little understanding of the internals. The blog post also reiterates that this is early work with many limitations.

swyx 784 days ago

> Wonder if this was a bit rushed out in response to Anthropic's release

too lazy to dig up source but some twitter sleuth found that the first commit to the project was 6 months ago

likely all these guys went to the same metaphorical SF bars, it was in the water

szvsw 784 days ago

> likely all these guys went to the same metaphorical SF bars, it was in the water

It also is coming from a long lineage of thought, no? For instance, one of the things often taught early in an ML course is the notion that “early layers respond to/generate general information/patterns, and deeper layers respond to/generate more detailed/complex patterns/information.” That is obviously an overly broad and vague statement, but it is a useful intuition and can be backed up by inspecting, e.g., what maximally activates some convolution filters. So already there is a notion that there is some sort of spatial structure to how semantics are processed and represented in a neural network (even if in a totally different context, as in the image processing mentioned above), where “spatial” here is used to refer to different regions of the network.

Even more simply, in fact as simple as you can get: with linear regression, the most interpretable model you can get, you have a clear notion that different parameter groups of the model respond to different “concepts” (where a concept is taken to be whatever the variables associated with a given subset of coefficients represent).
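
The linear regression point can be made concrete with a small sketch: fit ordinary least squares and read each coefficient as the model's response to one "concept". The data below is made up (a noiseless price formula with hypothetical `area` and `age` variables), and `fit_linear` is a hand-rolled helper written for this example rather than a library call.

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept column
    n = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coef = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (b[i] - tail) / A[i][i]
    return coef

# Hypothetical data generated from: price = 50 + 3 * area - 2 * age
X = [(10, 1), (20, 5), (30, 2), (40, 8), (25, 4)]
y = [50 + 3 * area - 2 * age for area, age in X]
intercept, w_area, w_age = fit_linear(X, y)
print(round(intercept, 6), round(w_area, 6), round(w_age, 6))
```

Each fitted weight maps onto exactly one input variable, which is what makes the model easy to read. The interpretability work under discussion is looking for a comparable mapping inside models where no such one-to-one structure is given up front.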

In some sense, at least in a high-level/intuitive reading of the new research coming out of Anthropic and OpenAI, I think the current research is just a natural extension of these ideas, albeit in a much more complicated context and massive scale.

Somebody else, please correct me if you think my reading is incorrect!!

leogao 784 days ago

This project has been in the works for about a year. The initial commit to the public repo was not really closely related to this project, it was part of the release of the Transformer debugger, and the repo was just reused for this release.

swyx 784 days ago

ha thank you Leo; i myself felt uneasy pointing out commit date based evidence and you just proved why.

mild followup question: any alpha to be gained from training the same SAEs on two different generations of GPT4, eg GPT4 on march 2023 vs june 2023 vintage, whatever is most architecturally comparable, and diffing them. what would be your priors on what you’d find?

nicce 784 days ago

Visualizer was added 18 hours ago:

https://github.com/openai/sparse_autoencoder/commit/764586ae...

pininja 784 days ago

It’s hard to believe it was written overnight.. this seems more like a public stable dump of what they’ve been working on without saying when they started. Some clues could come from looking at when all the deps it uses were released. They’re also calling this version 0.1.67, though I’m not sure that means anything either.

darby_nine 784 days ago

> Mapping the Mind of a Large Language Model

The fact that a paper is implying an LLM has a mind doesn't exactly bode well for the people who wrote it, not to mention the continued meaningless babbling about "safety". It'd also be nice if they could show their work so we could replicate it. Still, not shabby for an ad!

castigatio 783 days ago

Well - what is a mind exactly? We don't really have a good definition for a human mind. Not sure we should be claiming domain over the term. It's not a terrible shorthand for discussing something that reads and responds as if it had some kind of mind - whether technically true or not (which we honestly don't know).

darby_nine 783 days ago

> It's not a terrible shorthand for discussing something that reads and responds as if it had some kind of mind

I really don't see it like that. It has very little memory, no ability to introspect before "choosing" what to say, and no awareness of the coherency of statements (i.e. whether or not it's saying things that directly contradict its training). It seems to have little sense of non-pattern-driven computation beyond what token patterns can encode at a surface level (e.g. of course it knows 1 + 1 = 2, but does it recognize odd notation, or can it recognize and analyze arbitrary statements? Of course not). I fully grant it is compelling evidence we can replicate many brain-like processes with software neural nets, but that's an entirely different thing than raising it to a level of thought or consciousness or self-awareness (which I argue is necessary in order to appropriately issue coherent statements, as perspective is a necessary thing to address even when attempting to make factual statements). It strikes me as a lot closer to an analogy for a potential constituent component of a mind rather than a mind per se.

realPtolemy 784 days ago

Indeed, and the very last section about how they’ve now “open sourced” this research is also a bit vague. They’ve shared their research methodology and findings… But isn’t that obligatory when writing a public paper?

lanceflt 784 days ago

https://github.com/openai/sparse_autoencoder

They actually open sourced it, for GPT-2 which is an open model.

realPtolemy 784 days ago

Thanks, I must have read through the document too hastily.

3abiton 784 days ago

But even with current efforts so far, I don't think we have an understanding of how/why these emergent capabilities are formed. LLMs are still as much a black box as ever.

choppaface 784 days ago

The Deep Visualization Toolbox from nearly 10 years ago is solid precedent for understanding deep models, albeit much smaller models than LLMs. It’s hard to say OpenAI’s “visualization” released today is nearly as effective. It could be that GPT-4 is much harder to instrument.

https://github.com/yosinski/deep-visualization-toolbox

imjonse 784 days ago

Both Leike and Sutskever are still credited in the post.

throw46365 784 days ago

> that is really a gross generalization

It's really not though, and on multiple levels.

At the shit-tier level, the majority of people building applications on this technology are projecting abilities onto it that even they can't really demonstrate it has in a reliable way.

At the inventor level, the people who make it are dependent on projecting the idea that magic will happen when they have more compute.

At every level, the products are so far ahead of the knowledge that it's actually unethical.
