by swyx 741 days ago
> Wonder if this was a bit rushed out in response to Anthropic's release

too lazy to dig up source but some twitter sleuth found that the first commit to the project was 6 months ago

likely all these guys went to the same metaphorical SF bars, it was in the water

> likely all these guys went to the same metaphorical SF bars, it was in the water

It also comes from a long lineage of thought, no? For instance, one of the things often taught early in an ML course is the notion that “early layers respond to/generate general information/patterns, and deeper layers respond to/generate more detailed/complex patterns/information.” That is obviously an overly broad and vague statement, but it is a useful intuition, and it can be backed up by inspecting, e.g., what maximally activates particular convolution filters. So there is already a notion of some sort of spatial structure to how semantics are processed and represented in a neural network (even if in a totally different context, as in the image processing mentioned above), where “spatial” here refers to different regions of the network.
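To make the "what maximally activates a filter" idea concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the filter is a hand-picked 1-D edge detector rather than anything trained, and the candidate patches are hypothetical. It just scores each patch with a dot product and reports which one the filter responds to most.

```python
# Toy 1-D "convolution filter": a hand-picked edge detector.
# This is an assumption for illustration, not a trained filter.
edge_filter = [-1.0, 0.0, 1.0]

# Hypothetical candidate input patches to probe the filter with.
patches = {
    "flat": [0.5, 0.5, 0.5],
    "rising": [0.0, 0.5, 1.0],
    "falling": [1.0, 0.5, 0.0],
}


def activation(weights, patch):
    """Response of the filter at one position: a plain dot product."""
    return sum(w * x for w, x in zip(weights, patch))


# Score every patch, then pick the one with the largest response.
responses = {name: activation(edge_filter, p) for name, p in patches.items()}
best_patch = max(responses, key=responses.get)

print(responses)
print("most activating patch:", best_patch)
```

In a real network you would collect activations over a dataset (or optimize the input directly) instead of hand-writing three patches, but the inspection step is the same: rank inputs by how strongly a unit fires.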

Even more simply, in fact as simple as you can get: with linear regression, the most interpretable model you can get, you have a clear notion that different parameter groups of the model respond to different “concepts” (where a concept is taken to be whatever the variables associated with a given subset of coefficients represent).
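A minimal sketch of the linear regression point, assuming a made-up two-feature dataset with no noise and no intercept. The helper name `fit_two_feature_ols` and the data are hypothetical. Each fitted coefficient belongs to exactly one input variable, so you can read off how much each "concept" contributes.

```python
def fit_two_feature_ols(rows, targets):
    """Least squares for y ~ b1*x1 + b2*x2 (no intercept).

    Solves the 2x2 normal equations with Cramer's rule.
    """
    s11 = sum(x1 * x1 for x1, _ in rows)
    s12 = sum(x1 * x2 for x1, x2 in rows)
    s22 = sum(x2 * x2 for _, x2 in rows)
    t1 = sum(x1 * y for (x1, _), y in zip(rows, targets))
    t2 = sum(x2 * y for (_, x2), y in zip(rows, targets))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("features are collinear; coefficients are not unique")
    b1 = (t1 * s22 - t2 * s12) / det
    b2 = (t2 * s11 - t1 * s12) / det
    return b1, b2


# Made-up data generated from y = 3*x1 - 2*x2.
rows = [(1, 0), (0, 1), (1, 1), (2, 1)]
targets = [3 * x1 - 2 * x2 for x1, x2 in rows]

b1, b2 = fit_two_feature_ols(rows, targets)
print("coefficient for x1:", b1)
print("coefficient for x2:", b2)
```

The fit recovers one weight per variable, which is the sense in which a parameter group "responds to" a concept here. The collinearity check is also the first hint of why this gets harder in bigger models: once inputs overlap, the attribution to individual parameters stops being unique.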

In some sense, at least in a high-level/intuitive reading of the new research coming out of Anthropic and OpenAI, I think the current research is just a natural extension of these ideas, albeit in a much more complicated context and at a massive scale.

Somebody else, please correct me if you think my reading is incorrect!!

This project has been in the works for about a year. The initial commit to the public repo was not really closely related to this project; it was part of the release of the Transformer Debugger, and the repo was just reused for this release.

ha thank you Leo; i myself felt uneasy pointing out commit-date-based evidence and you just proved why.

mild followup question: any alpha to be gained from training the same SAEs on two different generations of GPT4, eg GPT4 on march 2023 vs june 2023 vintage, whatever is most architecturally comparable, and diffing them. what would be your priors on what you’d find?

It’s hard to believe it was written overnight. This seems more like a stable public dump of what they’ve been working on, without saying when they started. Some clues could come from looking at when all the deps it uses were released. They’re also calling this version 0.1.67, though I’m not sure that means anything either.