


	by energy123 501 days ago
	The main reason topography emerges in physical brains is because spatially distant connections are physically difficult and expensive in biological systems. Artificial neural nets have no such trade-off. So what's the motivation here? I can understand this might be a very good regularizer, so it could help with generalization error on small-data tasks. But it's hard to see why this should be on the critical path to AGI. As compute and data grow, you want less inductive bias. For example, CNN will beat ViT on small data tasks, but that flips with enough scale because ViT imposes less inductive bias. Or at least any inductive bias should be chosen because it models the structure of the data well, such as with causal transformers and language.

10 comments

AYBABTME 501 days ago

Locality of data and computation is very important in neural nets. It's the number one reason why training and inference are as slow as they are. It's why GPUs need super expensive HBM memory, why NVLink is a thing, why Infiniband is a thing.

If the problem of training and inference on neural networks can be optimized so that a topology can be used to keep closely related data together, we will see huge advancements in training and inference speed, and probably in model size as a result.

And speed isn't just speed. Speed makes impossible (not enough time in our lifetime) things possible.

A huge factor in DeepSeek being able to train on H800 (half the HBM bandwidth of H100) is that they used GPU cores to compress/decompress the data moved around between the GPU memory and the compute units. This reduces latency in accessing data and made up for the slower memory bandwidth (which translates into higher latency when fetching data). Anything that reduces the latency of memory accesses is a huge accelerator for neural nets. The number one way to achieve this is to keep related data next to each other, so that it fits in the closest caches possible.

mirekrusin 501 days ago

It's true, but isn't OP also correct? I.e., it's about speed, which implies locality, which implies approaches like MoE, which do exactly that, and that's unlike physical brain topology?

Having said that, it would be fun to see things like data-rearrangement moves based on the temperature of silicon parts after a training cycle.

nickpsecurity 501 days ago

Well, locality and the global nature of pre-training methods. The brain mostly uses local learning (Hebbian learning), which requires less data movement. AI firms putting as much money into making that scale as they did on backpropagation might drop costs a lot.

vlovich123 501 days ago

Unless GPUs work markedly differently somehow or there’s been some fundamental shift in computer architecture I’m not aware of, spatial locality is still a factor in computers.

Aside from HW acceleration today, designs like Cerebras would benefit heavily from reducing the amount of random access when reading the weights (and thus freeing up cross-chip memory bandwidth for other things).

whynotminot 501 days ago

This makes me remember game developers back when games could still be played directly from the physical disc. They would often duplicate data to different parts of the disc, knowing that certain data would often be streamed from disc together, so that seek times were minimized.

But those game devs knew where everything was spatially on the disc, and how the data would generally be used during gameplay. It was consistent.

Do engineers have a lot of insight into how models get loaded spatially onto a given GPU at run time? Is this constant? Is it variable on a per GPU basis? I would think it would have to be.

Hard to optimize for this.

jaek 501 days ago

This brings to mind The Story of Mel from programming folklore.

http://beza1e1.tuxen.de/lore/story_of_mel.html

abrookewood 501 days ago

Such a good read - some people really are on another level in their chosen field.

vlovich123 501 days ago

Right now models have no structure, so that access is random, but you definitely know where the data is located in memory since you put it there. It doesn’t matter about the physical location - it’s all through a TLB, but if you ask the GPU for a contiguous memory allocation it gives it to you. This is probably the absolute easiest thing to optimize for if your data access pattern is amenable to it.

harles 501 days ago

That could explain compute efficiency, but has nothing to do with the parameter efficiency pointed at in the paper.

vlovich123 501 days ago

Haven’t read the paper, but my guess is that it's for the same reason as with sparse attention networks (where they zero out many weights): the sparse tensors just end up being larger.

mayukhdeb 500 days ago

In this paper, we don't zero out the weights. We remove them.

vlovich123 499 days ago

Thanks for the correction! Can it be retrofitted into existing models through distillation or do you have to train the model from scratch?

cma 501 days ago

> The main reason topography emerges in physical brains is because spatially distant connections are physically difficult and expensive in biological systems.

The brain itself seems to have bottlenecks that aren't distance related, like hemispheres and the corpus callosum, which are preserved across all placental mammals; other mammalian groups have something similar and still have hemispheres. Maybe it's just an artifact of bilateral symmetry that is stuck in there from path dependence, or a way of forcing redundancy to make damage more recoverable, but maybe it has a big regularizing or alternatively specializing effect (regularization like dropout tends to force more distributed representations, which seems kind of opposite to this work and other work like "Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability," https://arxiv.org/abs/2305.08746 ).

jlpom 501 days ago

It increases modularity and small-worldness, which are in my book critical for AGI (surprised by the way that this publication doesn't cite https://www.nature.com/articles/s42256-023-00748-9).

mayukhdeb 501 days ago

Thank you for sharing this! We'll read through this and update the camera-ready version accordingly for ICLR 2025.

exe34 501 days ago

> CNN will beat ViT on small data tasks, but that flips with enough scale because ViT imposes less inductive bias

any idea why this is the case? CNNs have the bias that neighbouring pixels are somehow relevant - they are neighbours. ViTs have to re-learn this from scratch. So why do they end up doing better than CNNs?

TZubiri 501 days ago

Maybe this would be relevant for datacenters with significant distance between machines, or multidatacenter systems.

xpl 501 days ago

> So what's the motivation here?

Better interpretability, I suppose. Could give insights into how cognition works.

mayukhdeb 501 days ago

The motivation was to induce structure in the weights of neural nets and see if the functional organization that emerges aligns with that of the brain or not. Turns out, it does -- both for vision and language.

The gains in parameter efficiency were a surprise even to us when we first tried it out.

energy123 501 days ago

That's true, and interpretability is helpful for AI safety.

mayukhdeb 501 days ago

Indeed. What's cool is that we were able to localize literal "regions" in the GPTs which encoded toxic concepts related to racism, politics, etc. A similar video can be found here: https://toponets.github.io

More work is being done on this as we speak.

fakeparmesean 501 days ago

My understanding coming from mechanistic interpretability is that models are typically (or always) in superposition, meaning that most or all neurons are forced to encode semantically unrelated concepts because there are more concepts than neurons in a typical LM. We train SAEs (where we apply L1 reg and a sparsity penalty to “encourage” the encoder output latents to yield sparse representations of the originating raw activations), to hopefully disentangle these features, or make them more monosemantic. This allows us to use the SAE as a sort of microscope to see what’s going on in the LM, and apply techniques like activation patching to localize features of interest, which sounds similar to what you’ve described. I’m curious what this work means for mech interp. Is this a novel alternative to mitigating polysemanticity? Or perhaps neurons are still encoding multiple features, but the features tend to have greater semantical overlap? Fascinating stuff!
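
For readers unfamiliar with SAEs, here is a minimal sketch of the kind of objective described above: a ReLU encoder, a linear decoder, and a loss that adds an L1 penalty on the latents to the reconstruction error. The names `sae_loss`, `W_enc`, `b_enc`, `W_dec`, and `l1_coeff` are hypothetical, the weights are toy values rather than trained ones, and real SAE setups differ in details (decoder bias, normalization, penalty form).

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sae_loss(x, W_enc, b_enc, W_dec, l1_coeff):
    """Return (loss, latents): mean squared reconstruction error plus L1 on latents."""
    z = relu([h + b for h, b in zip(matvec(W_enc, x), b_enc)])
    x_hat = matvec(W_dec, z)
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(v) for v in z)  # the sparsity pressure on the latent code
    return mse + l1_coeff * l1, z


# Hypothetical toy setup: 2-dim activation, 3 latents (more latents than inputs).
x = [1.0, 2.0]
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
loss, z = sae_loss(x, W_enc, b_enc, W_dec, l1_coeff=0.1)
```

In this toy case the third latent is switched off by the ReLU, so the code is sparse and the reconstruction is exact; the remaining loss comes entirely from the L1 term.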

mayukhdeb 501 days ago

> the features tend to have greater semantical overlap?

This is true. The features closer together now have much stronger semantic overlap. You can watch how the weights self-organize in a GPT here: https://toponets.github.io/webpage_assets/banner_video.mp4

We're already studying the effects of topographic structure on polysemanticity.

cwillu 501 days ago

Was it toxicity though as understood by the model, or just a cluster of concepts that you've chosen to label as toxic?

I.e., is this something that could (and therefore, will) be turned towards identifying toxic concepts as understood by the Chinese or US government, or to identify (say) pro-union concepts so they can be down-weighted in a released model, etc?

mayukhdeb 501 days ago

We localized "toxic" neurons by contrasting the activations of each neuron for toxic vs. normal texts. It's a method inspired by old-school neuroscience.
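
A minimal sketch of what such a contrast could look like, assuming the statistic is simply the difference in mean activation between the two text classes (the thread doesn't say which statistic the authors actually use). The function name `rank_neurons_by_contrast` and the toy activations are hypothetical.

```python
from statistics import mean


def rank_neurons_by_contrast(toxic_acts, normal_acts):
    """Rank neurons by mean activation on toxic texts minus mean on normal texts.

    Each argument is a list of per-text activation vectors (one float per neuron).
    Returns (neuron_index, contrast) pairs, largest contrast first.
    """
    n_neurons = len(toxic_acts[0])
    contrasts = []
    for j in range(n_neurons):
        toxic_mean = mean(row[j] for row in toxic_acts)
        normal_mean = mean(row[j] for row in normal_acts)
        contrasts.append((j, toxic_mean - normal_mean))
    return sorted(contrasts, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy activations: 3 texts per class, 4 neurons.
toxic = [[0.1, 0.9, 0.2, 0.0], [0.0, 0.8, 0.3, 0.1], [0.2, 1.0, 0.1, 0.0]]
normal = [[0.1, 0.1, 0.2, 0.5], [0.0, 0.2, 0.3, 0.6], [0.2, 0.0, 0.1, 0.4]]
ranking = rank_neurons_by_contrast(toxic, normal)
top_neuron = ranking[0][0]  # neuron that responds most selectively to the toxic set
```

Note that a ranking like this only reflects whatever the two text sets were labeled as, which is the point cwillu raises above.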

immibis 501 days ago

Defining all politics as toxic is concerning, if it's not just a proof of concept. That's something dictatorships do so that people won't speak up.

jv22222 501 days ago

I had this idea the other day. Not sure if it relates but maybe?

https://twitter.com/justinvincent/status/1884357300703400274

mercer 501 days ago

I imagine it could be easier to make sense of the 'biological' patterns that way? like, having bottlenecks or spatially-related challenges might have to be simulated too, to make sense of the ingested 'biological' information.

ziofill 501 days ago

Perhaps they are more easily compressible? Once a bunch of nearby weights have similar roles one may not need all of them.

mayukhdeb 501 days ago

Yep. That is exactly the idea here. Our compression method is super duper naive. We literally keep every n-th weight column and discard the rest. Turns out that even after getting rid of 80% of the weight columns in this way, we were able to retain the same performance in a 125M GPT.
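
As a rough illustration of "keep every n-th weight column", here is a sketch on a plain list-of-rows matrix. The helper `keep_every_nth_column` and the toy matrix are hypothetical and not taken from the paper's code; with n = 5, 8 of every 10 columns are discarded, matching the 80% figure.

```python
def keep_every_nth_column(weights, n):
    """Return a copy of `weights` (a list of rows) keeping columns 0, n, 2n, ..."""
    return [row[::n] for row in weights]


# Hypothetical 2 x 10 weight matrix; n = 5 keeps 2 of 10 columns (discards 80%).
W = [[float(r * 10 + c) for c in range(10)] for r in range(2)]
W_small = keep_every_nth_column(W, 5)
```

In a real network, whatever consumes this layer's output would need its matching dimension reduced too; how the paper handles that isn't described in this thread.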

w-m 501 days ago

If you have things organized neatly together, you can also use pre-existing compression algorithms, like JPEG, to compress your data. That's what we're doing in Self-Organizing Gaussians [0]. There we take an unorganised (noisy) set of primitives that have 59 attributes and sort them into 59 2D grids which are locally smooth. Then we use off-the-shelf image formats to store the attributes. It's an incredibly effective compression scheme, and quite simple.

[0]: https://fraunhoferhhi.github.io/Self-Organizing-Gaussians/
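
A toy illustration of why organizing helps an off-the-shelf compressor, under loud assumptions: `zlib` stands in for the image formats mentioned above, a 1D sort stands in for the 2D grid arrangement, and the byte values are random placeholders rather than real attributes. It only makes sense when element order carries no information, as with an unordered set of primitives.

```python
import random
import zlib

random.seed(0)  # fixed seed so the toy data is reproducible

# Hypothetical attribute values quantized to bytes, in arbitrary (noisy) order.
values = [random.randrange(256) for _ in range(4096)]

unorganized = bytes(values)
organized = bytes(sorted(values))  # 1D stand-in for a locally smooth 2D grid

# Same multiset of values, different arrangement, same general-purpose codec.
size_unorganized = len(zlib.compress(unorganized))
size_organized = len(zlib.compress(organized))
```

Comparing `size_organized` with `size_unorganized` shows the effect: the arranged data has long runs of similar values that a generic codec can exploit, while the noisy ordering leaves it little to work with.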