

	by keyle 159 days ago
	Does this make any sense, to anyone?

5 comments

kannanvijayan 159 days ago

I think this is an attempt to try to enrich the locality model in transformers.

One of the weird things you do in transformers is add a position vector which captures the distance between the token being attended to and some other token.

This is obviously not powerful enough to express non-linear relationships - like graph relationships.

This person seems to be experimenting with doing pre-processing of the input token set, to linearly reorder it by some other heuristic that might map more closely to the actual underlying relationship between each token.
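For reference, a minimal sketch of the "position vector" idea mentioned above, assuming the sinusoidal encoding from the original Transformer paper (one common choice among several; the paper under discussion may use something else). The function names `sinusoidal_position` and `add_positions` are made up for this illustration.

```python
import math

def sinusoidal_position(pos, dim):
    """Sinusoidal position vector: sin on even indices, cos on odd indices."""
    vec = []
    for k in range(dim):
        # Indices 2i and 2i+1 share the same frequency 1 / 10000^(2i/dim).
        angle = pos / (10000 ** ((k - k % 2) / dim))
        vec.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vec

def add_positions(embeddings):
    """Add the position vector for index `pos` to the embedding at `pos`."""
    return [
        [e + p for e, p in zip(emb, sinusoidal_position(pos, len(emb)))]
        for pos, emb in enumerate(embeddings)
    ]

# Two zero embeddings of width 4, so the output is just the position vectors.
encoded = add_positions([[0.0] * 4, [0.0] * 4])
```

The position signal here depends only on the token's index in the sequence, which is the limitation the comment is pointing at: nothing in it knows about graph-like relationships between tokens.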

thesz 158 days ago

  > like graph relationships

Once upon a time, when I was a language modeling researcher, I built and fine-tuned a big (at the time: about 5 billion parameters) Sparse Non-Negative Matrix Language Model [1].

[1] https://aclanthology.org/Q16-1024/

As this model allows for mix-and-match of various contexts, one thing that I did is to have a word-sorted context. This effectively transforms position-based context into a word-set based context, where "you and me", "me and you" and "and me you" are the same.

This allowed for longer contexts and better prediction.
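A toy sketch of the word-sorted context idea, assuming a plain count-based next-word model. This is not SNMLM itself, only an illustration of how sorting the context collapses different word orders into one key; `context_key` and `count_next_words` are hypothetical names.

```python
from collections import Counter, defaultdict

def context_key(words, order_sensitive):
    """Hashable key for a context window: positional or word-sorted."""
    return tuple(words) if order_sensitive else tuple(sorted(words))

def count_next_words(corpus, n, order_sensitive):
    """Count which word follows each n-word context in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(n, len(tokens)):
            key = context_key(tokens[i - n:i], order_sensitive)
            counts[key][tokens[i]] += 1
    return counts

corpus = ["you and me agree", "me and you agree", "and me you agree"]
positional = count_next_words(corpus, 3, order_sensitive=True)
word_set = count_next_words(corpus, 3, order_sensitive=False)
```

With positional keys the three sentences produce three separate contexts with one observation each; with sorted keys they share a single context with three observations, which is one way a longer context can stay statistically useful.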

nickpsecurity 158 days ago

I've saved it to look at it in the future. I also remembered Kristina Toutanova's name (your editor). Looking up recent publications, she's done interesting work on analyzing pretraining mixtures.

https://aclanthology.org/2025.acl-long.1564/

Thanks to you both for two interesting papers tonight. :)

thesz 158 days ago

I am not an author of the SNMLM paper. ;)

I was using their model in my work.

nickpsecurity 157 days ago

I misunderstood what you said.

Well, in your work, what benefit did you get from it? And do you think it would be beneficial today combined with modern techniques? Or is it obsoleted by other techniques?

(I ask because I'm finding many old techniques are still good or could be mixed with deep learning.)

thesz 157 days ago

At the time (2018), it had perplexity close to LSTM, while having more coefficients and much shorter (hours vs days) training time.

I tried to apply SNMLM's ideas to the byte-level prediction modeling here: https://github.com/thesz/snmlm-per-byte

It was not bad, but I had trouble scaling it to the 1B set, mostly because I did not have enough time.

I do hold the same mindset as you, that many old techniques are misunderstood or underapplied. For example, decision trees, in my experiments, allow for bit-length-per-byte comparable to LSTM (lstm-compress or LSTM in nncp experiments): https://github.com/thesz/codeta

adroniser 159 days ago

Adding the position vector is basic, sure, but it's naive to think the model doesn't develop its own positional system bootstrapping on top of the barebones one.

thesz 158 days ago

For some reason people are still adding position encodings into embeddings.

As if they are not relying on the model's ability to develop its own "positional system bootstrapping on top of the barebones one."

tuned 158 days ago

> This is obviously not powerful enough to express non-linear relationships - like graph relationships.

the distance metric used is based on energy-informed graphs that encode energy relations in a distribution called taumode, see my previous paper on spectral indexing for vector databases for a complete roll-out

liteclient 159 days ago

it makes sense architecturally

they replace dot-product attention with topology-based scalar distances derived from a laplacian embedding - that effectively reduces attention scoring to a 1D energy comparison which can save memory and compute

that said, i’d treat the results with a grain of salt given there is no peer review, and benchmarks are only on a 30M parameter model so far
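To make the "1D energy comparison" description concrete, here is a rough sketch under heavy assumptions: each token carries one scalar (called `energies` here), the score between two tokens is the negative absolute difference of those scalars, and a causal softmax over the scores weights the value vectors. This is an illustration of the idea as described in this thread, not the paper's actual formulation; `scalar_distance_attention` and `temperature` are invented for the example.

```python
import math

def scalar_distance_attention(energies, values, temperature=1.0):
    """Causal attention where scores come from scalar distances (assumed form)."""
    dim = len(values[0])
    outputs = []
    for i in range(len(energies)):
        # Causal mask: token i only attends to positions 0..i.
        scores = [-abs(energies[i] - energies[j]) / temperature
                  for j in range(i + 1)]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of value vectors.
        outputs.append([
            sum(weights[j] * values[j][d] for j in range(i + 1))
            for d in range(dim)
        ])
    return outputs

energies = [0.0, 0.0, 5.0]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = scalar_distance_attention(energies, values)
```

In this sketch the per-token "key" is a single float rather than a vector, which is where a memory saving would come from if the approach holds up. Tokens with close energies attend to each other strongly; the third token above is far from the first two, so it mostly attends to itself.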

reactordev 159 days ago

Yup, keyword here is “under the right conditions”.

This may work well for their use case but fail horribly in others without further peer review and testing.

tuned 158 days ago

no, from my point of view it is about being more domain-focused instead of going full-orthogonal.

tuned 158 days ago

right. this is a proposal that needs to be tested. I started testing it on 30M parameters, then I will move to 100M and evaluate the generation on domain-specific assisting tasks

bee_rider 159 days ago

I haven’t read the paper yet, but the graph laplacian is quite useful in reordering matrices, so it isn’t that surprising if they managed to get something out of it in ML.
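A small sketch of the classic use of the graph Laplacian for reordering: sort nodes by the Fiedler vector (the eigenvector of the second-smallest Laplacian eigenvalue). This is standard spectral ordering, not the paper's method. It assumes a connected, undirected graph, and uses shifted power iteration only to stay dependency-free; `fiedler_order` is a name made up for this example.

```python
def fiedler_order(n, edges, iters=500):
    """Order nodes of a connected undirected graph by their Fiedler vector."""
    adj = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        adj[a][b] = adj[b][a] = 1.0
    deg = [sum(row) for row in adj]
    # 2 * max degree is an upper bound on the largest Laplacian eigenvalue,
    # so (shift * I - L) is positive semidefinite.
    shift = 2.0 * max(deg)
    x = [float(i) for i in range(n)]
    for _ in range(iters):
        # Project out the constant vector (eigenvalue 0 of L).
        mean = sum(x) / n
        x = [v - mean for v in x]
        # y = (shift * I - L) x with L = D - A.
        y = [(shift - deg[i]) * x[i] + sum(adj[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return sorted(range(n), key=lambda i: x[i])

# A path graph 3-0-4-1-5-2 whose node labels are scrambled.
edges = [(3, 0), (0, 4), (4, 1), (1, 5), (5, 2)]
order = fiedler_order(6, edges)
```

For this scrambled path, sorting by the Fiedler vector recovers the path order (possibly reversed, since the sign of an eigenvector is arbitrary), so neighbours in the graph end up adjacent in the ordering. A real implementation would use a sparse eigensolver instead of power iteration.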

tuned 158 days ago

it made sense to me as it is a very simple idea I guess: causal self-attention computes QKV distances on the full vectors for Q, K and V; the topological transformer can provide the same computation using Q, scalar K and V. Instead of [N², N², N²], [N², N, N²] is used. If generation is confirmed to be on par in terms of quality, the gains are evident.

pwndByDeath 159 days ago

No, it's a new form of alchemy that turns electricity into hype. The technical jargon is more of a thieves' cant to help identify other conmen to one another.

postflopclarity 159 days ago

that's a strange way to spell "no, I didn't understand the paper"

vixen99 158 days ago

Perhaps someone who does understand the paper will kindly make it a bit clearer for those of us who get a bit lost.

Yemoshino 158 days ago

Honestly, while I would really appreciate something like this, HN is not an explanation platform.

For sure, some words or feedback on what you understood (did you get it right) etc. yeah.

But otherwise, if you do not understand a research paper, you have to do the same hard work as everyone else. Sitting down, going through it paragraph by paragraph and learning it. This takes massive time.

and for a high-level overview, ChatGPT and co are really, really good at getting papers.

Yemoshino 159 days ago

Try to get over your AI hate.

If you need help getting more out of AI, you can use ChatGPT and co to go through papers and have paragraphs explained to you ELI5-style. 3Blue1Brown also has a few great videos about transformers and how they work.

Workaccount2 159 days ago

Ideologues usually aren't great at primary source understanding/reasoning, hence why they end up with such strong opinions.