| HN Mirror


by jmward01 9 days ago

Well, I know this is possible because I have built things that work just like it is promising to do. The two key technologies needed are:

- Guided window attn. Predict where to attend, but within a fixed window. If you do this over just the tokens/vocab you can keep effectively unlimited context and perfect recall. (Yes, I can do that. There is a trick to teaching it how to predict position. This also immediately opens up other crazy things like NN memory.)

- Efficient fixed-state-size models. So not a recurrent mechanism, because that breaks training; parallelizable like transformers, but with fixed-size state instead of unbounded attn. Pick a reasonable amount of state and it is amazingly good, since it doesn't need to keep separating wheat from chaff in context. (Yes, it is possible to build this. I have. It works. This also opens up real streamed models. I have a true infinite-context streamed model I toy with locally that I am getting to be audio/text in and audio/text out in real time.)
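
For readers who want something concrete, here is a minimal sketch of one possible reading of the first bullet. It is not the unreleased method: the position predictor (the "trick") is not described anywhere in this thread, so `center` is simply passed in by hand, and the function name, the window size, and the toy data are all assumptions made for illustration. The only point it shows is that once a position is predicted, attention cost depends on the window size and not on the context length.

```python
import math
import random

def guided_window_attention(q, keys, values, center, window=4):
    """Softmax attention restricted to a fixed-size window around `center`.

    q: query vector of length d.
    keys, values: lists of n vectors of length d (the full context).
    center: position to look at. Hypothetical: a real model would predict it.
    Returns the attended vector and the (start, end) span that was read.
    """
    n, d = len(keys), len(q)
    window = min(window, n)
    # Clamp the window so it always lies inside [0, n).
    start = min(max(center - window // 2, 0), n - window)
    end = start + window
    # Scaled dot-product scores over the window only.
    scores = [sum(a * b for a, b in zip(q, keys[i])) / math.sqrt(d)
              for i in range(start, end)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    out = [sum(w * values[start + j][c] for j, w in enumerate(weights))
           for c in range(d)]
    return out, (start, end)

random.seed(0)
n, d = 1000, 8
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
q = [random.gauss(0, 1) for _ in range(d)]

out, span = guided_window_attention(q, keys, values, center=500, window=4)
print("read positions", span, "out of", n)
```

Only `window` keys and values are touched per query, so the work per token stays constant however long the context grows. Everything hard (learning where `center` should be) is outside this sketch.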

Put those together and you have O(1) token gen, infinite context and perfect recall. It is a whole new world of models. You can interact with a model until you have it at the state you want, then save its state and use that as if it were your system prompt. Batches pack perfectly, so inference is massively more efficient. Training is massively more efficient. Transformer and unlimited-attn models are a dead end. But how do you make money on this as an independent researcher? If I release the Two Weird Tricks this is all based on, I get zip and the big players get even more tech for free. If I keep it all secret, I get zip and eventually the tricks will be figured out. (Yes, a little frustration here.) If anyone wants the model architecture of the future, make me an offer :)

5 comments

regularfry 9 days ago

It's not quite true to say that if you release it you get nothing. If it's worthwhile and picked up by the open-weights labs, you get much bigger and better models implementing it than you would have had access to or been able to train otherwise, quicker than if they had to figure it out de novo.


jmward01 9 days ago

Yeah. I am about to the point of just releasing it all. I love the tech. It does amazing things. But I want to move to the next big things I can see doing with it and building the custom ops to get it to work efficiently is a pain. I am positive others would run with it and make it all way better which would free me up to do more.


EDM115 8 days ago

Well, if you ever release it, make sure to make a post so we can check it out!


in-silico 9 days ago

Neither of these strikes me as particularly groundbreaking.

The first idea (as I understand it, retrieving token ids rather than hidden states) is going to really struggle to do useful compositional reasoning and contextual recall.

The second idea has been done a million times, with Linear Attention being maybe the first modern example. Hyena, state-space models, DeltaNet, and LaCT also lie in different regions of the performance-parallelizability spectrum of fixed-size models.
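
To make the Linear Attention reference concrete, here is a minimal sketch of causal linear attention with the `elu(x) + 1` feature map. It illustrates that published technique only, not jmward01's architecture, and the function names and toy sizes are assumptions. It computes the same outputs two ways: a quadratic form that looks back over every previous token, and a recurrent form that carries only a fixed-size state `(S, z)`.

```python
import math
import random

def phi(x):
    # elu(x) + 1, a strictly positive feature map.
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_attention_recurrent(qs, ks, vs):
    """Causal linear attention carrying only a fixed-size state (S, z)."""
    d_k, d_v = len(ks[0]), len(vs[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    z = [0.0] * d_k                        # running sum of phi(k)
    outs = []
    for q, k, v in zip(qs, ks, vs):
        fq, fk = phi(q), phi(k)
        for i in range(d_k):
            z[i] += fk[i]
            for j in range(d_v):
                S[i][j] += fk[i] * v[j]
        denom = dot(fq, z)
        outs.append([sum(fq[i] * S[i][j] for i in range(d_k)) / denom
                     for j in range(d_v)])
    return outs, (S, z)

def linear_attention_quadratic(qs, ks, vs):
    """Same outputs, computed by revisiting every earlier token."""
    outs = []
    for t, q in enumerate(qs):
        fq = phi(q)
        w = [dot(fq, phi(ks[s])) for s in range(t + 1)]
        total = sum(w)
        outs.append([sum(w[s] * vs[s][j] for s in range(t + 1)) / total
                     for j in range(len(vs[0]))])
    return outs

random.seed(0)
T, d = 16, 4
qs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
ks = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
vs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]

rec, (S, z) = linear_attention_recurrent(qs, ks, vs)
quad = linear_attention_quadratic(qs, ks, vs)
max_diff = max(abs(a - b) for r, p in zip(rec, quad) for a, b in zip(r, p))
print("max difference between the two forms:", max_diff)
```

The state is `d_k * d_v + d_k` numbers regardless of how many tokens have been seen, which is what "fixed-size" means here. The trade-off the comment above alludes to is that everything the model wants to recall later has to be squeezed into that state.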


bratao 9 days ago

I'm super curious about those "Two Weird Tricks". I would like you to release more. It reminds me of the MiniMax Sparse Attention https://arxiv.org/html/2606.13392v1


jmward01 9 days ago

Yeah, looks like fun stuff. You still need to preserve the entire KV cache though, right? So even if compute is drastically less, memory keeps growing. The system I described keeps memory constant (well, if you keep the entire token history you are technically gaining one long of data per token generated, but I think we can agree that is negligible and could be capped at something high like 1B or so with no meaningful impact). I think I will probably release trick one and see if people then believe trick two even without seeing it.
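
A rough back-of-the-envelope sketch of that memory argument follows. Every number in it is an assumption chosen for illustration: the layer count, KV head count, head size, and fp16 storage describe a hypothetical transformer, the 64 MiB state size is made up, and "one long" is taken to mean an 8-byte integer token id.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    # One key and one value vector per layer, per KV head, per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

def fixed_state_bytes(tokens, state_bytes=64 * 2**20, bytes_per_token_id=8):
    # Constant model state plus the raw token history (one id per token).
    return state_bytes + bytes_per_token_id * tokens

GIB = 2**30
for tokens in (1_000, 1_000_000, 1_000_000_000):
    kv = kv_cache_bytes(tokens) / GIB
    fixed = fixed_state_bytes(tokens) / GIB
    print(f"{tokens:>13,} tokens: KV cache {kv:12.3f} GiB, "
          f"fixed state + ids {fixed:8.3f} GiB")
```

Under these assumed sizes the KV cache grows by 128 KiB per token while the token history grows by 8 bytes per token, a factor of 16,384. The token history is still linear in length, which is why capping it is mentioned above.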


jmward01 9 days ago

As a follow-up, I can see there is not a lot of belief, which is why it is also hard to find a company to partner with on this. So, how -do- you make money on something like this as an independent researcher? Maybe I release trick one and show how guided window attn (and NN memory, and probably a lot of robotics) can be trained? Thoughts? I can do that pretty quickly. By itself that is pretty great tech (combined with fixed windows of full attn it is pretty amazing). The second trick, I think, is a bit more powerful, although both are general purpose. If I do this, do you think people will believe trick two (and all the real-time multimodal streaming stuff)?


yorwba 9 days ago

Demonstrate results. If you can produce results that are somehow better than what already exists, it doesn't matter much what the actual trick is. If the way your results are better is difficult to explain without significant technical background knowledge, you might be limited to only a small pool of angel investors at first, but you only need to convince one to get funding for a better demo and intros to VCs with deeper pockets.


jmward01 8 days ago

Yeah. That is the plan I think I have settled on. I'll release something interesting here shortly, but the full architecture, including all the multimodal input/output streaming, is something I am considering my options on. I may even try to get to the 1-2B moderately well-trained model stage and host it to show how transformative cached states are compared to cached tokens.


eikenberry 9 days ago

Isn't the classic way of making money off an invention to patent it... so why not patent those "Two Weird Tricks"?


giancarlostoro 9 days ago

Expensive, and if someone figures out a slightly different way to do it you aren't really “covered”; it's not a unique umbrella. Plus you would sort of give away the secrets.
