HN Mirror

by ashirviskas 159 days ago

I wonder: what if we just crammed more into the "tokens"? I am running an experiment that replaces discrete tokens with embeddings plus a small byte encoder/decoder. That way you can use the embedding space much more efficiently and have it carry much more nuance.
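
A minimal sketch of the encoder half of this idea, assuming fixed 4-byte patches and a random lookup standing in for a learned byte encoder. `PATCH_SIZE`, `EMBED_DIM`, `WEIGHTS`, and `encode` are hypothetical names for illustration, not taken from any existing model, and the decoder is left out.

```python
import random

PATCH_SIZE = 4  # bytes folded into one embedding (hypothetical choice)
EMBED_DIM = 8   # embedding width (hypothetical choice)

rng = random.Random(0)
# One random vector per (position in patch, byte value).
# This stands in for the weights a trained byte encoder would learn.
WEIGHTS = [
    [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(256)]
    for _ in range(PATCH_SIZE)
]


def encode(text: str) -> list[list[float]]:
    """Map UTF-8 bytes to one embedding per PATCH_SIZE-byte patch."""
    data = text.encode("utf-8")
    embeddings = []
    for start in range(0, len(data), PATCH_SIZE):
        patch = data[start:start + PATCH_SIZE]
        vec = [0.0] * EMBED_DIM
        # Sum the position-specific vector of every byte in the patch.
        for pos, byte in enumerate(patch):
            vec = [v + w for v, w in zip(vec, WEIGHTS[pos][byte])]
        embeddings.append(vec)
    return embeddings


embs = encode("def add(a, b): return a + b")
print(len(embs), "embeddings of width", len(embs[0]))
```

There is no vocabulary here: any byte sequence gets an embedding, and the sequence the model sees is shorter than the byte sequence by roughly a factor of `PATCH_SIZE`.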

Experiments I want to build on top of it:

1. Adding LSP context to the embeddings. That way the model could _see_ the syntax better, closer to how we use IDEs, and would not need to read/grep 25k lines just to find where something is used.
2. Experiments with different "compression" ratios. Each embedding could encode a different number of bytes and we would not rely on a huge static token dictionary.
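
A rough sketch of what different "compression" ratios could mean in practice, assuming one embedding per patch of bytes. The two patching rules and all function names below are hypothetical illustrations, not a description of the actual experiment.

```python
def fixed_patches(data: bytes, size: int) -> list[bytes]:
    """Cut the byte stream into patches of exactly `size` bytes (last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def boundary_patches(data: bytes, max_size: int) -> list[bytes]:
    """Cut after each whitespace byte, or when a patch reaches `max_size` bytes."""
    patches = []
    current = bytearray()
    for byte in data:
        current.append(byte)
        if byte in b" \t\n\r" or len(current) >= max_size:
            patches.append(bytes(current))
            current = bytearray()
    if current:
        patches.append(bytes(current))
    return patches


def bytes_per_embedding(patches: list[bytes]) -> float:
    """Average number of bytes each embedding would have to encode."""
    if not patches:
        return 0.0
    return sum(len(p) for p in patches) / len(patches)


src = b"total = compute_total(items)\n"
for name, patches in [
    ("fixed, 4 bytes", fixed_patches(src, 4)),
    ("whitespace, max 16 bytes", boundary_patches(src, 16)),
]:
    print(name, len(patches), "embeddings,", bytes_per_embedding(patches), "bytes each")
```

With the second rule each embedding covers a different number of bytes, which is the variable-ratio case from item 2.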

I'm aware that papers exist that explore these ideas, but so far no popular/good open source models employ this. Unless someone can prove me wrong.

4 comments

Yemoshino 159 days ago

I found a few papers in this direction with perplexity like this one https://ceur-ws.org/Vol-4005/paper1.pdf and it doesn't seem to be that relevant for now.

The progress of a handful of models seems to be so much better (because of limited compute we have only a handful of big ones, I presume) that these fine-tunings are just not yet relevant.

I'm also curious what an English + Java + HTML + CSS + JavaScript only model would look like in size and speed, for example.

Unfortunately, whenever I ask myself the question of fine-tuning tokens (it came up again just a few days ago), deep diving takes too much time.

Claude only got LSP support in November, I think, and it's not even clear to me to what extent. So despite the feeling that we are moving fast, tons of basic ideas haven't even made it in yet.

tuned 158 days ago

If you have a corpus of code snippets to train the manifold (Laplacian) on, and a good embedding model, it is definitely possible to try something like this.

stephantul 158 days ago

There are many examples of noisily encoding a large embedding vocabulary. This sounds a bit like T-FREE or H-Net? Or BLT?

One of the main issues with lines of work around this is that you end up trading embedding parameters for active parameters. This is rarely a good trade-off for the sake of compute.
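
One way to read that trade-off, as back-of-the-envelope arithmetic. The vocabulary size, model width, and encoder depth are made-up numbers chosen for illustration, and the per-layer count is the usual 12·d² estimate for a transformer layer (4·d² for attention, 8·d² for a 4x feed-forward block, ignoring biases and norms).

```python
def embedding_table_params(vocab_size: int, d_model: int) -> int:
    """Parameters stored in a token embedding table."""
    return vocab_size * d_model


def encoder_layer_params(d_model: int, ffn_mult: int = 4) -> int:
    """Approximate weights in one transformer layer, ignoring biases and norms."""
    attention = 4 * d_model * d_model           # Q, K, V, and output projections
    feed_forward = 2 * ffn_mult * d_model * d_model  # up and down projections
    return attention + feed_forward


VOCAB_SIZE = 50_000  # hypothetical
D_MODEL = 1024       # hypothetical
ENCODER_LAYERS = 2   # hypothetical small byte encoder

table = embedding_table_params(VOCAB_SIZE, D_MODEL)
encoder = ENCODER_LAYERS * encoder_layer_params(D_MODEL)

# A table lookup reads one row per token; an encoder multiplies through
# all of its weights at every input position.
print(f"embedding table: {table:,} params stored, {D_MODEL:,} read per token")
print(f"byte encoder:    {encoder:,} params stored, {encoder:,} used per position")
```

The table is larger on disk but almost free at run time, while the encoder's weights are all active for every position, and a byte-level encoder also sees more positions than a token-level model would.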

nl 158 days ago

Isn't this just an awkward way of adding an extra layer to the NN, except without end-to-end training?

Models like Stable Diffusion sort of do a similar thing using CLIP embeddings. It works, and it's an easy way to benefit from the pre-training CLIP has. But for a language model it would seemingly make more sense to just add the extra layer.

ashirviskas 158 days ago

I mean this is exactly what it is. Just a wrapper to replace the tokenizer. That is exactly how LLMs can read images.

I'm just focusing on different parts.

appplication 159 days ago

Not an expert in the space, but I’m not sure you need to modify tokens to get the model to see syntax. You basically get that exact association from attention.

ashirviskas 158 days ago

You get the association that is relevant to your project only if you can cram the whole codebase into context. Otherwise the model is making rough estimates, and some of the time that seems to be where the models fail.

It can only be fully resolved with either infinite context length or by doing it similarly to how humans do it: adding some LSP "color" to the code tokens.
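
A toy illustration of what "color" on code tokens could look like. It uses Python's standard `tokenize` module as a stand-in for a real language server, so the kinds below (`keyword`, `definition`, `reference`, and so on) are a hypothetical, much coarser substitute for what an LSP server reports, and `color_tokens` is an illustrative name.

```python
import io
import keyword
import tokenize


def color_tokens(source: str) -> list[tuple[str, str]]:
    """Pair each token with a coarse kind, as a stand-in for LSP semantic tokens."""
    colored = []
    prev = None  # text of the previous kept token
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            if keyword.iskeyword(tok.string):
                kind = "keyword"
            elif prev in ("def", "class"):
                kind = "definition"
            else:
                kind = "reference"
        elif tok.type == tokenize.OP:
            kind = "operator"
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            kind = "literal"
        else:
            continue  # skip layout tokens such as NEWLINE, INDENT, DEDENT
        colored.append((tok.string, kind))
        prev = tok.string
    return colored


for text, kind in color_tokens("def area(r):\n    return 3.14 * r * r\n"):
    print(f"{text!r:10} {kind}")
```

In the scheme discussed here, the kind would be mixed into the token's embedding rather than printed, and a real language server could also say where a referenced symbol is defined, which is the part that saves the grep.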

You can get a feel for what LLMs deal with when you open 3000 lines of code in a simple text editor and try to do something. It may work for simple fixes, but not for whole-codebase refactors. Only ultra skilled humans can be productive in it (using my subjective definition of "productive").
