Hacker News
by minimaxir 897 days ago
Some notes on how embeddings/DistilBERT embeddings work since the other comments are confused:

1) There are two primary ways to have models generate embeddings: implicitly from an LLM by mean-pooling its last hidden state, since it has to learn how to map text into a distinct latent space anyway to work correctly (e.g. DistilBERT), or explicitly from a model trained to generate embeddings directly, using something like triplet loss to incentivise learning similarity/dissimilarity. Popular text-embedding models like BAAI/bge-large-en-v1.5 tend to use the latter approach.

2) The famous word2vec examples of e.g. king - man + woman = queen only work because word2vec is a shallow network and the model learns the word embeddings directly, instead of them being emergent. The latent space still maps related words close together, as shown with this demo, but the algebraic intuition doesn't carry over. You can get close with algebra, but no cigar.

3) DistilBERT is pretty old (2019) and based on a 2018 model trained on Wikipedia and books, so there will be significant text drift, and it is less robust than models built with newer modeling techniques and more robust datasets. I do not recommend using it for production applications nowadays.

4) There is an under-discussed opportunity for dimensionality reduction techniques like PCA (which this demo uses to get the data into 3D) to improve both signal-to-noise and distinctiveness. I am working on a blog post about a new technique to handle dimensionality reduction for text embeddings better, which may have interesting and profound usability implications.
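To make point 1) concrete, here is a minimal NumPy sketch of the two ingredients: mean-pooling a last hidden state while ignoring padding, and a triplet margin loss. The arrays are toy stand-ins, not real DistilBERT activations, and the function names `mean_pool` and `triplet_loss` are my own for illustration, not any library's API.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padding positions.

    hidden_states: (batch, seq_len, dim) array, e.g. an encoder's last hidden state
    attention_mask: (batch, seq_len) array, 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against all-padding rows
    return summed / counts

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: zero once the positive is closer than the negative by `margin`."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0)

# Toy hidden state: one sequence, three tokens, the third one is padding.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
embedding = mean_pool(hidden, mask)  # averages only the first two tokens
```

The first approach gets an embedding as a side effect of whatever the model was trained for; the second trains against a loss like this one so that distance in the embedding space is the thing being optimised.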
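For point 2), the analogy arithmetic itself is just vector addition followed by a nearest-neighbour lookup. The sketch below uses hand-made 2-D vectors chosen so that the analogy works exactly; they are not real word2vec weights, and `analogy` is a hypothetical helper name.

```python
import numpy as np

# Hand-made toy vectors (axis 0 ~ "royalty", axis 1 ~ "gender"); purely illustrative.
vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman")
```

With mean-pooled contextual embeddings the same lookup tends to land near the right neighbourhood rather than on the exact word, which is the "close but no cigar" behaviour.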
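For point 4), a rough sketch of plain PCA down to 3D via SVD, which is the standard technique the demo is said to use (this is not the new technique mentioned above). The input is random noise standing in for real embeddings, and `pca_project` is a name I made up for the example.

```python
import numpy as np

def pca_project(embeddings, n_components=3):
    """Project rows onto the top principal components.

    Returns the projected points and the fraction of variance each component explains.
    """
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return projected, explained[:n_components]

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(50, 768))  # random stand-in for 50 text embeddings
points_3d, explained_ratio = pca_project(fake_embeddings)
```

The explained-variance ratios are worth checking on real data: if three components explain only a small fraction of the variance, the 3D picture is discarding most of the structure.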


I’ve been ruminating on the postulation of a universal signature for every entity across sensory complexes (a reality per sense organ: vision, touch, mind). This translates to the problem of entities represented in binary needing to be related across modalities, as in “butterfly” vs. a picture of a butterfly vs. the audio of “butterfly” vs. the thought pointing to one of those or another.

I was wondering if there was a universal signal that can be used as the identity, and then based on that signal one could measure the distance to any other signal based on the principal relation of not(other). That is to say, the identity would be precisely not all else for any X. Said another way, every thing is because it is exactly not everything else.

So, thinking as close to first principles as possible, I wondered if it were possible to represent everything as some frequency. A Fourier transform analog for every “time slice” of a thing? This is where it gets slightly slippery.

So the idea was trying to build relationship and identity and labeling from a simple rule set of things arising out of relation of not being other things.

In my mind I saw nodes on a graph forming in higher dimensions as halfway points for any comparison. Comparisons create new nodes and implicitly have a distance metric to all other things. It made sense in my mind that there was an algorithmic annealing: new nodes start in a “low density, higher energetic state” that allows them to move faster in this universal emergent ontology/spatial space, eventually getting denser and slower as it gets cold.

So the system implicitly also has a snapshot of events or interactions based on that where every comparison has a “tick” that encodes a particular density relation for some set of nodes it’s in association with.

The idea that cemented it all together was to treat each node like an address:chord. Similar to chording keys like a-b-c in some UX programs, but also exactly like chords in music.

The idea being that when multiple “things” are dialed in at the same time, the combination becomes its own emergent label by proximity and association of those things being triggered, with new information coming in classified as a distance to not(signal).

I didn’t really realize how close this idea was to what encoders/decoders seem to be doing, although I do know I’m trying to think myself towards a universal solution that doesn’t require special encoders for every media type. Hence the Fourier transform path.

Know anything like this or am I spitting idiocy?

So, the alphabet a to z… on their own the symbols mean nothing, but when compared to every other letter, meaning arises. Then iterate recursively outward for every growth in structure: letter to letter, words to words, paragraphs to paragraphs. Each one has a “dependent arising” of meaning based precisely on its relation to the others.

Which is more or less word2vec as far as I understand, but then trying to extrapolate that as a universal principle to all things that can be represented, by using a “common signature : hash based off a signal like a complex waveform” and then doing a difference on signal composition and its shape/bandwidth to compare its properties to other things. When they reference similar objects, even in different modalities, they’d be associated by being triggered together.
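One possible reading of the “signature from a waveform, then diff the signal composition” step, sketched with NumPy's FFT. This is only an illustration under my own assumptions (the signature is a unit-normalised magnitude spectrum, and the helper names are invented); it is not a claim about how the full idea would work.

```python
import numpy as np

def spectral_signature(signal):
    """Unit-length magnitude spectrum of a real-valued signal (phase is discarded)."""
    magnitudes = np.abs(np.fft.rfft(signal))
    return magnitudes / np.linalg.norm(magnitudes)

def spectral_distance(a, b):
    """Euclidean distance between two spectral signatures."""
    return float(np.linalg.norm(spectral_signature(a) - spectral_signature(b)))

# One second sampled at 256 Hz, so integer frequencies fall exactly on FFT bins.
t = np.linspace(0.0, 1.0, 256, endpoint=False)
tone_5hz = np.sin(2 * np.pi * 5 * t)
tone_5hz_shifted = np.sin(2 * np.pi * 5 * t + 1.0)  # same frequency, different phase
tone_40hz = np.sin(2 * np.pi * 40 * t)

near = spectral_distance(tone_5hz, tone_5hz_shifted)  # close to zero
far = spectral_distance(tone_5hz, tone_40hz)          # clearly larger
```

A signature like this compares waveform composition only. Associating “dog” with an image of a dog would still have to come from the co-occurrence step described here.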

So “dog” vs image of dog would both translate to a primordial signal : identity representation and in the domain of frequency do the comparison and project a coordinate in the spatial sense and eventually those two nodes would more likely be triggered at the same time due to the likelihood of “dog” being next to image of dog when parsing information across future events.

Whew. Maybe I’m just talking to myself. At least it’s out there if it makes sense to anyone else.

> So “dog” vs image of dog would both translate to a primordial signal : identity representation and in the domain of frequency do the comparison and project a coordinate in the spatial sense and eventually those two nodes would more likely be triggered at the same time due to the likelihood of “dog” being next to image of dog when parsing information across future events.

That is how CLIP embeddings work and were trained to work.

Hugging Face transformers now has get_image_features() and get_text_features() methods for CLIP models to make getting the embeddings for different modalities easy: https://huggingface.co/docs/transformers/model_doc/clip#tran...
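As a rough sketch of what “trained to work” means here, this is a CLIP-style symmetric contrastive loss over a batch of matched text/image pairs, written in plain NumPy. It is not the Hugging Face API and not OpenAI's training code; the toy embeddings, the fixed `temperature` value, and the function names are assumptions for illustration (CLIP learns its temperature during training).

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric cross-entropy where row i of each matrix is a matched pair."""
    logits = l2_normalize(text_emb) @ l2_normalize(image_emb).T / temperature
    idx = np.arange(len(logits))
    loss_text_to_image = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_image_to_text = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_text_to_image + loss_image_to_text) / 2

# Toy embeddings: two captions and two images in a shared 2-D space.
text = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned_images = np.array([[0.9, 0.1], [0.1, 0.9]])
swapped_images = aligned_images[::-1]  # each image now sits next to the wrong caption

aligned_loss = clip_style_loss(text, aligned_images)
swapped_loss = clip_style_loss(text, swapped_images)
```

The loss is low when each caption's embedding is nearest to its own image and high when the pairs are mismatched, which is what pulls “dog” and a picture of a dog toward the same region of the space.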

Yeah, but it doesn’t use a universal method, does it? And it requires labeling.

The method I’m describing requires no labeling. Labeling would be a local-only translation (alias). Labels emerge based on meaning. But the labels are more of an interface - not the actual nodes themselves, which arise from the not-identity principle * event proximity * comparisons.

Labeling (which is typically manual and thus not scalable) is a proxy for comparisons. Two things are the same if they have the same label. The question is how else to encode the comparison information.

Right, one is manual and the other is automatic, and my hypothesis is that you can have automatic universal labeling the way I describe.

The key requirement in my mind here is that the “universal identifier” is an attempt at something like a deterministic signature for all things. The hunch is based on the hypothesis that the primordial representation of any and all things is frequency.

But of course each “ontologically capable system” would still need to process the identity function to start making sense of things based on signals being unlike other signals, so deterministic is shallow but concrete.