| HN Mirror

They are incomparable. Token embeddings generated with something like word2vec worked well because the networks are shallow and therefore the learned semantic data can be contained solely and independently within the embeddings themselves. Token embeddings as a part of an LLM (e.g. gpt-oss-20b) are conditioned on said LLM and do not have fully independent learned data, although as shown here there still can be some relationships preserved.

Embeddings derived from autoregressive language models apply full attention mechanisms to get something different entirely.