by _akhe 795 days ago
> We now have a method of embedding a variable length piece of text into a fixed size vector

Question: Is it a rule that the embedding vector must be higher-dimensional than the source text, e.g. 1 token -> a vector of length 1000+? I ask because the mechanism seems like it would lose value if I sent in a 1000-character string and got back only, say, a 4-length embedding. Four metrics/features can't possibly describe such a complex statement, so I assumed the dimensionality of the embedding had to be higher than that of the source.

1 comment

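No. The number of characters in a word has nothing to do with the dimensionality of that word’s embedding. The dimensionality is a fixed property of the embedding model, so every input maps to a vector of the same length no matter how long the text is.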

GPT-4 should be able to explain why.
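
To make the point concrete, here is a minimal toy sketch. It is not how a real embedding model works (those use learned neural networks); it just hashes character trigrams into a fixed number of buckets. The function name `toy_embed` and the default `dim=8` are made up for illustration. What it shows is only that the output length is set by `dim`, not by the length of the input.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Toy embedding: hash character trigrams into a vector of length `dim`.

    Hypothetical illustration only. Real models learn their mapping,
    but they share this property: the output size is fixed by the model.
    """
    vec = [0.0] * dim
    padded = f"  {text}  "  # pad so that very short inputs still yield trigrams
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        digest = hashlib.md5(trigram.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim  # which slot to update
        sign = 1.0 if digest[4] % 2 == 0 else -1.0        # signed hashing trick
        vec[bucket] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    # L2-normalize unless every bucket happened to cancel out to zero
    return [x / norm for x in vec] if norm > 0 else vec


short_vec = toy_embed("cat")
long_vec = toy_embed("a much longer piece of text " * 50)
print(len(short_vec), len(long_vec))  # 8 8
```

A 3-character string and a roughly 1400-character string both come out as 8 numbers. Whether 4 or 8 or 1000+ dimensions is enough to capture the meaning is a separate question about model quality, not a rule tied to input length.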