Hacker News
by optimalsolver 971 days ago
How do you choose the size of the embedding vector?

Can this be used to compress non-text sequences such as byte strings?

2 comments

1. Usually it's a multi-way tradeoff between how much compute you want to spend, how much time you have, how much training data you have available (and want to use), and how accurate you want the embeddings to be.

2. Yes, but lossily. Some types of byte strings can tolerate a couple of accidentally changed bits; others cannot tolerate that at all without being hopelessly corrupted. This technique is not a magic card to surpass the limits imposed by information theory; it's "just" a more sophisticated dictionary for your compression algorithm.
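To make both points a bit more concrete, here is a rough sketch, not anything from a real pipeline. It assumes NumPy, uses made-up random byte strings, and swaps in a truncated SVD as a stand-in for a learned linear embedding. The function name `reconstruction_error` and the sizes tried are purely illustrative.

```python
import numpy as np

def reconstruction_error(data: np.ndarray, k: int) -> float:
    """Mean squared error after squeezing each row through k dimensions."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Truncated SVD gives the best rank-k linear encoder/decoder pair
    # for this data in the least-squares sense.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                  # shape (k, n_features)
    codes = centered @ basis.T      # the k-dimensional "embeddings"
    restored = codes @ basis + mean
    return float(np.mean((data - restored) ** 2))

rng = np.random.default_rng(0)
# Hypothetical data: 200 byte strings of length 16, scaled to [0, 1].
byte_strings = rng.integers(0, 256, size=(200, 16)).astype(np.float64) / 255.0

errors = {k: reconstruction_error(byte_strings, k) for k in (2, 4, 8, 16)}
for k, err in errors.items():
    print(f"embedding size {k:2d}: reconstruction MSE {err:.6f}")
```

Because the bytes here are random, there is no structure to exploit: the error shrinks as the embedding grows and only vanishes once the embedding is as large as the input. Data with real regularities would reach a low error at a much smaller size, which is the tradeoff in point 1.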

Regarding the second question: yes, as long as you can train a machine learning model to learn the semantics of the data. The keyword to look into is "autoencoders".
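For a feel of what an autoencoder does with bytes, here is a toy sketch in plain NumPy. Everything in it is an assumption for illustration: the payload, the 3-unit bottleneck, the learning rate, and the step count. A real setup would use a proper framework and far more data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical payload: each byte becomes a row of 8 bits.
payload = b"hello autoencoder, hello embeddings"
bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
bits = bits.reshape(-1, 8).astype(np.float64)

rng = np.random.default_rng(0)
hidden = 3  # bottleneck size (the "embedding" dimension)
w_enc = rng.normal(0.0, 0.5, size=(8, hidden))
b_enc = np.zeros(hidden)
w_dec = rng.normal(0.0, 0.5, size=(hidden, 8))
b_dec = np.zeros(8)

def forward(x):
    code = np.tanh(x @ w_enc + b_enc)        # encoder: 8 bits -> 3 floats
    recon = sigmoid(code @ w_dec + b_dec)    # decoder: 3 floats -> 8 bit probabilities
    return code, recon

def bce(x, recon, eps=1e-12):
    # Mean binary cross-entropy over every bit.
    return float(-np.mean(x * np.log(recon + eps) + (1 - x) * np.log(1 - recon + eps)))

initial_loss = bce(bits, forward(bits)[1])

lr = 0.5
for _ in range(5000):
    code, recon = forward(bits)
    grad_logits = (recon - bits) / bits.size      # gradient of mean BCE w.r.t. decoder logits
    grad_code = grad_logits @ w_dec.T
    grad_pre = grad_code * (1.0 - code ** 2)      # back through tanh
    w_dec -= lr * (code.T @ grad_logits)
    b_dec -= lr * grad_logits.sum(axis=0)
    w_enc -= lr * (bits.T @ grad_pre)
    b_enc -= lr * grad_pre.sum(axis=0)

code, recon = forward(bits)
final_loss = bce(bits, recon)

# Round the bit probabilities and pack them back into bytes.
restored = np.packbits((recon > 0.5).astype(np.uint8), axis=1).ravel().tobytes()
wrong = sum(a != b for a, b in zip(payload, restored))
print(f"loss {initial_loss:.3f} -> {final_loss:.3f}, {wrong} of {len(payload)} bytes differ")
```

The restored bytes are not guaranteed to match the input. How many differ depends on the bottleneck size and how well training went, which is exactly the lossy behaviour to expect from this approach.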