by zeroxfe 2293 days ago
> The reverse is true, embeddings are both the performance and memory-footprint bottleneck of modern NN models.

They may be a bottleneck, but the alternative is worse -- you can't fit complex models with large vocabularies into GPU memory using sparse one-hot encodings.

Surely you mean dense one-hot?

Technically, the sparse one-hot encoding is the most efficient in terms of memory footprint. You simply store the non-zero coordinates.
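To put rough numbers on that, here is a minimal sketch in plain Python. The vocabulary size and token ids are made up for illustration, and the byte counts cover only the stored elements, not Python object overhead.

```python
from array import array

VOCAB_SIZE = 50_000              # hypothetical vocabulary size
TOKEN_IDS = [17, 4242, 49_999]   # hypothetical token indices


def dense_one_hot(token_id, vocab_size):
    """Dense encoding: one single-precision float slot per vocabulary entry."""
    vec = array("f", [0.0]) * vocab_size
    vec[token_id] = 1.0
    return vec


def sparse_one_hot(token_id):
    """Sparse encoding: only the coordinate of the single non-zero entry."""
    return array("l", [token_id])


def payload_bytes(arr):
    """Bytes used by the stored elements, ignoring Python object overhead."""
    return len(arr) * arr.itemsize


dense_bytes = sum(payload_bytes(dense_one_hot(t, VOCAB_SIZE)) for t in TOKEN_IDS)
sparse_bytes = sum(payload_bytes(sparse_one_hot(t)) for t in TOKEN_IDS)

print(f"dense:  {dense_bytes} bytes")
print(f"sparse: {sparse_bytes} bytes")
```

The dense cost grows with the vocabulary size per token, while the sparse cost is one index per token regardless of vocabulary size.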

The problem in practice for GPUs is that sparse vector/matrix operations are too inefficient.

The whole point of something like this paper is to skip the entire 'densification' step and to deal with the sparse matrix input directly as a sparse matrix. The LSH used in this paper improves on using SpMSpV directly, since SpMSpV is also inefficient on CPUs, although to a lesser extent than on GPUs.
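For anyone unfamiliar with SpMSpV (sparse matrix times sparse vector), a rough sketch of the idea follows. This is not the paper's LSH method, only the baseline operation it is compared against, and the column-dictionary storage layout and the example values are assumptions chosen for readability.

```python
def spmspv(columns, x):
    """Multiply a sparse matrix by a sparse vector.

    columns: {col_index: {row_index: value}} holding only non-zero entries.
    x:       {index: value} holding only non-zero entries.
    Returns the result as {row_index: value}.
    """
    y = {}
    # Only columns matching a non-zero entry of x contribute to the result.
    for j, x_j in x.items():
        for i, a_ij in columns.get(j, {}).items():
            y[i] = y.get(i, 0.0) + a_ij * x_j
    return y


# Hypothetical 3x4 matrix with four non-zero entries.
columns = {
    0: {0: 2.0},
    2: {1: 1.0, 2: 3.0},
    3: {0: -1.0},
}
x = {2: 2.0, 3: 1.0}  # hypothetical sparse input vector

y = spmspv(columns, x)
print(y)
```

The work done is proportional to the non-zeros actually touched, but the access pattern is irregular and data-dependent, which is the kind of workload GPUs handle poorly.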

No, you can successfully fit complex models if you use byte-pair or similar encodings (Morfessor [1] comes to mind).

[1] https://morfessor.readthedocs.io/en/latest/

You will also get much more meaningful embeddings by summing the embeddings of the parts of a word.
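A minimal sketch of what that summing looks like. The subword table, its 3-dimensional vectors, and the segmentation of "unbreakable" are all made up for illustration; in practice the pieces would come from a byte-pair or Morfessor-style segmenter and the vectors would be learned.

```python
# Hypothetical subword embedding table (3 dimensions for readability).
SUBWORD_EMBEDDINGS = {
    "un": [0.5, -1.0, 0.0],
    "break": [1.0, 2.0, -0.5],
    "able": [-0.5, 0.5, 1.5],
}


def embed_word(pieces, table):
    """Return the element-wise sum of the embeddings of the given pieces."""
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for piece in pieces:
        for i, value in enumerate(table[piece]):
            total[i] += value
    return total


# Assumed segmentation of "unbreakable"; a real segmenter would produce this.
word_vector = embed_word(["un", "break", "able"], SUBWORD_EMBEDDINGS)
print(word_vector)
```

The table needs one row per subword rather than one per full word, so it stays small even when the word-level vocabulary is very large.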