by thisiszilff
789 days ago
Eh, I disagree. When I began working in ML, everything was about word2vec and GloVe, and the state of the art for embedding documents was adding together all the word embeddings. It made no sense to me, but it worked. Learning about BoW and simple ways of converting text to fixed-length vectors that can be used in ML algos clarified a whole lot for me, especially the fact that embeddings aren't magic: they are just a way to convert text to a fixed-length vector. BoW and tf-idf vectors are still workhorses for routine text classification tasks despite their limitations, so they aren't really a dead end. Similarly, a lot of things that follow BoW make a whole lot more sense if you think of them as addressing limitations of BoW.
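For anyone who hasn't seen the two side by side, here's a rough sketch of what "convert text to a fixed-length vector" means in each case. Everything in it is made up for illustration: the four-word vocabulary, the 3-d `TOY_EMBEDDINGS` values (standing in for real word2vec/GloVe vectors), and the helper names `bow_vector` and `summed_embedding`. It only shows that both routes give a vector whose length doesn't depend on the document length, not that the two vectors mean the same thing.

```python
from collections import Counter

# Hypothetical toy vocabulary and made-up 3-d word vectors,
# standing in for real word2vec/GloVe embeddings.
VOCAB = ["cat", "dog", "sat", "mat"]
DIM = 3
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "sat": [0.0, 0.7, 0.3],
    "mat": [0.1, 0.0, 0.9],
}


def tokenize(text):
    # Naive lowercase + whitespace split; real pipelines do more.
    return text.lower().split()


def bow_vector(text):
    # One slot per vocabulary word, holding that word's count.
    counts = Counter(tokenize(text))
    return [counts[word] for word in VOCAB]


def summed_embedding(text):
    # Add up the vectors of known words; unknown words are skipped.
    total = [0.0] * DIM
    for token in tokenize(text):
        vec = TOY_EMBEDDINGS.get(token)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


doc = "the cat sat on the mat"
print(bow_vector(doc))        # always len(VOCAB) entries
print(summed_embedding(doc))  # always DIM entries
```

The BoW vector grows with the vocabulary, while the summed embedding stays at whatever dimension the word vectors have.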
|
The operation of adding BoW vectors together has nothing to do with the operation of adding together word embeddings. Well, aside from both nominally being addition.
It's like saying you understand what's happening because you can add velocity vectors, and then going on to add the binary vectors that represent two binary programs and expecting the result to be a program with the average behavior of both. Obviously that doesn't happen; you get a nonsense binary.
They may both be arrays of numbers but mathematically there's no relationship between the two. Thinking that there's a relationship between them leads to countless nonsense conclusions: the idea that you can keep adding word embeddings to create document embeddings like you keep adding BoWs, the notion that average BoWs mean the same thing as average word embeddings, the notion that normalizing BoWs is the same as normalizing word embeddings and will lead to the same kind of search results, etc. The errors you get with BoWs are totally different from the errors you get with word or sentence or document embeddings. And how you fix those errors is totally different.
No. Nothing at all about word embeddings makes sense from the point of view of BoW.
Also, yes, BoW is a total dead end. It has been completely supplanted. There's never any case where someone should use it.