by mlucy
2711 days ago
Interesting. It took me a while to figure out what the main contribution here is, since doing dimensionality reduction on embeddings is fairly common. I think the main contribution is an empirical measure of A) how little is lost by reducing the dimensionality of the embeddings, and B) how much the dimensionality-reduced embeddings benefit from being fine-tuned on the final dataset. It looks like fine-tuning the lower-dimensionality embeddings helps a lot (1% degradation vs. 2% for the offline dimensionality reduction according to the paper). This makes me pretty happy, since I run a startup that trains and hosts embeddings (https://www.basilica.ai), and we made the decision to err on the side of providing larger embeddings (up to 4096 dimensions). We made this decision based on the efficacy of offline dimensionality reduction, so if fine-tuned dimensionality reduction works even better, that's great news.
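For readers unfamiliar with what "offline dimensionality reduction" can look like in practice, here is a minimal sketch using PCA via NumPy's SVD. PCA is my assumption for illustration; the comment above does not say which reduction method the paper or the startup uses. The data is synthetic, and `pca_reduce` is a hypothetical helper name.

```python
import numpy as np

def pca_reduce(embeddings, k):
    """Project row-vector embeddings onto their top-k principal components."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]
    return centered @ components.T, components, mean

# Synthetic stand-in for real embeddings (an assumption for this sketch):
# 200 vectors in 64 dimensions that lie close to an 8-dimensional subspace.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 8))
mixing = rng.normal(size=(8, 64))
embeddings = latent @ mixing + 0.01 * rng.normal(size=(200, 64))

reduced, components, mean = pca_reduce(embeddings, k=8)

# Map back to 64 dimensions to see how much information the reduction lost.
reconstructed = reduced @ components + mean
relative_error = np.linalg.norm(embeddings - reconstructed) / np.linalg.norm(embeddings)
print(reduced.shape, relative_error)
```

The reduction here is fit once and then frozen, which is what makes it "offline"; the fine-tuned variant discussed above would instead keep adjusting the reduced vectors while training on the final task.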
|
I have an idea regarding word embeddings that I'd like to try out, and I am looking for either a minimalist or a very clearly documented library for word embeddings. By minimalist I don't mean simplest to use, but simplest in terms of getting acquainted with the underlying codebase.
I have no interest in becoming a proficient user of a commercially relevant library, since machine learning is not how I earn money. But I can imagine that one of the more theoretical machine learning courses has a codebase for students to tinker with, i.e. one where the educational facet of how the word embedding algorithms work is prioritized over commercial relevance and computational efficiency. Do you know of any such minimal implementations?
The idea concerns objective bias removal (potentially at the cost of loss of factual knowledge), so if it works it may be interesting outside of word embeddings as well...
EDIT: The only functional requirement I have of the codebase is that, given a corpus, it can generate an embedding such that the classic "king" + "woman" - "man" ≈ "queen" is satisfied. The only computational requirement is that it can generate this embedding on a single-core laptop within a day for a corpus of roughly a billion words. Any additional functionality would be overcomplicating things for my purposes. So I am looking for the simplest (well-documented, preferably in paper form) codebase that achieves this.
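To make the functional requirement concrete, here is a minimal sketch of the analogy check itself, in plain Python. The three-dimensional vectors are hand-made toy values (an assumption for illustration, not trained embeddings), and `analogy` is a hypothetical helper name; a real codebase would run the same test against vectors learned from the corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, a, b, c):
    """Return the word nearest to vectors[a] + vectors[b] - vectors[c].

    The three query words are excluded from the candidates, as is
    conventional for this test.
    """
    target = [x + y - z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy vectors with hand-picked axes: [royalty, male, female].
toy = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.0, 0.2, 0.2],
}

result = analogy(toy, "king", "woman", "man")
print(result)
```

Excluding the query words matters: with real embeddings, the nearest neighbour of the combined vector is often one of the inputs, so the test is usually stated as "nearest word other than the three given".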