| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by mlucy 2711 days ago

Interesting. It took me a while to figure out what the main contribution here is, since doing dimensionality reduction on embeddings is fairly common. I think the main contribution is an empirical measure of A) how little is lost by reducing the dimensionality of the embeddings, and B) how much the dimensionality-reduced embeddings benefit from being fine-tuned on the final dataset.

It looks like fine-tuning the lower-dimensionality embeddings helps a lot (1% degradation vs. 2% for the offline dimensionality reduction according to the paper).

This makes me pretty happy, since I run a startup that trains and hosts embeddings (https://www.basilica.ai), and we made the decision to err on the side of providing larger embeddings (up to 4096 dimensions). We made this decision based on the efficacy of offline dimensionality reduction, so if fine-tuned dimensionality reduction works even better, that's great news.

1 comments

DoctorOetker 2710 days ago

my background is physics, but I have been reading and following AI papers for many years, without actually applying this knowledge.

I have an idea I wish to try out regarding word embeddings, and I am looking for either a minimalist or a very clearly documented library for word embeddings. With minimalist I do not refer to simplest in usage, but simplest in getting acquainted with the underlying codebase.

I have no interest in becoming a proficient user of a commercially relevant library, since machine learning is not how I earn money. But I can imagine one of the more theoretical courses on machine learning has a code base for students to tinker with. i.e. where the educational facet of how the word embedding algorithms work is somewhat prioritized over the commercial relevance and computational efficiency facets. Do you know of any such minimal implementations?

The idea concerns objective bias removal (potentially at the cost of loss of factual knowledge), so if it works it may be interesting outside of word embeddings as well...

EDIT: the only functional requirement I have of the codebase is that given a corpus, it can generate an embedding such that the classic "king" + "woman" - "man" = ~ "queen" is satisfied. the only computational requirement I have regarding computational efficiency is that it can generate this embedding on a single core laptop within a day for a roughly billion word corpus. any additional functionality is for my purpouses overcomplicating. So I am looking for the simplest (well-documented, preferably in paper form) codebase which achieves this.

link

speedplane 2710 days ago

>I have an idea I wish to try out regarding word embeddings, and I am looking for either a minimalist or a very clearly documented library for word embeddings.

Using AI on natural language is still in its infancy. You're not going to find a clean off-the-shelf solution if you're doing something unique.

Google Cloud has a natural language AI module that makes solving certain problems pretty easy (little to no coding). The next step is to look at packages like NLTK, which is more complex but can handle a larger set of natural language processing (you need to be comfortable with Python). If neither of the above tools does the job, you're going to need to dive much deeper, and learn about the fundamentals of neural nets and natural language processing and become familiar with tools like Tensorflow.

Natural language processing is a much more open field than image processing, despite a seemingly much lower bandwidth and data size. As far as data-size, all of Shakespeare's works can fit into a single iPhone image, but understanding what's going on in natural language is often a far more difficult task than what's going on in an image.

link

DoctorOetker 2710 days ago

I apologize for my late edit, and thank you for your response. I just need a minimal codebase that generates a word embedding that satisfies these quasi linear relationships.

link