| Apologies for the long answer, but this touches on a lot of interesting points: 1. Transfer learning / data volume. If you have a small image dataset, embedding it using an embedding trained on a much larger image dataset is really, really helpful. In our tutorial (https://www.basilica.ai/tutorials/how-to-train-an-image-mode...), we get good results with only 2-3k animal pictures, which is only possible because of the transfer learning aspect of embeddings. You could do transfer learning yourself, if you have the time and expertise. And for a domain like images, it's really easy to find big public datasets. But long-term we're hoping to have embeddings for a lot of areas where there aren't good public datasets, and pool data from all our customers to produce a better embedding than any of them could alone. 2. Ease of use. You can take a Basilica image embedding, feed it into a linear regression running on your laptop CPU, and get really good results. To get equally good results on your own, you'd need to run TensorFlow on GPUs. This is harder than it sounds for a lot of people. 3. Exploration. Because of the other two points, if you have a thought like "huh, I wonder if including these images would improve our pricing model", you can whip up some code and train a model in a few minutes to check. Maybe if it's a big model you go grab lunch while it trains. If you're doing everything from scratch in TensorFlow, it can take days to try the same thing. This activation energy reduces the amount of experimentation people do. It's bad for the same reasons having a multi-day compile/test loop would be bad. |
I agree with what you're saying here. I just wonder how it would work in practice.
So imagine I have this monster text or image, and I want to know if it looks like another text or image.
I send each to Basilica, it gives me back two vectors and I compare the vectors.
I use the cosine of the angle between the vectors as a similarity score, and let's say it comes out to be 0.6.
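To make the comparison step concrete, here is a minimal sketch of the cosine score in plain Python. It assumes the two embeddings arrive as equal-length lists of floats; the function name and the two short vectors are hypothetical placeholders (real embeddings would be much longer), chosen only so the score matches the 0.6 in this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Placeholder vectors standing in for the two embeddings (not real service output).
vec_a = [1.0, 0.0]
vec_b = [3.0, 4.0]

# dot = 3, norms are 1 and 5, so the score is 3 / 5.
score = cosine_similarity(vec_a, vec_b)
print(score)
```

The score depends only on the direction of the two vectors, not their length, so everything interesting is decided by how the embedding was produced.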
However, I think this is too low, and I want to tweak my algorithm.
At this point, doesn't the question of how the vector was generated come to the front? Did you get rid of common words, how did you treat stems, and so on? Or what biases did you introduce in training?
Furthermore, these questions come up right away, and they seem fundamental to whatever the main practice is.
In other words, can I even experiment or get started without knowing how word2vec works?
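A toy illustration of why the preprocessing question matters: the sketch below uses simple word-count vectors as a stand-in (this is not how word2vec or Basilica builds embeddings, and the stop-word list, sentences, and function names are all made up for the example). The same two texts get a noticeably different cosine score depending on whether common words are dropped first.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "on", "is"}

def bag_of_words(text, drop_stop_words):
    """Count lowercase whitespace-separated tokens, optionally dropping stop words."""
    tokens = text.lower().split()
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return dot / (n1 * n2)

doc_a = "the cat sat on the mat"
doc_b = "the dog sat on the rug"

with_stops = cosine(bag_of_words(doc_a, False), bag_of_words(doc_b, False))
without_stops = cosine(bag_of_words(doc_a, True), bag_of_words(doc_b, True))

print(with_stops, without_stops)
```

With the common words kept, the shared "the" and "on" dominate and the score is 0.75; with them removed, only "sat" overlaps and it drops to about 0.33. Nothing about the texts changed, only the vectorization.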