by mlucy 2789 days ago
That's a really interesting idea.

I can't really think of a barrier to this. Detecting the file format is straightforward, and generic image/text/etc. embeddings work surprisingly well. (In fact, you can actually get some generalization gains by training subword text embeddings on corpora in multiple languages.)

If we wanted to be able to use specific embeddings (e.g. photos vs. line art, English vs. German), we could probably do it by running the data through a generic embedding, seeing which cluster of training data it's closest to, and then running it through the specific embedding for that cluster.

It would be really important in this case to make sure that all the specific embeddings are embedding into the same space, in case people have a mixed dataset, but that's very doable.
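
Here's a rough sketch of that routing step in Python, just to make the idea concrete. Everything in it is hypothetical: the function names, the toy "generic embedding" (which only counts umlauts and ß), the hand-picked centroids, and the two stand-in specific embedders are assumptions for illustration, not a real model or API. In practice the centroids would come from clustering generic embeddings of the training data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_cluster(vec, centroids):
    """Return the name of the centroid most similar to vec."""
    return max(centroids, key=lambda name: cosine(vec, centroids[name]))


def route_and_embed(item, generic_embed, centroids, specific_embedders):
    """Embed generically, pick the closest cluster, then use that cluster's embedder."""
    generic_vec = generic_embed(item)
    cluster = nearest_cluster(generic_vec, centroids)
    return cluster, specific_embedders[cluster](item)


# --- Toy stand-ins below; all hypothetical, only here to make the sketch run ---

def toy_generic_embed(text):
    # Two features: count of German-specific characters, count of everything else.
    special = sum(ch in "äöüß" for ch in text.lower())
    return [float(special), float(len(text) - special)]


# Assumed cluster centroids in the generic space (hand-picked, not learned).
CENTROIDS = {"english": [0.0, 1.0], "german": [0.2, 0.8]}

# Assumed specific embedders. Both return 2-d vectors, standing in for
# "all the specific embeddings embed into the same space".
SPECIFIC = {
    "english": lambda text: [float(len(text)), 0.0],
    "german": lambda text: [float(len(text)), 1.0],
}

english_result = route_and_embed("hello world", toy_generic_embed, CENTROIDS, SPECIFIC)
german_result = route_and_embed("schöne grüße", toy_generic_embed, CENTROIDS, SPECIFIC)
```

Matching output dimensions, as in the toy embedders here, is only the bare minimum for a shared space. The real requirement is that nearby points mean similar things no matter which specific embedder produced them, and this sketch doesn't try to check that.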