by pgao 2166 days ago
I absolutely, 120% agree on the importance of adding the right data. Aquarium helps you answer two questions: "what data should I be collecting to improve my model?" and "where do I find that data?"

For the latter, Aquarium treats the problem of smart data sampling as a search and retrieval problem. You want to find more examples of a "target" from a large stream of unlabeled data. Aquarium does this by comparing embeddings of the unlabeled data to your "target set" and then sending examples to labeling if they're within a defined distance threshold in embedding space. We don't actually do the labeling, but we wrap around common labeling providers and can integrate into in-house flows with our API.
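To make the thresholding idea concrete, here is a minimal sketch in Python with NumPy. It is not Aquarium's actual implementation: the function name `select_for_labeling`, the choice of cosine distance, and the toy 2-D vectors are all assumptions for illustration. The comment above only says that examples are selected when they fall within a defined distance threshold of the target set.

```python
import numpy as np

def select_for_labeling(unlabeled, targets, max_distance):
    """Return indices of unlabeled embeddings close to any target embedding.

    Hypothetical helper: cosine distance is an assumed metric, not one
    named in the original comment.
    """
    # L2-normalize rows so a dot product equals cosine similarity.
    u = unlabeled / np.linalg.norm(unlabeled, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    # Cosine distance matrix, shape (n_unlabeled, n_targets).
    dist = 1.0 - u @ t.T
    # Distance from each unlabeled example to its nearest target.
    nearest = dist.min(axis=1)
    # Keep the examples that fall inside the threshold.
    return np.flatnonzero(nearest <= max_distance)

# Toy data: two "target" embeddings and four unlabeled ones.
targets = np.array([[1.0, 0.0], [0.9, 0.1]])
unlabeled = np.array([[0.95, 0.05], [0.0, 1.0], [-1.0, 0.0], [0.8, 0.3]])

picked = select_for_labeling(unlabeled, targets, max_distance=0.05)
print(picked)  # indices of the examples that would be sent to labeling
```

On a real stream you would likely swap the brute-force distance matrix for an approximate nearest-neighbor index, but the selection rule stays the same: nearest-target distance compared against a threshold.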

1 comment

Other founder here! For a high-level overview of this framing of the problem, I recommend reading this Waymo blog post [1].

One nice feature is that by using embeddings produced by a user's model, which has been trained in the context of their domain, we can do this sort of smart sampling in domains we've never seen before. Embeddings are also naturally anonymized, so we can do this without access to a user's potentially private raw data streams.

[1] https://blog.waymo.com/2020/02/content-search.html