by ag8 191 days ago
This is a cool setup, but naively it feels like it would require hundreds of thousands of hours of data to train a decent generalizable model that would be useful for consumers. Are there plans to scale this up, or is there reason to believe that tens of thousands of hours are enough?
1 comments

Yeah, I think the way we trained the embedding model focused a lot on making it as efficient as possible, since it is such a data-limited regime. So based on (early) scaling results, I think it'll be closer to 50-70k hours, which we should be able to get in the coming months now that we've already scaled up a lot.

That said, the way to 10-20x data collection would be to open a couple of other data collection centers outside SF, in high-population cities. Right now, there's a big advantage in having the data collection totally in-house: because we're so small, it's much easier to debug and improve. But now that we've mostly worked out the process, it should be very straightforward for us to replicate the entire ops/data pipeline in 3-4 parallel data collection centers.

I live nearish to Seattle, in Tacoma. I would be willing to set up a center.