| HN Mirror

Yeah I think the way we trained the embedding model focused a lot on how to make it as efficient as possible, since it is such a data-limited regime. So I think based on (early) scaling results, it'll be closer to 50-70k hours, which we should be able to get in the next months now we've already scaled up a lot.

That said, the way to 10-20x data collection would be to open a couple other data collection centers outside SF, in high-population cities. Right now, there's a big advantage in just having the data collection totally in-house, because it's so much easier to debug/improve it because we're so small. But now we've mostly worked out the process, it should also be very straightforward for us to just replicate the entire ops/data pipeline in 3-4 parallel data collection centers.