|
|
|
|
|
by icyfox
1814 days ago
|
|
Interesting product - have felt an acute need for something similar while writing machine learning preprocessing pipelines without having to spin up a dask or pyspark cluster. How are you dealing with data streaming latency here? For most of the things I've worked on the compute needs grow O(n) or O(n^2) with the dataset size. Farming out the compute to a remote server might solve the CPU bottlenecks but at the expense of having to pay the network transfer overhead. For the speed of most pipelines I'm not sure that's a viable tradeoff. |
|
I think you identified the tradeoff correctly. What I'm building is designed for when compute needs outweigh the data transfer overhead. This works for some applications and not others. In particular, I think my approach will work especially well for ML model training and web scrapers.
That being said, the fewer bytes transferred the better. Apart from colocating servers, there's not much I can do for the data. No matter what, it has to go over an ethernet cable. It's possible to cache source code though and is something I'm working on doing over multiple layers.
In your example, for preprocessing pipelines, Fastmap probably wouldn't make sense - or at least that's my instinct. In my experience, it's rare to see pipeline steps where the compute significantly outweighs the data transfer. I'd be curious to hear your problem though. I might have a blind spot?