by scottrogowski
1810 days ago
Hey, a couple of thoughts: I think you identified the tradeoff correctly. What I'm building is designed for cases where the compute cost outweighs the data transfer overhead. That holds for some applications and not others. In particular, I think my approach will work especially well for ML model training and web scrapers. That being said, the fewer bytes transferred the better. Apart from colocating servers, there's not much I can do about the data. No matter what, it has to go over an ethernet cable. Source code can be cached, though, and I'm working on doing that across multiple layers. In your example of preprocessing pipelines, Fastmap probably wouldn't make sense, or at least that's my instinct. In my experience, it's rare to see pipeline steps where the compute significantly outweighs the data transfer. I'd be curious to hear about your problem, though. I might have a blind spot.
A lot of my work involves parsing crawled pages, i.e. HTML -> structured DOM tree -> some heuristic search or other algorithm to tag or extract elements. So in this case the compute per page is relatively high, but I still think the overall latency of having to sync to a remote endpoint is too high.
One hybrid approach would be a hosted data product that is colocated with your compute cluster. For nonsensitive datasets, I'd be happy to upload my initial data upfront if it means I get linear scalability when I need to run compute.
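
To put rough numbers on the tradeoff discussed in this thread, here is a minimal back-of-envelope sketch. It assumes a simple cost model in which all data is transferred before the work is split evenly across workers. The function name `offload_speedup` and all of its parameters are hypothetical and are not part of Fastmap.

```python
def offload_speedup(n_items, compute_s_per_item, bytes_per_item,
                    bandwidth_bytes_per_s, n_workers, fixed_overhead_s=0.0):
    """Estimate the speedup of remote parallel execution over local serial.

    Assumed model (an illustration, not how any real service behaves):
      local time  = n_items * compute_s_per_item
      remote time = fixed overhead + transfer time + local time / n_workers
    A result above 1.0 means offloading is estimated to be faster.
    """
    local_s = n_items * compute_s_per_item
    transfer_s = n_items * bytes_per_item / bandwidth_bytes_per_s
    remote_s = fixed_overhead_s + transfer_s + local_s / n_workers
    return local_s / remote_s


# 100 Mbit/s uplink expressed in bytes per second.
BANDWIDTH = 12.5e6

# Compute-heavy items (0.5 s each, 100 KB each): offloading pays off.
heavy = offload_speedup(10_000, 0.5, 100_000, BANDWIDTH, n_workers=100)

# Transfer-heavy items (5 ms each, 1 MB each): offloading is slower.
light = offload_speedup(10_000, 0.005, 1_000_000, BANDWIDTH, n_workers=100)

# Data already uploaded next to the cluster: no transfer cost at run time.
colocated = offload_speedup(10_000, 0.5, 0, BANDWIDTH, n_workers=100)

print(f"compute-heavy: {heavy:.2f}x")
print(f"transfer-heavy: {light:.2f}x")
print(f"colocated data: {colocated:.2f}x")
```

Setting `bytes_per_item` to 0 approximates the hybrid case where the data is uploaded upfront. Under this model the speedup then equals the worker count, which is the linear scalability mentioned above.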