| HN Mirror


by scottrogowski 1810 days ago

Hey, a couple of thoughts:

I think you identified the tradeoff correctly. What I'm building is designed for when compute needs outweigh the data transfer overhead. This works for some applications and not others. In particular, I think my approach will work especially well for ML model training and web scrapers.
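A rough way to picture that tradeoff is a back-of-the-envelope estimate. This is only a sketch: the function name, its parameters, and the simple "transfer time plus evenly divided compute" model are assumptions made for illustration, not anything taken from Fastmap itself.

```python
def remote_speedup(compute_s, payload_bytes, workers, bandwidth_bps, latency_s=0.0):
    """Estimate the speedup of offloading a job versus running it on one local core.

    Hypothetical model: total remote time is the time to ship the payload
    plus the compute time divided evenly across the remote workers.
    """
    transfer_s = latency_s + (payload_bytes * 8) / bandwidth_bps
    remote_s = transfer_s + compute_s / workers
    return compute_s / remote_s


# Compute-heavy job with a tiny payload: offloading pays off.
heavy = remote_speedup(compute_s=100, payload_bytes=0, workers=10, bandwidth_bps=1e9)

# Data-heavy job with little compute: the transfer dominates and offloading loses.
light = remote_speedup(compute_s=1, payload_bytes=125_000_000, workers=10, bandwidth_bps=1e9)
```

A value above 1 suggests offloading is worth it under this model, and a value below 1 suggests the transfer overhead wins.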

That being said, the fewer bytes transferred the better. Apart from colocating servers, there's not much I can do for the data. No matter what, it has to go over an ethernet cable. Source code can be cached, though, and I'm working on doing that across multiple layers.

In your example, for preprocessing pipelines, Fastmap probably wouldn't make sense - or at least that's my instinct. In my experience, it's rare to see pipeline steps where the compute significantly outweighs the data transfer. I'd be curious to hear your problem though. I might have a blind spot?

1 comments

icyfox 1809 days ago

Makes sense.

A lot of my work involves parsing crawled pages, i.e. HTML -> structured DOM tree -> some heuristic search or other algorithm to tag or extract elements. So in this case the compute per page is relatively high, but I still think the overall latency of having to sync to a remote endpoint is too heavy.

One hybrid approach would be some hosted data product that is colocated with your compute cluster. On nonsensitive datasets I'd be happy to upload my initial data upfront if it means I can get linear scalability when I need to run compute.


scottrogowski 1809 days ago

Ah, I think I understand a little better. Knowing nothing else about what you're doing, I would structure it as multiple Fastmap tasks that each run the crawler and processor code, and then pull down just the processed data. So basically, have Fastmap as the most upstream node.
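As a sketch of that structure, each task could fetch and process a page in one step and return only the small extracted record, so the raw HTML never has to travel back. Everything here is hypothetical: `crawl_and_extract`, `TitleParser`, and the title-only extraction stand in for real crawler and processor code, and the built-in `map` stands in for whatever distributed map call would actually be used.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Toy processor: collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def fetch(url):
    """Download a page and decode it as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl_and_extract(url, fetch_page=fetch):
    """Crawl and process in the same task; return only the small result."""
    html = fetch_page(url)
    parser = TitleParser()
    parser.feed(html)
    return {
        "url": url,
        "title": parser.title.strip(),
        "html_bytes": len(html.encode("utf-8")),  # size of what stayed remote
    }


def run_all(urls, fetch_page=fetch):
    # Stand-in for a distributed map: only the compact records come back.
    return list(map(lambda u: crawl_and_extract(u, fetch_page), urls))
```

Only the URL list goes up and only the extracted records come down, which keeps the transfer small compared to syncing crawled pages to a remote endpoint.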

Whether or not that's possible with the realities of your application is another question of course :).

If you're curious, I'd be happy to provide some compute time (say 100 vCPU-hours) in exchange for feedback. Email me if that's at all interesting.
