by icyfox 1814 days ago
Interesting product. I've felt an acute need for something like this while writing machine learning preprocessing pipelines: a way to scale out without having to spin up a Dask or PySpark cluster.

How are you dealing with data streaming latency here? For most of the things I've worked on, the compute needs grow as O(n) or O(n^2) with the dataset size. Farming out the compute to a remote server might solve the CPU bottleneck, but at the cost of paying the network transfer overhead. For the speed of most pipelines I'm not sure that's a viable tradeoff.


Hey, a couple of thoughts:

I think you identified the tradeoff correctly. What I'm building is designed for when compute needs outweigh the data transfer overhead. This works for some applications and not others. In particular, I think my approach will work especially well for ML model training and web scrapers.
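To make that tradeoff concrete, here is a rough back-of-envelope sketch. Everything in it is an assumption for illustration: the function names, the fixed overhead, and the example numbers are made up and are not Fastmap measurements. It also assumes compute scales linearly with the number of items and parallelizes perfectly across workers.

```python
def local_seconds(n_items: int, compute_s_per_item: float) -> float:
    """Time to process everything on one local core."""
    return n_items * compute_s_per_item


def remote_seconds(
    n_items: int,
    compute_s_per_item: float,
    workers: int,
    bytes_per_item: int,
    bandwidth_bytes_per_s: float,
    fixed_overhead_s: float = 1.0,  # assumed setup cost, purely illustrative
) -> float:
    """Time to ship the data out and process it on `workers` remote cores."""
    transfer = n_items * bytes_per_item / bandwidth_bytes_per_s
    compute = n_items * compute_s_per_item / workers
    return fixed_overhead_s + transfer + compute


def offload_pays_off(n_items, compute_s_per_item, workers,
                     bytes_per_item, bandwidth_bytes_per_s) -> bool:
    """True when the remote estimate beats the local one."""
    remote = remote_seconds(n_items, compute_s_per_item, workers,
                            bytes_per_item, bandwidth_bytes_per_s)
    return remote < local_seconds(n_items, compute_s_per_item)


# 10,000 items of 100 KB each over a ~100 Mbit/s link (12.5 MB/s), 50 workers.
# Heavy step (50 ms per item): roughly 500 s locally vs roughly 91 s remotely.
print(offload_pays_off(10_000, 0.05, 50, 100_000, 12.5e6))
# Light step (5 ms per item): roughly 50 s locally vs roughly 82 s remotely.
print(offload_pays_off(10_000, 0.005, 50, 100_000, 12.5e6))
```

In this toy model the transfer term is the same in both cases, so the only thing that changes the answer is how much compute each byte buys you.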

That being said, the fewer bytes transferred the better. Apart from colocating servers, there's not much I can do for the data. No matter what, it has to go over an ethernet cable. Source code can be cached, though, and that's something I'm working on doing across multiple layers.

In your example, for preprocessing pipelines, Fastmap probably wouldn't make sense - or at least that's my instinct. In my experience, it's rare to see pipeline steps where the compute significantly outweighs the data transfer. I'd be curious to hear your problem though. I might have a blind spot?

Makes sense.

A lot of my work involves parsing crawled pages, i.e. HTML -> structured DOM tree -> some heuristic search or other algorithm to tag or extract elements. So in this case the compute per page is relatively high, but I still think the overall latency of having to sync to a remote endpoint is too heavy.
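For a sense of the shape of that per-page work, here is a minimal sketch using only the standard library. The `Node` and `TreeBuilder` classes and the "element with the most direct text" heuristic are hypothetical stand-ins for illustration, not the actual extraction logic.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so we must not descend into them.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest matching open tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        # Whitespace between chunks is dropped; fine for a length heuristic.
        self.current.text += data.strip()


def iter_nodes(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def densest_text_node(html: str) -> Node:
    """Toy heuristic: return the element holding the most direct text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return max(iter_nodes(builder.root), key=lambda n: len(n.text))


page = ("<html><body><div><a>nav</a></div>"
        "<p>This is the main article text.</p></body></html>")
print(densest_text_node(page).tag)
```

Even in this toy form, every page pays for a full parse, a tree build, and a tree walk, which is where the per-page compute comes from.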

One hybrid approach would be some hosted data product that is colocated with your compute cluster. On nonsensitive datasets I'd be happy to upload my initial data upfront if it means I can get linear scalability when I need to run compute.

Ah. I think I understand a little better. Knowing nothing else about what you're doing, the way I would structure that would be to run multiple Fastmap tasks with crawler + processor code and then pull down just the processed data. So basically, have Fastmap as the most upstream node.

Whether or not that's possible with the realities of your application is another question of course :).
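As a rough sketch of that structure: the example below uses a local `ThreadPoolExecutor` as a stand-in for remote workers, and this is not Fastmap's API. `FAKE_WEB`, `fetch`, and `crawl_and_process` are hypothetical names, and the "web" is an in-memory dict so the example runs without a network.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the web: URL -> raw HTML.
FAKE_WEB = {
    "https://example.com/a": "<html><title>Page A</title>" + "x" * 50_000 + "</html>",
    "https://example.com/b": "<html><title>Page B</title>" + "y" * 80_000 + "</html>",
}


def fetch(url: str) -> str:
    """Pretend crawl; a real worker would do an HTTP request here."""
    return FAKE_WEB[url]


def crawl_and_process(url: str) -> dict:
    """Runs entirely on the worker: fetch the page, keep only the small result."""
    html = fetch(url)
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return {
        "url": url,
        "title": match.group(1) if match else None,
        "raw_bytes": len(html),  # recorded only to show what was NOT shipped back
    }


def run_all(urls):
    """Fan the URLs out to workers and collect the processed records in order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(crawl_and_process, urls))


for record in run_all(list(FAKE_WEB)):
    print(record)
```

The point is that only URLs go up and only small records come down; the raw HTML never crosses the link between you and the workers.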

If you're curious, I'd be happy to provide some compute time (say 100 vCPU-hours) in exchange for feedback. Email me if that's at all interesting.