
by icyfox 1814 days ago

Makes sense.

A lot of my work involves parsing crawled pages, i.e. HTML -> structured DOM tree -> some heuristic search or other algorithm to tag or extract elements. So in this case the compute per page is relatively high, but I still think the overall latency of syncing to a remote endpoint is too high.
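For readers unfamiliar with this kind of workload, here is a minimal sketch of an HTML -> tree -> heuristic pipeline using only the Python standard library. It is not the commenter's actual code: the `Node` and `TreeBuilder` classes, the `parse` and `tag_main_content` functions, and the "container with the most paragraph text" heuristic are all hypothetical stand-ins chosen for illustration.

```python
from html.parser import HTMLParser


class Node:
    """One element in a simplified DOM tree (hypothetical helper)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""  # text that appears directly inside this element

    def walk(self):
        """Yield this node and all descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


class TreeBuilder(HTMLParser):
    """Builds a Node tree from the parser's start/end/data events."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:  # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray closes.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_main_content(root):
    """Assumed heuristic: pick the container whose direct <p> children hold the most text."""
    best, best_score = None, 0
    for node in root.walk():
        if node.tag not in {"div", "article", "section"}:
            continue
        score = sum(len(c.text.strip()) for c in node.children if c.tag == "p")
        if score > best_score:
            best, best_score = node, score
    return best
```

Real crawled pages are much messier than this parser tolerates, which is part of why the per-page compute ends up relatively high.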

One hybrid approach would be some hosted data product that is colocated with your compute cluster. On nonsensitive datasets I'd be happy to upload my initial data upfront if it means I can get linear scalability when I need to run compute.

1 comment

scottrogowski 1814 days ago

Ah. I think I understand a little better. Knowing nothing else about what you're doing, the way I would structure that would be to run multiple fastmap tasks with crawler + processor code and then pull down just the processed data. So basically, have fastmap as the most upstream node.
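A rough sketch of that structure, where each task both crawls and processes so that only the small processed record comes back. This does not use fastmap's real API, which is not shown in this thread; a local `ThreadPoolExecutor` stands in for the remote workers, and `fetch`, `process`, `crawl_and_process`, and `run` are hypothetical names.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url):
    """Download a page. Runs inside the worker, so raw HTML never leaves it."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def process(html):
    """Placeholder processor: keep only the title and the page size."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    title = match.group(1).strip() if match else None
    return {"title": title, "chars": len(html)}


def crawl_and_process(url, fetch_fn=fetch):
    """One task: crawl, process, and return just the processed record."""
    return {"url": url, **process(fetch_fn(url))}


def run(urls, fetch_fn=fetch, workers=8):
    # Stand-in for the distributed map: the pool plays the role of remote workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: crawl_and_process(u, fetch_fn), urls))
```

The point of the shape is that the only thing crossing the network back to the caller is the small dict per URL, not the crawled pages themselves.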

Whether or not that's possible with the realities of your application is another question of course :).

If you're curious, I'd be happy to provide some compute time (say 100 vCPU-hours) in exchange for feedback. Email me if that's at all interesting.