by jen20 3821 days ago
Has the author (if they are reading here) considered using Joyent's Manta to take the processing to the data instead?
4 comments

There are plenty of architectures that do exactly this: EMR-on-S3, Google Dataproc on GCS, Snowflake-on-S3, BigQuery-on-GCS, and so on.

The bigger point in the article is that these exact "take processing to the data" architectures operate exceedingly well on S3, GCS, Azure.

And, as a biased observer, I'd say these architectures run best on GCS, thanks to the strong performance measured in the article, quick VM startup times, low VM prices, and per-minute billing.

I'm still trying to parse the docs and Manta source code to see what it actually does, but it seems unique if the data storage nodes are also the data processing nodes and no data transfer happens from some storage service before the job begins. The other key factor is having neither startup time nor the cost of a perpetually running cluster. Per my comment below [1], we have used Lambda with S3 to get something like this, as well as our own architecture built on plain EC2/GCE nodes.

[1] https://news.ycombinator.com/item?id=10846514

Not only that, but the thing is built by guys who really know what they are doing, like Bryan Cantrill and other former top Sun people.
got it. thanks!
Are you sure you understand what "take the processing to the data" means?

EMR-on-S3 is the "copy the data to the processing nodes" variety.

I think Manta is better if the result set is smaller than the input set, since network performance then won't matter as much. Per-second pricing is also better, since the author needs the result in 10 seconds.

Spinning up a cluster of VMs, using it for 10 seconds, and being charged a one-hour minimum seems expensive to me.
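
To put rough numbers on that, here is a quick sketch of how billing granularity changes the cost of a 10-second job. The node count and hourly rate are made-up assumptions, not anyone's real prices, and `billed_cost` is just a helper invented for this illustration.

```python
import math


def billed_cost(runtime_s, nodes, hourly_rate, increment_s):
    """Cost of a job when usage is rounded up to the billing increment."""
    # Round the runtime up to a whole number of billing increments.
    billable_s = math.ceil(runtime_s / increment_s) * increment_s
    return nodes * hourly_rate * billable_s / 3600


# Hypothetical figures, chosen only to show the ratio between schemes.
NODES = 100          # assumed cluster size
HOURLY_RATE = 0.10   # assumed USD per node-hour
RUNTIME_S = 10       # the 10-second job discussed above

per_hour = billed_cost(RUNTIME_S, NODES, HOURLY_RATE, increment_s=3600)
per_minute = billed_cost(RUNTIME_S, NODES, HOURLY_RATE, increment_s=60)
per_second = billed_cost(RUNTIME_S, NODES, HOURLY_RATE, increment_s=1)

print(f"hourly minimum: ${per_hour:.4f}")
print(f"per-minute:     ${per_minute:.4f}")
print(f"per-second:     ${per_second:.4f}")
```

Whatever rate you plug in, the one-hour minimum costs 360 times what per-second billing would for a 10-second job (3600 / 10), and per-minute billing costs 6 times as much.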

I don't know about Manta, but this is the entire point of HDFS: it is easier to move code than data.
Indeed, but they're having such fun. Let's leave them be.
Hadn't heard of it, looks cool. Thanks for the tip :)