by blorgle 3429 days ago
> I don't think, or get the impression, that this would be used for full-system virtualization. It seems to be more targeted toward an AWS Lambda sort of usage pattern; a micro-VM that spins up to serve one API request, or to act as a long-running tiny daemon to do some housekeeping task. It looks like a fancy fork(), to me, rather than a competitor to Docker or LXC (and especially not KVM or Xen, which can run a whole Linux kernel in the VM).

Almost. With ZeroCloud (OpenStack Swift + ZeroVM + appropriate middleware), you should imagine Lambda+S3 in the same service! Your "function" executes much, much closer to the location of the data, it doesn't require "shipping" from the storage service to the compute service and back again. You can take in an object, perform a transform (e.g. text search, encryption, transcoding) and store the result as a new object.
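To make the "take in an object, transform it, store a new object" flow concrete, here is a minimal toy sketch. `StorageNode`, `put`, `run`, and `grep_errors` are hypothetical names invented for illustration; this is an in-memory stand-in, not the ZeroCloud, ZeroVM, or Swift API.

```python
class StorageNode:
    """Toy in-memory stand-in for one storage node (hypothetical, not a real API)."""

    def __init__(self):
        self.objects = {}  # object name -> immutable bytes

    def put(self, name, data):
        # Objects are never overwritten: a transform always yields a new object.
        if name in self.objects:
            raise KeyError(f"{name} already exists; objects are immutable")
        self.objects[name] = bytes(data)

    def run(self, function, source, target):
        """Run `function` on the node that holds `source`; store the output as `target`."""
        self.put(target, function(self.objects[source]))
        return target


def grep_errors(data):
    """Example transform (a text search): keep only lines starting with b'error'."""
    return b"\n".join(line for line in data.splitlines() if line.startswith(b"error"))


node = StorageNode()
node.put("report.txt", b"error: disk full\nok\nerror: timeout\n")

# Only the function travels to the node; the object's bytes stay where they are.
node.run(grep_errors, "report.txt", "report.errors.txt")
print(node.objects["report.errors.txt"].decode())
```

The source object is left untouched and the result lands under a new name, which is the "immutable dataset" half of the idea.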

If you think about it, this is arguably the future of large-scale computing: immutable datasets plus immutable compute that can scale horizontally to huge numbers of nodes.

1 comments

> With ZeroCloud (OpenStack Swift + ZeroVM + appropriate middleware), you should imagine Lambda+S3 in the same service!

Is that something that exists in a production form today? What's an example of the use case? I'm having trouble visualizing this: "Your "function" executes much, much closer to the location of the data, it doesn't require "shipping" from the storage service to the compute service and back again." That sounds like going back to a monolithic model where data and functions are tightly coupled, but I assume I'm visualizing it wrong, since that would be moving backward.

Suppose you have a dataset big enough that it needs to be spread across a storage cluster. Now you'd like to run some kind of operation that's either completely parallel, or fits the map/reduce model. Maybe you have a petabyte worth of video files and you want to generate thumbnails, or extract metadata, or find all frames of video with text in them.

If compute is separated from storage, then all of that video data has to be streamed over the network from a storage node to a compute node before computation can even begin; the data is "shipped" to compute.

Presumably the function you want to execute is vastly smaller than the data. It would require much less time and bandwidth to ship the function to the node that holds the data and run it there; the bulk data never crosses the network. Assuming you have an adequate balance between compute and storage, you get much lower latency access to the data.
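A rough back-of-the-envelope sketch of that comparison. Every number below (function size, result size, node count) is an assumption picked for illustration, not a figure from any real deployment; only the petabyte of video comes from the example above.

```python
PB = 10**15
TB = 10**12
MB = 10**6


def bytes_over_network(dataset_bytes, function_bytes, result_bytes, nodes, ship_function):
    """Total bytes crossing the network for one job, under a very simple model."""
    if ship_function:
        # The function is copied once to every storage node. Results are counted
        # as network traffic too, which is pessimistic if they are stored locally.
        return function_bytes * nodes + result_bytes
    # Every input byte is streamed to a compute node, then results are written back.
    return dataset_bytes + result_bytes


dataset = 1 * PB     # the petabyte of video files
function = 10 * MB   # assumed size of the thumbnailing code
results = 1 * TB     # assumed total size of the generated thumbnails
nodes = 1000         # assumed number of storage nodes

ship_data = bytes_over_network(dataset, function, results, nodes, ship_function=False)
ship_code = bytes_over_network(dataset, function, results, nodes, ship_function=True)

print(f"ship data to compute:  {ship_data / TB:,.0f} TB")
print(f"ship function to data: {ship_code / TB:,.2f} TB")
```

Under these assumptions the traffic is dominated by the results rather than the input, so the gap widens further whenever the output is small relative to the dataset (metadata extraction, for instance).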

Some downsides include:

- running arbitrary code on your storage node means trusting your users or having very good sandboxing
- you now have to balance compute and storage on any given node

Joyent Manta is basically this. (https://www.joyent.com/manta) Bryan Cantrill has a good talk on why it's a useful thing. (I'd link it but I'm on mobile)