by Agebor 2745 days ago
I'm not sure I understand their view on fundamental limitations; they don't seem fundamental to me:

1. It does not seem impossible to imagine a function that spawns code close to data, for example on a VM with a fast attached SSD already populated with data. Also, Lambda-at-edge and Cloudflare workers are already more like “shipping code to data”, or to “the customer” in this case.

2. Functions are load-balanced and potentially parallelisable to millions of invocations. The only missing piece is some kind of parallel-invoke call, to give each instance of a function a distinct piece of data to process, and an identifier to save the result under. The identifier could easily refer to a local disk location, in some future implementation.
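A rough sketch of what such a parallel-invoke call might look like. Everything here is hypothetical: `parallel_invoke` is not an API of any existing FaaS platform, a local thread pool stands in for the fleet of function instances, and a plain dict stands in for wherever results would be saved.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def parallel_invoke(
    fn: Callable[[T], R],
    work_items: Iterable[Tuple[str, T]],
    max_workers: int = 8,
) -> Dict[str, R]:
    """Run fn once per (result_id, chunk) pair and save each result under its id.

    Hypothetical helper: the thread pool plays the role of function instances,
    and the returned dict plays the role of the result store.
    """
    results: Dict[str, R] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each invocation gets its own distinct piece of data.
        futures = {
            result_id: pool.submit(fn, chunk) for result_id, chunk in work_items
        }
        for result_id, future in futures.items():
            # .result() re-raises any exception thrown inside fn.
            results[result_id] = future.result()
    return results


def count_words(chunk: str) -> int:
    """Example function body: count whitespace-separated words in a chunk."""
    return len(chunk.split())


if __name__ == "__main__":
    chunks = [("part-0", "a b c"), ("part-1", "d e"), ("part-2", "")]
    print(parallel_invoke(count_words, chunks))
```

In a real system the identifier could just as well name an object-store key or a local disk path; the dict is only there to keep the sketch self-contained.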

Also, another point from the article, "FaaS discourages Open Source service innovation", seems wrong to me: it can only be said of the current, very early implementations, simply because they are new.

In the long run, I'd expect serverless to help open source because of the simplicity of deployment. We will likely have projects working on an abstraction layer over compute and storage, hiding the underlying cloud or multi-cloud implementation. (Kubernetes is one candidate; it just needs ideas along the lines of Virtual Kubelet to become more serverless.)

7 comments

The paper didn’t mention any fundamental limitations. Its main point was that current serverless architectures suffer from two major performance deficiencies: inter[process/agent/what-have-you] communication is funneled through the bottleneck of slow storage (e.g., S3, DynamoDB); and various forms of optimizations based on caching are hamstrung by the fact that agents are short-lived and not directly addressable over the network.
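To make the first deficiency concrete, here is a toy sketch of two agents that can only talk through a storage service. `SlowStore`, `producer`, and `consumer` are made-up names, and the sleep-based latency is an arbitrary stand-in, not a measurement of S3 or DynamoDB.

```python
import time
from typing import Dict, Optional


class SlowStore:
    """Stand-in for a remote store; every call sleeps to mimic a round trip."""

    def __init__(self, latency_s: float = 0.01) -> None:
        self._data: Dict[str, bytes] = {}
        self._latency_s = latency_s

    def put(self, key: str, value: bytes) -> None:
        time.sleep(self._latency_s)
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        time.sleep(self._latency_s)
        return self._data.get(key)


def producer(store: SlowStore, key: str, payload: bytes) -> None:
    """The sender cannot address the receiver, so it writes to storage."""
    store.put(key, payload)


def consumer(
    store: SlowStore,
    key: str,
    poll_interval_s: float = 0.005,
    timeout_s: float = 1.0,
) -> bytes:
    """The receiver polls storage until the message shows up or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        value = store.get(key)
        if value is not None:
            return value
        time.sleep(poll_interval_s)
    raise TimeoutError(f"no message under {key!r} within {timeout_s}s")


if __name__ == "__main__":
    store = SlowStore()
    producer(store, "msg/1", b"hello")
    print(consumer(store, "msg/1"))
```

Every message costs at least one write plus one or more polling reads against the store, which is the funnel the paper is complaining about.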

The authors’ concern regarding the potential lack of open-source projects that integrate with the serverless platforms currently on offer is based on the severity—roughly between one and three orders of magnitude, overall—of the aforementioned performance deficiencies. It appears to be based on the assumption that open-source contributors won’t invest in what they (accurately) perceive to be a technically inferior platform. This part of the paper isn’t particularly clear, but I believe the authors are talking more about extensions to current serverless platforms than about application-level code that simply runs on top of said platforms.

Anyway, I’m no expert, but I’m fairly familiar with the subject matter and I read the paper in its entirety. I’m open to corrections from those who possess a deeper understanding of the issues involved.

One major caveat: I’m much more familiar with AWS than I am with its competitors, and I’m taking the authors’ word for it when they assert that their AWS-based examples are broadly representative.

> 1. It does not seem impossible to imagine a function that spawns code close to data, for example on a VM with a fast attached SSD already populated with data. Also, Lambda-at-edge and Cloudflare workers are already more like “shipping code to data”, or to “the customer” in this case.

This would work, of course. But doesn't it defeat at least some of the convenience of a "serverless" architecture if I still need to manage/configure servers with attached (and pre-populated) storage?

> 2. Functions are load-balanced and potentially parallelisable to millions of invocations...

Continuing from point (1), if the code needs to run proximate to data it may be difficult to achieve a huge number of parallel invocations. My parallel capacity is limited by the number of servers available for function execution, which is only those servers with direct/fast access to storage.
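As back-of-the-envelope arithmetic, the ceiling could be written like this. The function name and the numbers in the example are invented for illustration; real platforms will have other limits too.

```python
def effective_parallelism(requested: int, data_nodes: int, slots_per_node: int) -> int:
    """Upper bound on concurrent invocations when code must run where the data is.

    Assumes each data-local server can host a fixed number of concurrent
    function instances (slots_per_node), which is a simplification.
    """
    if min(requested, data_nodes, slots_per_node) < 0:
        raise ValueError("arguments must be non-negative")
    # You cannot run more instances than the data-local servers have slots for.
    return min(requested, data_nodes * slots_per_node)


if __name__ == "__main__":
    # Asking for a million invocations against 200 data nodes with 16 slots each.
    print(effective_parallelism(1_000_000, data_nodes=200, slots_per_node=16))
```

With those made-up numbers the cap is 3,200 concurrent invocations, no matter how many were requested.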

> This would work, of course. But doesn't it defeat at least some of the convenience of a "serverless" architecture if I still need to manage/configure servers with attached (and pre-populated) storage?

It might not be you who maintains the server. Internally, Amazon’s DynamoDB equivalent allows code owned by teams to run on data nodes triggered by events (writes, deletes, fetches). That code is run in a sandbox with certain constraints that ensure computation stays local. It’s serverless for the function owners.
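A toy illustration of the shape of such a system, not the internal Amazon one: `DataNode` and its methods are hypothetical names, the store is an in-memory dict, and there is no real sandboxing here, only handlers registered against write, delete, and fetch events.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List, Optional

# A handler receives the key and the value involved in the event (None if absent).
Handler = Callable[[str, Optional[Any]], None]


class DataNode:
    """In-memory data node that runs registered handlers next to the data."""

    EVENTS = ("write", "delete", "fetch")

    def __init__(self) -> None:
        self._records: Dict[str, Any] = {}
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def on(self, event: str, handler: Handler) -> None:
        """Register team-owned code to run when the given event happens."""
        if event not in self.EVENTS:
            raise ValueError(f"unknown event: {event}")
        self._handlers[event].append(handler)

    def _fire(self, event: str, key: str, value: Optional[Any]) -> None:
        for handler in self._handlers[event]:
            handler(key, value)

    def write(self, key: str, value: Any) -> None:
        self._records[key] = value
        self._fire("write", key, value)

    def fetch(self, key: str) -> Optional[Any]:
        value = self._records.get(key)
        self._fire("fetch", key, value)
        return value

    def delete(self, key: str) -> None:
        value = self._records.pop(key, None)
        self._fire("delete", key, value)


if __name__ == "__main__":
    node = DataNode()
    audit: List[tuple] = []
    node.on("write", lambda key, value: audit.append(("write", key)))
    node.on("delete", lambda key, value: audit.append(("delete", key)))
    node.write("user:1", {"name": "a"})
    node.delete("user:1")
    print(audit)
```

The function owners only ever call `on(...)`; whoever operates the node decides where it runs and how it is partitioned.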

In my experience that’s really only true at small scale. Once your dataset/traffic volume gets bigger you have to start getting much more hands-on with sharding, keying/affinity, and availability.

When I left Amazon, this was a single data store with thousands of partitions, hundreds of billions of records, dozens of teams writing functions that ran on it, thousands of data sets, and hundreds of thousands of requests per second being made. Our team had several functions that handled thousands of requests per second. It was a critical piece of infrastructure for, among other things, Amazon retail, Prime, etc.

Sure, there was a team that owned the platform, but that wasn’t us. We were customers akin to AWS customers.

Joyent’s Manta system is the closest thing to a ‘bring code to the data’ system that I’ve seen: https://www.joyent.com/blog/hello-manta-bringing-unix-to-big.... Though it's geared more to data processing than serving traffic.

Yes, I'd love to be able to use Manta every day. It's pretty crazy to write a simple shell pipeline and have it actually run not on my local machine but on all the data nodes.

> It does not seem impossible to imagine a function that spawns code close to data

The authors discuss this:

> To achieve good performance, the infrastructure should be able and willing to physically colocate certain code and data. This is often best achieved by shipping code to data, rather than the current FaaS approach of pulling data to code. At the same time, elasticity requires that code and data be logically separated, to allow infrastructure to adapt placement: sometimes data needs to be replicated or repartitioned to match code needs. In essence, this is the traditional challenge of data independence, but at extreme and varying scale, with multi-tenanted usage and fine-grained adaptivity in time.

Not impossible to imagine, but the challenges are non-trivial.

This is very well put. In particular, there is a very common use case where Lambda-at-edge excels: thin query transformation layers, which both expand the number of use cases where back-and-forth can happen close to the user, and allow centralized services to focus on complex tasks worthy of the long haul.

Use lambda-at-edge to verify the format and size of an image. Ship it to your data center only when it's ready for permanent storage, object recognition AI, etc.
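Something along these lines, as a minimal sketch. It is plain Python rather than actual Lambda@Edge handler code; the 5 MiB limit and the function names are assumptions, and it only sniffs the leading magic bytes instead of fully decoding the image.

```python
from typing import Optional

MAX_BYTES = 5 * 1024 * 1024  # assumed upload limit, pick your own

# Well-known file signatures for a few common image formats.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"
GIF_MAGICS = (b"GIF87a", b"GIF89a")


def sniff_image_format(data: bytes) -> Optional[str]:
    """Guess the image format from the leading bytes, or None if unrecognized."""
    if data.startswith(PNG_MAGIC):
        return "png"
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    if data.startswith(GIF_MAGICS):
        return "gif"
    return None


def should_forward(data: bytes, max_bytes: int = MAX_BYTES) -> bool:
    """True only if the payload is non-empty, within the limit, and looks like an image."""
    return 0 < len(data) <= max_bytes and sniff_image_format(data) is not None


if __name__ == "__main__":
    print(should_forward(PNG_MAGIC + b"\x00" * 100))  # passes the cheap checks
    print(should_forward(b"definitely not an image"))  # rejected at the edge
```

Anything that fails `should_forward` can be rejected right at the edge, so only plausible uploads make the long trip to the data center.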

> "to imagine"

The paper's critique is exactly that: the current state is missing too many easy-to-imagine points for other paradigms.

> Cloudflare workers are already more like “shipping code to data”, or to “the customer” in this case.

That's shipping code further from the data, unless you store all of your data on your customers' hardware.

The closest data would be in the datacenter closest to the customer, not on the customer's hardware. But I mentioned it to illustrate that the code is already somewhat location-independent.

Is this your data or your customers' data? I'd much prefer the code that runs on my data to come to me, instead of me having to give my data away.