Hacker News new | ask | show | jobs
by dacm 2763 days ago
We've been packaging pandas in a lambda which is used to perform some calculations, but being a 50 MB zip file makes cold starts of about 6-8 secs. We're lucky that the service has little use, thus our way to workaround it is by having a lambda warmer which is run every 5 minutes and invokes N pandas lambdas. I'd be very interested in knowing if Layers has some feature to avoid this kind of issue.
4 comments

We had the same cold start problem and couldn’t find a way to reliably keep things warm. For instance, Lambda would often spawn a new instance rather than re-use the warm one.

In the end, we came to the conclusion that Amazon is smart and won’t let you hack together the equivalent of a cheaper EC2.

I don't think it's deliberately so, just that developing a solution requires scheduling and routing to cooperate. Normally they're considered by separate systems. As your execution pool expands, this problem becomes worse, not better.

On the other hand, their incentive to solve the problem is relatively weak vs an on-premise alternative.

If I were doing this today, I would prototype the problem in Python and after realising the startup penalty, would rewrite it in D's Mir [1] or Nim's ArrayMancer [2].

Life on a lambda is too short to pay 6-8 second startup penalty over and over millions of time.

[1]: https://github.com/libmir/mir-algorithm [2]: https://mratsim.github.io/Arraymancer/

Our problem is that we have a team of data scientists who are familiar with Python, plus a decent set of custom tools written in it, so changing languages isn't an option
that's often the current explanation for continued use of Pyhton and R.

Often it is a sign that the problem is not "big" enough (eg: not crunching truly large data sets) OR data science team gets disproportionate amount of goodwill (thus money) to spend on its foibles. :)

How did you get the zip down to 50MB. I was under the impression that pandas+numpy was closer to 300MB and bumped up against AWS size limits. I was considering building some hacked together thing with S3

I came to this thread specifically to find out about numpy and pandas on lambda.

We've been running a stripped down version of numpy + scipy + matplotlib in lambda. We'd build the environment in a docker container with Amazon linux, manually remove unneeded shared objects and then rezip the local environment before uploading to s3.

A similar method is described here: https://serverlesscode.com/post/deploy-scikitlearn-on-lamba/

Layers should make this entire process easier.

When I worked on this I used this article as a reference: https://serverless.com/blog/serverless-python-packaging/ and also ended up with a huge image. What that article didn't mention is that the virtual environment folder should be excluded in the serverless config file, as the runtime is provided by boto3. So adding:

package: exclude: - venv/

would reduce the size considerably (to 50 MB in my case)

Why though? Is it cheaper than just running a bunch of servers?
It is in our case. This is a service which is very seasonal, so it may be used during a couple of days each month only. Having a bunch of instances mostly idle would definitely be more expensive
How much delay from a cold start can your application tolerate? On the order of tenths of a second or up to one second?
Being that the data is queried from a web app through HTTP, the shorter the better. Around 1 sec should be alright, but 6 - 8 definitely isn't