| HN Mirror



	by dacm 2763 days ago
We've been packaging pandas in a Lambda that performs some calculations, but the 50 MB zip file results in cold starts of about 6-8 seconds. We're lucky that the service sees little use, so our workaround is a Lambda warmer that runs every 5 minutes and invokes N pandas Lambdas. I'd be very interested to know whether Layers has some feature to avoid this kind of issue.
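
For illustration, a warmer of the kind described above might look roughly like the sketch below. It assumes boto3's Lambda client; the function name `pandas-calc`, the `warmer` payload key, and the count of 5 are hypothetical and not taken from the post.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical names: not from the original post.
TARGET_FUNCTION = "pandas-calc"
WARM_COUNT = 5  # the "N" in the post


def warm(client, function_name, count):
    """Invoke `function_name` `count` times concurrently and return status codes.

    Concurrent synchronous calls are used so that each call is more likely to
    land on a separate container instead of reusing one warm container.
    """
    def invoke(index):
        response = client.invoke(
            FunctionName=function_name,
            InvocationType="RequestResponse",
            Payload=json.dumps({"warmer": True, "index": index}).encode("utf-8"),
        )
        return response["StatusCode"]

    with ThreadPoolExecutor(max_workers=count) as pool:
        return list(pool.map(invoke, range(count)))


def handler(event, context):
    """Entry point for the warmer Lambda, triggered by a 5-minute schedule."""
    import boto3  # imported here so the module loads without boto3 installed

    return warm(boto3.client("lambda"), TARGET_FUNCTION, WARM_COUNT)
```

The target function would need to recognize the `warmer` key in its event and return early, so that warm-up calls skip the real calculation.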

4 comments

robbiemitchell 2763 days ago

We had the same cold start problem and couldn’t find a way to reliably keep things warm. For instance, Lambda would often spawn a new instance rather than re-use the warm one.

In the end, we came to the conclusion that Amazon is smart and won’t let you hack together the equivalent of a cheaper EC2.


jacques_chester 2761 days ago

I don't think it's deliberately so, just that developing a solution requires scheduling and routing to cooperate. Normally they're considered by separate systems. As your execution pool expands, this problem becomes worse, not better.

On the other hand, their incentive to solve the problem is relatively weak vs an on-premise alternative.


FraaJad 2763 days ago

If I were doing this today, I would prototype the problem in Python and, after realising the startup penalty, rewrite it in D's Mir [1] or Nim's Arraymancer [2].

Life on a lambda is too short to pay a 6-8 second startup penalty over and over, millions of times.

[1]: https://github.com/libmir/mir-algorithm
[2]: https://mratsim.github.io/Arraymancer/


dacm 2762 days ago

Our problem is that we have a team of data scientists who are familiar with Python, plus a decent set of custom tools written in it, so changing languages isn't an option


FraaJad 2762 days ago

That's often the current explanation for the continued use of Python and R.

Often it is a sign that the problem is not "big" enough (e.g., not crunching truly large data sets) or that the data science team gets a disproportionate amount of goodwill (and thus money) to spend on its foibles. :)


paddy_m 2763 days ago

How did you get the zip down to 50 MB? I was under the impression that pandas + numpy was closer to 300 MB and bumped up against AWS size limits. I was considering building some hacked-together thing with S3.

I came to this thread specifically to find out about numpy and pandas on lambda.


richstoner 2763 days ago

We've been running a stripped-down version of numpy + scipy + matplotlib in Lambda. We build the environment in a Docker container running Amazon Linux, manually remove unneeded shared objects, and then rezip the local environment before uploading it to S3.

A similar method is described here: https://serverlesscode.com/post/deploy-scikitlearn-on-lamba/

Layers should make this entire process easier.
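
The rezip step can be sketched in Python as below, under stated assumptions. The prune rules here (dropping `tests` directories and compiled bytecode) are illustrative guesses, not the rules described above, and the function name `zip_pruned` is hypothetical. The sketch does not decide which shared objects are unneeded; that part was manual in the description above.

```python
import os
import zipfile

# Assumed prune rules; which files are safe to drop depends on the packages.
SKIP_DIRS = {"__pycache__", "tests"}
SKIP_SUFFIXES = (".pyc", ".pyo")


def zip_pruned(source_dir, zip_path):
    """Zip `source_dir` into `zip_path`, skipping the paths matched above.

    Returns the sorted list of archive names that were written.
    """
    written = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, dirs, files in os.walk(source_dir):
            # Editing `dirs` in place stops os.walk from descending into them.
            dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
            for name in files:
                if name.endswith(SKIP_SUFFIXES):
                    continue
                full_path = os.path.join(root, name)
                arcname = os.path.relpath(full_path, source_dir)
                archive.write(full_path, arcname)
                written.append(arcname)
    return sorted(written)
```

The output path should sit outside `source_dir`, otherwise the archive being written could be picked up by the walk.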


dacm 2762 days ago

When I worked on this I used this article as a reference: https://serverless.com/blog/serverless-python-packaging/ and also ended up with a huge image. What that article didn't mention is that the virtual environment folder should be excluded in the serverless config file, as the runtime is provided by boto3. So adding:

    package:
      exclude:
        - venv/

would reduce the size considerably (to 50 MB in my case)


ramraj07 2763 days ago

Why though? Is it cheaper than just running a bunch of servers?


dacm 2762 days ago

It is in our case. This service is very seasonal, so it may be used on only a couple of days each month. Having a bunch of mostly idle instances would definitely be more expensive.


yourapostasy 2762 days ago

How much delay from a cold start can your application tolerate? On the order of tenths of a second or up to one second?


dacm 2762 days ago

Since the data is queried from a web app over HTTP, the shorter the better. Around 1 second should be alright, but 6-8 definitely isn't.
