Hacker News
by jeremystan 3779 days ago
Behind the scenes, Instacart is a revolutionary new e-commerce marketplace, an incredibly sophisticated last-mile logistics engine, and a dynamic source of work for thousands of personal shoppers. Each of these aspects of Instacart could be a whole company elsewhere, and data science plays a key role in our success in each endeavor.

In this article, I (our VP Data Science) highlight some challenges the data science team is tackling at Instacart ranging from logistics to personalization. I also go into detail on how we have organized data science to have maximal impact and what we look for when recruiting data scientists.

2 comments

fyi I interviewed for this team and 2 notes:

1 - I've been doing ad optimization / user classification / propensity scoring / product recommendations, warned them that I took a couple stochastic processes classes but haven't used them for a decade, and that I was entirely unsuited for OR type problems. They said that was ok and they were hiring for things I was suited for. Great. My in-person interview was primarily an OR problem best solved with stochastic processes.

2 - they were very responsive at first but after the interview, went radio silent for a week. After promising a response in a day. This was particularly annoying since I told them I had a written offer that I was pushing off for them. My guess is they were waiting to see if another candidate would accept. Which is fine, but the recruiter should have been honest with me. They ignored me for 5 business days after the interview -- 4 after their promised response -- before finally telling me no thanks. I'm not grumpy about being told no -- that's definitely happened before -- but about their crappy behavior. Fortunately I'd already accepted the other offer after reading between the lines, but still, the experience left me grumpy.

I debated posting this for a while, but bluntly, I kind of felt like they wasted my time and was really not happy their internal recruiter blew me off after repeated promises otherwise. I'm sure they'll be along to say your experience will be different (and it may well be!) but here's a data point for your consideration. I'm just sharing my experience.

> they were very responsive at first but after the interview, went radio silent for a week. After promising a response in a day. This was particularly annoying since I told them I had a written offer that I was pushing off for them.

ugh, I hate that. I was jerked around by an a-list firm like that this fall and it was totally frustrating, especially because I turned down another offer while I was waiting to hear back. Really screwed me over and left me very bitter/annoyed. There's no respect and everyone's looking out for #1.

We work hard to get back to candidates quickly - in some cases the same day as they interview, but at least the day after if not. We know the market is very competitive and want candidates to make the best decisions they can - so it's in our best interest to act quickly! We are always working to improve how we screen, interview and respond to candidates in hiring, so will take this feedback to heart. Thank you for providing it.

Regarding the focus on OR, that was definitely the case for our first few years, and while it's still important, we have definitely expanded our focus beyond it.

Not sure if you realize but all your responses here seem terribly canned. That's probably also why your first comment in this thread is being downvoted.
re-reading what I wrote I can see how it comes off as canned - thanks for the feedback
>“We will take full ownership of our projects. We take pride in our work and relentlessly execute to get things completely finished.”

Does this mean that as an employee, or ex-employee, I can take my owned projects with me and use them for my own purposes?

We've worked hard to open source projects whenever we think they'll be useful broadly: https://www.instacart.com/opensource. There is definitely more of this we can (and I hope will) do in the future.
Why don't you pay your workers?

https://news.ycombinator.com/item?id=11121092

How do you guys run R in production? Just getting started with R-based data science and it has been a struggle to figure out how to build a production data science stack.

Do you snapshot the computed models as RData and stream them to S3, etc.?

We use R in production in two ways:

1. For batch processes that run daily, hourly or minutely, where the models are rebuilt on every run, and outputs (often predictions) are written to a database
2. For computation of coefficients in large sparse regularized models, where the coefficients are written to a database and scoring is done in another language in real-time

For situations where we want real-time predictions, recommendations or optimizations, we tend to set up Python services instead. For batch processes, you can definitely store models in S3 to re-use them, and I've done that at other companies. But in general I've found it better to rebuild models frequently and cache them for short periods of time only if they are cost-prohibitive to rebuild.
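As a rough illustration of the "rebuild often, cache only briefly" idea above, here is a minimal Python sketch. The class name, the `build_model` callback and the `ttl_seconds` parameter are all hypothetical and not taken from the thread; a real service would also need to think about concurrency and failure handling.

```python
import time


class ShortLivedModelCache:
    """Keep a model only for a short time, then rebuild it on demand.

    Hypothetical helper: `build_model` is any zero-argument callable that
    returns a freshly fitted model (assumed to be expensive to run).
    """

    def __init__(self, build_model, ttl_seconds, clock=time.monotonic):
        self._build_model = build_model
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._model = None
        self._built_at = None

    def get(self):
        now = self._clock()
        expired = self._built_at is None or now - self._built_at >= self._ttl
        if expired:
            # Rebuild from scratch instead of loading an old snapshot.
            self._model = self._build_model()
            self._built_at = now
        return self._model
```

With a short `ttl_seconds`, callers never see a model older than that window, and the expensive rebuild runs at most once per window.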

@jeremy - that's helpful! Could you hint at the way you persist sparse models to the DB in R? Especially if you are changing your variables pretty frequently. Do you use something like Postgres JSONB (which is funky in R)?

Also about scoring in another language - is this really worthwhile for you? I have often debated just throwing 128GB of RAM on an R machine and calling it a day. As I figure, your "real time" requirements are probably seconds or even minutes (similar to mine).

When persisting sparse models to the DB, especially if you use L1 regularization (like Lasso), many coefficients will be 0 and don't need to be stored or processed. As long as you store coefficients and features in a "tall" format (e.g., user, feature_key, feature_value), space is conserved. Scoring can be done in the DB with joins and group by, or in another language with similar operations.
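To make the "tall" layout concrete, here is a small sketch using Python's built-in sqlite3 module. SQLite only stands in for whatever database is actually used. The table names, column names and feature keys are made up for illustration, and the intercept term is left out for brevity.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical tall tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE coefficients (
            feature_key TEXT PRIMARY KEY,
            coefficient REAL NOT NULL
        );
        CREATE TABLE features (
            user_id INTEGER,
            feature_key TEXT,
            feature_value REAL
        );
    """)
    # Only non-zero coefficients are stored; the zeroed ones are simply absent.
    conn.executemany(
        "INSERT INTO coefficients VALUES (?, ?)",
        [("orders_last_30d", 0.5), ("bought_bananas", 2.0)],
    )
    conn.executemany(
        "INSERT INTO features VALUES (?, ?, ?)",
        [
            (1, "orders_last_30d", 4.0),
            (1, "bought_bananas", 1.0),
            (1, "uses_android", 1.0),  # no stored coefficient, dropped by the join
            (2, "orders_last_30d", 2.0),
        ],
    )
    return conn


def score_users(conn):
    """Linear score per user: sum of feature_value * coefficient."""
    rows = conn.execute("""
        SELECT f.user_id, SUM(f.feature_value * c.coefficient) AS score
        FROM features AS f
        JOIN coefficients AS c ON c.feature_key = f.feature_key
        GROUP BY f.user_id
        ORDER BY f.user_id
    """).fetchall()
    return dict(rows)
```

Because this is an inner join, a user whose features all have zero coefficients would not appear in the result at all; a LEFT JOIN from a users table would be one way to handle that case.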

Changing variables frequently can be versioned in the feature and model coefficients tables, but takes care.
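One possible way to version things, sketched with sqlite3 again: put a version column on both tables and join on it, so an old model never gets scored against a newer feature definition. The schema, the single shared `version` column and the feature names are assumptions for illustration, not a description of any actual setup.

```python
import sqlite3


def make_versioned_db():
    """In-memory demo where features and coefficients share a version key."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE coefficients (
            version INTEGER,
            feature_key TEXT,
            coefficient REAL,
            PRIMARY KEY (version, feature_key)
        );
        CREATE TABLE features (
            version INTEGER,
            user_id INTEGER,
            feature_key TEXT,
            feature_value REAL
        );
    """)
    conn.executemany(
        "INSERT INTO coefficients VALUES (?, ?, ?)",
        [
            (1, "orders_last_30d", 0.5),
            (2, "orders_last_7d", 1.5),  # version 2 replaced the variable
        ],
    )
    conn.executemany(
        "INSERT INTO features VALUES (?, ?, ?, ?)",
        [
            (1, 1, "orders_last_30d", 4.0),
            (2, 1, "orders_last_7d", 2.0),
        ],
    )
    return conn


def score_user(conn, user_id, version):
    """Score one user with the coefficients and features of one version."""
    row = conn.execute(
        """
        SELECT SUM(f.feature_value * c.coefficient)
        FROM features AS f
        JOIN coefficients AS c
          ON c.version = f.version AND c.feature_key = f.feature_key
        WHERE f.user_id = ? AND f.version = ?
        """,
        (user_id, version),
    ).fetchone()
    return row[0]  # None when nothing matches
```

The care comes in keeping both tables populated for every version that is still being scored, and in deciding when old versions can be deleted.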

I haven't used Postgres JSONB, but if you have problems with JSON in R check out the tidyjson package (I wrote it when dealing with Mongo data previously).

Scoring in another language is best avoided if you can. But supporting R "real time" services will also come with many complications. Hence, we use Python when we really want that.

SparkR was completely unreliable when I first tested it over a year ago, but may have improved. The Spark Python API has some limitations compared to Scala, so I would guess the latest SparkR is even further behind, but we haven't tested it. Long term I'd love for that to be the answer to these questions.

If I were running R in production, then I'd probably fit models on some kind of batch process and then serve up the predictions/output from a DB or something.
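A minimal sketch of that split, in Python with sqlite3 standing in for the production DB: a batch job writes predictions to a table, and the serving path is a plain keyed lookup. The "model" here is a deliberately trivial placeholder (a per-user mean), and the function and table names are hypothetical.

```python
import sqlite3
from statistics import mean


def run_batch_job(conn, order_history):
    """Batch side: refit the placeholder model and overwrite stored predictions.

    `order_history` maps user_id -> list of past order totals (made-up input).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions ("
        "user_id INTEGER PRIMARY KEY, predicted_value REAL NOT NULL)"
    )
    rows = [(user_id, mean(totals)) for user_id, totals in order_history.items()]
    with conn:  # commit all rows together, or none of them on error
        conn.executemany("INSERT OR REPLACE INTO predictions VALUES (?, ?)", rows)


def lookup_prediction(conn, user_id, default=None):
    """Serving side: no model code at all, just a primary-key read."""
    row = conn.execute(
        "SELECT predicted_value FROM predictions WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    return row[0] if row else default
```

The serving path never touches the modeling language, so the batch job could just as well be an R script writing to the same table.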

In general, R is not well-suited for DB-backed websites in real-time, but you can certainly use the outputs in production.

You can do it, but I'm not sure it's worth the effort. You could probably provide a predict() interface in real-time if it was reasonably quick.

So I have seen a couple of large data science driven startups (like consumer finance) throw R on 128GB machines and call it a day. That's reasonably going to be my plan except that I can't make it work very well.

I really wish pandas had a "save workspace" feature - R does that very well. No point in saving to DB if you're going to need the data set in memory anyway.... Or use Hadoop.

We run UDFs in Hive to invoke R models, which is fine for compiling dashboards and reports but I wouldn't run it for something that needed instant responses.
That's what my company does
Could you talk a bit more about your production setup? Any multi-threading problems?