by aaronjg 5225 days ago
We often struggle writing for both audiences, and your feedback is well taken.

Here's a brief rundown of the math; more details can be found in the papers linked below [1,2].

We assume a latent attrition model: while alive, customers purchase with exponentially distributed interpurchase times, and they have a constant probability of dying at any moment (that is, an exponentially distributed lifetime). We then assume that the rate parameters of these two distributions are gamma distributed across customers.
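
A minimal sketch of that generative story, using only Python's standard `random` module. The parameter names (`r`, `alpha`, `s`, `beta`), their values, and the 52-week horizon are illustrative assumptions, not fitted to any data.

```python
import random


def simulate_customer(rng, r, alpha, s, beta, horizon):
    """Simulate one customer's purchase times on [0, horizon).

    r, alpha: shape and rate of the gamma over purchase rates (assumed names).
    s, beta:  shape and rate of the gamma over death rates (assumed names).
    """
    # random.gammavariate takes (shape, scale), so scale = 1 / rate.
    purchase_rate = rng.gammavariate(r, 1.0 / alpha)
    death_rate = rng.gammavariate(s, 1.0 / beta)

    # A constant chance of dying at any moment gives an exponential lifetime.
    lifetime = rng.expovariate(death_rate)
    active_until = min(lifetime, horizon)

    # While the customer is alive, gaps between purchases are exponential.
    purchases = []
    t = rng.expovariate(purchase_rate)
    while t < active_until:
        purchases.append(t)
        t += rng.expovariate(purchase_rate)
    return purchases, lifetime


rng = random.Random(0)  # fixed seed so the run is repeatable
customers = [
    simulate_customer(rng, r=0.5, alpha=10.0, s=0.6, beta=12.0, horizon=52.0)
    for _ in range(1000)
]
at_most_once = sum(1 for purchases, _ in customers if len(purchases) <= 1)
print(f"{at_most_once} of {len(customers)} simulated customers bought at most once")
```

Each simulated customer gets their own pair of rates, which is what makes the population heterogeneous even though every individual follows the same simple rules.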

The gamma distribution is the natural first choice because it is the conjugate prior for the exponential distribution. For the Pareto/NBD it means that we can write the likelihood function without having to use quadrature to solve the integral. It is possible that another distribution would work even better, though it would likely be more computationally intensive.
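
To make the conjugacy point concrete, here is a small sketch of the simplest version of that integral: the density of a single interpurchase time with the unknown rate integrated out. This is not the full Pareto/NBD likelihood, only the basic gamma-exponential building block, and the numbers are arbitrary. The closed form is compared against a plain midpoint-rule quadrature to show that the two agree.

```python
import math


def gamma_pdf(x, shape, rate):
    """Density of a gamma distribution in the shape/rate parameterization."""
    return rate**shape * x ** (shape - 1) * math.exp(-rate * x) / math.gamma(shape)


def posterior(shape, rate, gaps):
    """Conjugate update: a gamma prior stays gamma after seeing exponential gaps."""
    return shape + len(gaps), rate + sum(gaps)


def marginal_closed_form(t, shape, rate):
    """Density of one gap t with the unknown rate integrated out analytically."""
    return shape * rate**shape / (rate + t) ** (shape + 1)


def marginal_by_quadrature(t, shape, rate, upper=20.0, steps=20000):
    """The same integral approximated with a midpoint rule, for comparison."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h
        # exponential density of t given rate lam, weighted by the gamma prior
        total += lam * math.exp(-lam * t) * gamma_pdf(lam, shape, rate)
    return total * h


shape, rate, t = 2.0, 3.0, 1.5  # arbitrary illustrative values
exact = marginal_closed_form(t, shape, rate)
approx = marginal_by_quadrature(t, shape, rate)
print(exact, approx)
print(posterior(shape, rate, gaps=[0.5, 1.0, 2.5]))
```

The closed form costs one line of arithmetic per observation, while the quadrature needs thousands of function evaluations for the same answer. That gap is the computational argument for the gamma.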

Another nice characteristic of the gamma over, say, the log-normal is that when the shape parameter is less than 1, the density f(x) diverges at zero: \lim_{x \to 0} f(x) = \infty. This is a nice feature for customer bases that have many infrequent customers or many one-time customers.
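
A quick numeric illustration of that difference near zero, assuming a gamma with shape 0.5 and rate 1 and a standard log-normal (mu = 0, sigma = 1). Both parameter choices are arbitrary and only meant to show the qualitative behavior.

```python
import math


def gamma_pdf(x, shape, rate=1.0):
    """Gamma density in the shape/rate parameterization."""
    return rate**shape * x ** (shape - 1) * math.exp(-rate * x) / math.gamma(shape)


def lognormal_pdf(x, mu=0.0, sigma=1.0):
    """Log-normal density with log-scale mean mu and standard deviation sigma."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


xs = [1e-1, 1e-3, 1e-6]  # purchase rates approaching zero
gamma_values = [gamma_pdf(x, shape=0.5) for x in xs]
lognormal_values = [lognormal_pdf(x) for x in xs]
for x, g, ln in zip(xs, gamma_values, lognormal_values):
    print(f"x={x:g}  gamma={g:.3g}  log-normal={ln:.3g}")
```

The gamma density keeps growing as the rate shrinks, so it can put a lot of mass on customers who almost never buy. The log-normal density collapses toward zero instead.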

For the percent error numbers, we picked a representative sample of our clients who had over two years of data, and ran the three models with a holdout set of the most recent year. We then compared the performance of the Pareto/NBD against ARPU and against picking the year-old cohort. I uploaded a boxplot of the data, which you might find more informative [3].
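
A rough sketch of what a holdout comparison like this can look like. The exact error definition is an assumption here (absolute percent error on total holdout revenue), and the orders and forecasts are invented toy numbers, not results from the post. The names `split_holdout`, `forecast_a`, and `forecast_b` are hypothetical.

```python
def percent_error(predicted, actual):
    """Absolute percent error of a forecast against the observed holdout value."""
    if actual == 0:
        raise ValueError("percent error is undefined when the actual value is zero")
    return abs(predicted - actual) / abs(actual) * 100.0


def split_holdout(orders, cutoff_day):
    """Split (day, amount) orders into a calibration period and a holdout period."""
    calibration = [o for o in orders if o[0] < cutoff_day]
    holdout = [o for o in orders if o[0] >= cutoff_day]
    return calibration, holdout


# Toy (day, amount) orders, invented purely for illustration.
orders = [(10, 20.0), (200, 35.0), (400, 15.0), (500, 30.0), (700, 25.0)]
calibration, holdout = split_holdout(orders, cutoff_day=365)
actual = sum(amount for _, amount in holdout)

# Made-up forecasts standing in for whatever models are being compared;
# a real run would fit each model on `calibration` only.
forecasts = {"forecast_a": 63.0, "forecast_b": 84.0}
errors = {name: percent_error(value, actual) for name, value in forecasts.items()}
for name, err in errors.items():
    print(f"{name}: {err:.1f}% error")
```

The important part is that each model only ever sees the calibration period, so the error reflects a genuine forecast and not a fit to data it was trained on.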

Happy to chat more about the math here or by email (aaron@custora.com). Also would love to hear more about your retail startup and your CLV issues around that.

[1] http://www.jstor.org/pss/2631608

[2] http://marketing.wharton.upenn.edu/documents/research/Fader_...

[3] http://blog.custora.com/custora-content/uploads/2012/02/esti... (Note, the boxplot was generated a few months ago from different data, and we've updated the numbers for the blog post)

1 comments

These details are great! But also too technical for me to follow immediately since I'm not a statistician. It's enough information to point me in the right direction, though.

It's hard for me to express my writerly intuition here. The key is to (1) define your target audience as concretely as possible and (2) understand what shared vocabulary you have at your disposal.

Saying "Bayesian" or "gamma distribution" is going to put you out of reach of anyone non-technical. Saying "conjugate prior" or "shape parameter" is going to put you out of reach of anyone who isn't a practiced statistician.

I'd aim somewhere in the middle: technical people who aren't afraid of following a well-outlined, mathematical description of a problem they encounter regularly. "Why a gamma distribution?" would be a good footnote, for example, linking to a paper or another blog post of yours that explains it in more detail.

I find the articles that do the best are ones which take a somewhat-complicated topic and explain it, step-by-step, to an intelligent, technical audience. Pretend you were giving a lecture to a room full of HN members -- all technical and versed in basic mathematics, but not practiced statisticians. Write something that is not just a one-off, but could serve as reference material months and years from now.

I'm probably at the upper end of this target audience in terms of mathematical maturity, in the sense that when you say "conjugate prior" I know what you mean but can't remember the definition off the top of my head. However, I could look it up and understand it instantly. I'd have a harder time understanding why being the conjugate prior of the exponential distribution implies that the gamma distribution is well-suited for modeling customer behavior on an e-commerce site.

(I'd really like to know, though.)

To this day I have people link to articles I wrote 3-4 years ago when talking about certain topics (A/B testing, viral marketing, etc.).

Example: http://blog.socialcam.com/mobile-ab-testing-made-easy

This isn't to say the article SocialCam linked to is something fantastic. The content is really basic stuff that anyone who has taken one or two statistics classes knows. Its success is more in how it is written and explained than in the content. In other words: digestibility and clarity are features.

Anyhow, I'm done yammering. Nice article! I've printed out a handful of academic papers related to the topic.

Cheers!