by posix_compliant 2908 days ago
Good post, but I couldn't disagree more. Regardless of your business size, it will always be valuable to know information such as:

* How does every additional coupon-dollar affect the total amount a customer buys?

* What is the relationship between customer age and retention for my store?

* Does giving a customer more purchase options help or hurt their chances of making a purchase?

My experience is that each of these questions can be answered, in part, with 3 lines of Python code:

    from sklearn.linear_model import LinearRegression
    lr = LinearRegression()
    lr.fit(X, y)  # X: feature matrix, y: outcome you care about

Then look at the beta coefficients of the model (lr.coef_), and you have a rough idea of how each feature relates to the outcome. Doing something like this in SQL sounds difficult. If you have data to interpret, it makes sense to use methods like this. I can't think of a case where you have data but should refuse to look at it until your company is "bigger".
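
For anyone who wants to try the coefficient-reading step, here is a minimal sketch on made-up data. The feature names (coupon_dollars, age), the "true" effects of 2.5 and 0.3, and the noise level are all assumptions for illustration, not numbers from a real store.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, invented purely for illustration: two independent features.
rng = np.random.default_rng(0)
n = 1000
coupon_dollars = rng.uniform(0, 20, size=n)
age = rng.uniform(18, 70, size=n)

# Assumed ground truth: +2.5 spend per coupon dollar, +0.3 per year of age.
spend = 10.0 + 2.5 * coupon_dollars + 0.3 * age + rng.normal(0, 1.0, size=n)

X = np.column_stack([coupon_dollars, age])
y = spend

lr = LinearRegression()
lr.fit(X, y)

# Pair each fitted coefficient with its feature name for readability.
coefficients = dict(zip(["coupon_dollars", "age"], lr.coef_))
for name, beta in coefficients.items():
    print(f"{name}: {beta:.3f}")
print(f"intercept: {lr.intercept_:.3f}")
```

Because the two features here are generated independently, the fitted coefficients should land close to the assumed effects. Real data is rarely this clean.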
2 comments

I overall agree that ML is needed over 'just SQL' in a lot of cases (though SQL + good visualizations / exploratory analysis can answer a lot of those questions qualitatively). I would also be careful with the linear model approach. Multicollinearity can hide how important a feature is (or even flip the sign of its coefficient) when you use coefficients to interpret importance, so using a linear model like that isn't as straightforward as it seems.
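
A rough way to see the multicollinearity problem is to refit the same model on many synthetic samples and watch how much one coefficient moves. This is only a sketch: the helper coef_spread, the sample sizes, and the 0.05 noise scale that makes x2 a near-copy of x1 are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def coef_spread(correlated, n_trials=200, n=200, seed=0):
    """Refit on fresh synthetic samples; return (std of first coefficient,
    fraction of fits where that coefficient came out negative)."""
    rng = np.random.default_rng(seed)
    first_coefs = []
    for _ in range(n_trials):
        x1 = rng.normal(size=n)
        if correlated:
            x2 = x1 + rng.normal(scale=0.05, size=n)  # near-duplicate of x1
        else:
            x2 = rng.normal(size=n)                   # independent of x1
        # Both features truly have a positive effect of 1.0 on y.
        y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
        X = np.column_stack([x1, x2])
        first_coefs.append(LinearRegression().fit(X, y).coef_[0])
    first_coefs = np.array(first_coefs)
    return float(first_coefs.std()), float((first_coefs < 0).mean())

spread_indep, neg_indep = coef_spread(correlated=False)
spread_corr, neg_corr = coef_spread(correlated=True)
print(f"independent features: std={spread_indep:.3f}, negative fits={neg_indep:.0%}")
print(f"correlated features:  std={spread_corr:.3f}, negative fits={neg_corr:.0%}")
```

In the correlated case the coefficient swings far more from sample to sample, and some fits report a negative sign even though the assumed true effect is positive.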

As a workaround, you could look for high VIF (variance inflation factor) values to detect multicollinearity, use some sort of stepwise selection / penalized regression, or use something like relaimpo (https://cran.r-project.org/web/packages/relaimpo/index.html) - not sure of a Python equivalent - to judge overall feature importance in the model.
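
The VIF check can be sketched by hand with scikit-learn: regress each feature on the others and compute 1 / (1 - R^2). The vif helper below is a hypothetical name written for this example, not a library function, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2), where R^2 comes
    from regressing that column on all the other columns."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        # Note: a perfectly collinear column (r2 == 1) would divide by zero.
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic features: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)

vifs = vif(np.column_stack([x1, x2, x3]))
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a warning sign. Here x1 and x2 should come out far above that, while x3 stays near 1.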

Ha - maybe a SQL layer is lurking behind the scenes, crafting the input variables that make your little Python script so powerful.