by otterk10 2496 days ago
Scott here from ClearBrain - the ML engineer who built the underlying model behind our causal analytics platform.

We’re really excited to release this feature after months of R&D. Many of our customers want to understand the causal impact of their products, but can't run A/B tests quickly enough to iterate. Rather than taking the easy path and serving correlation-based insights, we took the harder approach of automating causal inference through what's known as an observational study, which can simulate A/B experiments on historical data and eliminate spurious effects. This involved a mix of linear regression, PCA, and large-scale custom Spark infra. Happy to share more about what we did behind the scenes!

3 comments

>observational study, which can simulate A/B experiments

This is 100% overselling. Observational studies can be suggestive, but cannot replace experiments. Unobserved variables cannot be accounted for.
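
To make the unobserved-variable point concrete, here is a small simulation sketch. Everything in it is an assumption for illustration (the "power user" confounder, the effect sizes, the function names); it is not ClearBrain's model. When the confounder is ignored, the treated-vs-control difference is badly inflated; when it is observed and stratified on, the estimate lands near the true effect. An observational method can only make the second kind of correction for variables it actually records.

```python
import random

def simulate(n=20000, true_effect=1.0, seed=0):
    """Generate (confounder, treated, outcome) rows with a known true effect."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.random() < 0.5                  # hypothetical confounder, e.g. "power user"
        t = rng.random() < (0.8 if u else 0.2)  # power users are more likely to see the page
        y = true_effect * t + 3.0 * u + rng.gauss(0, 1)
        rows.append((u, t, y))
    return rows

def mean(xs):
    return sum(xs) / len(xs)

def diff_in_means(rows):
    """Treated-minus-control difference, ignoring the confounder."""
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return mean(treated) - mean(control)

def stratified_effect(rows):
    """Size-weighted average of within-stratum differences (needs u to be observed)."""
    total = len(rows)
    estimate = 0.0
    for level in (False, True):
        stratum = [r for r in rows if r[0] == level]
        estimate += len(stratum) / total * diff_in_means(stratum)
    return estimate

rows = simulate()
naive = diff_in_means(rows)         # confounder treated as unobserved
adjusted = stratified_effect(rows)  # confounder observed and adjusted for
print(f"true effect 1.0, naive {naive:.2f}, adjusted {adjusted:.2f}")
```

If `u` were never logged, only the naive number would be available, and nothing in the data would reveal how far off it is.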

Thanks for the feedback! I totally agree that observational studies are suggestive and don't replace A/B tests - that’s why the main use case I listed in the blog (and how current customers have used the product so far) is “prioritization of a/b tests”, not replacing A/B tests themselves. The language around “simulating a/b tests” is just a way to concisely explain the idea at a high level to someone who may not be very technical or have much experience with causal inference. Happy to hear suggestions on how to better convey this without overselling!
I’ve noticed two questions on Twitter:

- Do you use a causal graph? Would it make sense?

- Spark seems overkill for what you yourself describe as regression: is there something more intensive here that we could be missing?

Our analysis runs over our users’ customer data (usually collected through either a tag manager or a CDP such as Segment), which is a few petabytes of data for some of our larger customers. The reason for using Spark is to quickly transform this massive amount of raw data into an ML-ready format. You’re correct that the regression itself does not need to be done inside of Spark.
We didn’t explore causal graphs because doing so would require manually creating a causal graph for each relationship that you wish to explore. Our goal was to create an automated approach that could provide an estimate of the treatment effect for any page/event within your app.
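
As a rough picture of what "transforming raw data into an ML-ready format" can mean, the sketch below pivots a tiny in-memory event log into a per-user count matrix. The event names and the `build_feature_matrix` helper are hypothetical, and this single-machine version only stands in for the kind of work a Spark job would do at petabyte scale; the actual pipeline is not described in this thread.

```python
from collections import Counter, defaultdict

# Hypothetical raw event log: (user_id, event_name) pairs
events = [
    ("u1", "view_pricing"), ("u1", "view_pricing"), ("u1", "signup"),
    ("u2", "view_home"),
    ("u3", "view_pricing"), ("u3", "view_home"),
]

def build_feature_matrix(events):
    """Pivot an event log into one row per user and one count column per event."""
    counts = defaultdict(Counter)
    for user_id, event_name in events:
        counts[user_id][event_name] += 1
    columns = sorted({name for _, name in events})
    users = sorted(counts)
    # Counter returns 0 for events a user never fired
    matrix = [[counts[u][c] for c in columns] for u in users]
    return users, columns, matrix

users, columns, matrix = build_feature_matrix(events)
for user, row in zip(users, matrix):
    print(user, dict(zip(columns, row)))
```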
Would love to hear more about the architecture and ML behind your approach. We have been doing more ML in BigQuery and it has been a great fit for us.
Good to hear! In my experience, BigQuery ML (and other cloud ML products) is great for creating basic models out of the box, but doesn't provide a ton of flexibility for non-standard ML use cases. For example, our approach to causal analytics requires doing things such as dimensionality reduction and computing a covariance matrix, which are not available through BigQuery ML.
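
For readers unfamiliar with those two steps, here is a minimal sketch of computing a covariance matrix and using it for dimensionality reduction (the top principal component, found by power iteration). The synthetic data, the `loadings` vector, and all function names are assumptions for illustration; this is plain Python, not the SparkML code described in this thread.

```python
import random

def covariance_matrix(rows):
    """Return column means and the sample covariance matrix of a list of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return means, cov

def top_component(cov, iters=200):
    """Power iteration: approximate the leading eigenvector of cov."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def project(rows, means, v):
    """Reduce each centered row to a single score along direction v."""
    return [sum((r[j] - means[j]) * v[j] for j in range(len(v))) for r in rows]

# Synthetic data: three correlated features driven by one latent factor
rng = random.Random(0)
loadings = [1 / 3, 2 / 3, 2 / 3]  # unit-length direction, chosen for this example
rows = []
for _ in range(2000):
    z = rng.gauss(0, 1)
    rows.append([w * z + rng.gauss(0, 0.1) for w in loadings])

means, cov = covariance_matrix(rows)
pc = top_component(cov)
scores = project(rows, means, pc)  # 3 features reduced to 1
alignment = sum(a * b for a, b in zip(pc, loadings))
print(f"alignment with the latent direction: {alignment:.3f}")
```

The reduced `scores` column could then serve as a covariate in a regression in place of the original correlated features.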

So what we've done instead is create a SparkML task that reads in a feature matrix, then trains and scores the causal analytics model. The causal lift estimates for each user are then written out to BigQuery so that in our frontend a customer can filter for, say, users between the ages of 18-35, and then within seconds we'll return the causal lift of viewing page X for this segment.
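
The filter-then-aggregate step can be pictured roughly as follows. The row fields (`user_id`, `age`, `lift`), the sample values, and the choice to average per-user lifts within a segment are all assumptions made for this sketch; the thread does not say how segment-level lift is actually computed from the BigQuery table.

```python
# Hypothetical per-user lift rows, standing in for the table written to BigQuery
user_lifts = [
    {"user_id": "u1", "age": 22, "lift": 0.10},
    {"user_id": "u2", "age": 30, "lift": 0.20},
    {"user_id": "u3", "age": 50, "lift": 0.40},
]

def segment_lift(rows, predicate):
    """Average the per-user lift over rows matching the segment filter."""
    selected = [r["lift"] for r in rows if predicate(r)]
    if not selected:
        return None  # empty segment: no estimate to report
    return sum(selected) / len(selected)

# Segment: users between the ages of 18 and 35
young = segment_lift(user_lifts, lambda r: 18 <= r["age"] <= 35)
print(young)
```

Because the expensive modeling has already produced one number per user, a segment query like this reduces to a filter plus an aggregate, which is consistent with the "within seconds" response time mentioned above.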