| HN Mirror



	by otterk10 2543 days ago
	Scott here from ClearBrain - the ML engineer who built the underlying model behind our causal analytics platform. We’re really excited to release this feature after months of R&D. Many of our customers want to understand the causal impact of their products, but can't iterate quickly enough by running A/B tests. Rather than taking the easy path and serving correlation-based insights, we took the harder approach of automating causal inference through what's known as an observational study, which can simulate A/B experiments on historical data and eliminate spurious effects. This involved a mix of linear regression, PCA, and large-scale custom Spark infra. Happy to share more about what we did behind the scenes!

3 comments

6gvONxR4sf7o 2542 days ago

>observational study, which can simulate A/B experiments

This is 100% overselling. Observational studies can be suggestive, but cannot replace experiments. Unobserved variables cannot be accounted for.
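A minimal simulation sketch of the unobserved-variable problem, under stated assumptions: this is not ClearBrain's model, and the data-generating process, the coefficients, and the helper names `ols` and `simulate` are all hypothetical. A confounder `u` (think "user engagement") drives both the treatment and the outcome. Regressing on the treatment alone overstates the effect; adding `u` as a covariate recovers it, which is only possible when `u` is actually recorded.

```python
import random

def ols(X, y):
    """Least squares via the normal equations (X'X) b = X'y, solved by Gauss-Jordan."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def simulate(n=20000, true_effect=1.0, seed=0):
    """Hypothetical data: u raises both the chance of treatment and the outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.gauss(0, 1)                             # confounder
        t = 1.0 if u + rng.gauss(0, 1) > 0 else 0.0     # treatment depends on u
        y = true_effect * t + 2.0 * u + rng.gauss(0, 1)  # outcome depends on t and u
        rows.append((t, u, y))
    return rows

rows = simulate()
y = [r[2] for r in rows]
# u left out of the regression (as if it were never recorded): biased upward
naive = ols([[1.0, t] for t, u, _ in rows], y)[1]
# u included as a covariate: close to the true effect of 1.0
adjusted = ols([[1.0, t, u] for t, u, _ in rows], y)[1]
print(f"naive={naive:.2f} adjusted={adjusted:.2f}")
```

If `u` is never logged, only the first regression can be run, and nothing in the observed data reveals how far off it is.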


otterk10 2542 days ago

Thanks for the feedback! I totally agree that observational studies are suggestive and don't replace A/B tests - that’s why the main use case I listed in the blog (and how current customers have used the product so far) is “prioritization of A/B tests”, not replacing A/B tests themselves. The language around “simulating A/B tests” is just a way to explain the idea concisely, at a high level, to someone who may not be very technical or have much experience with causal inference. Happy to take suggestions on how to better convey this without overselling!


bertil 2542 days ago

I’ve noticed two questions on Twitter:

- Do you use a causal graph? Would it make sense?

- Spark seems like overkill for what you yourself describe as regression: is there something more intensive here that we could be missing?


otterk10 2542 days ago

Our analysis runs over our users’ customer data (usually collected through either a tag manager or a CDP such as Segment), which is a few petabytes of data for some of our larger customers. The reason for using Spark is to quickly transform this massive amount of raw data into an ML-ready format. You’re correct that the regression itself does not need to be done inside of Spark.
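As a rough illustration of what "ML-ready format" could mean here, the sketch below pivots a raw event log into a per-user matrix of event counts. It is plain Python rather than Spark, and the `(user_id, event_name)` schema, the function name `build_feature_matrix`, and the choice of counts as features are assumptions for illustration, not details from the thread. At petabyte scale the same group-and-pivot step is the kind of work a distributed engine would take on.

```python
from collections import Counter, defaultdict

def build_feature_matrix(events):
    """Pivot (user_id, event_name) pairs into one row of event counts per user.

    Returns (user_ids, feature_names, matrix), where matrix[i][j] is how many
    times user_ids[i] triggered feature_names[j].
    """
    counts = defaultdict(Counter)
    for user_id, event_name in events:
        counts[user_id][event_name] += 1
    feature_names = sorted({name for c in counts.values() for name in c})
    user_ids = sorted(counts)
    # Counter returns 0 for events a user never triggered
    matrix = [[counts[u][f] for f in feature_names] for u in user_ids]
    return user_ids, feature_names, matrix

# Hypothetical raw events
events = [
    ("u1", "view_pricing"),
    ("u1", "view_pricing"),
    ("u2", "signup"),
    ("u1", "signup"),
]
user_ids, feature_names, matrix = build_feature_matrix(events)
```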


otterk10 2542 days ago

We didn’t explore causal graphs because doing so would require manually creating a causal graph for each relationship that you wish to explore. Our goal was to create an automated approach that could provide an estimate of the treatment effect for any page/event within your app.


lootsauce 2543 days ago

Would love to hear more about the architecture and ML behind your approach. We have been doing more ML in BigQuery and it has been a great fit for us.


otterk10 2543 days ago

Good to hear! In my experience, BigQuery ML (and other cloud ML products) are great for creating basic models out of the box, but they don't provide a ton of flexibility for non-standard ML use cases. For example, our approach to causal analytics requires steps such as dimensionality reduction and computing a covariance matrix, which are not available through BigQuery ML.
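For readers unfamiliar with the two operations named above, here is a small stdlib-only sketch: a sample covariance matrix, and the leading principal component found by power iteration. It only illustrates the operations. The function names are hypothetical, and a production pipeline would use a linear-algebra library rather than nested lists.

```python
import math

def covariance_matrix(X):
    """Return (sample covariance matrix, mean-centered rows) for a row-major matrix X."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in X]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered

def top_component(cov, iters=200):
    """Leading eigenvector of cov by power iteration.

    Assumes cov is nonzero and the start vector is not orthogonal
    to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(centered, v):
    """Reduce each centered row to its coordinate along v."""
    return [sum(a * b for a, b in zip(row, v)) for row in centered]

# Two perfectly correlated features collapse onto a single direction
X = [[1, 2], [2, 4], [3, 6], [4, 8]]
cov, centered = covariance_matrix(X)
v = top_component(cov)
scores = project(centered, v)
```

In this toy input the second column is twice the first, so the leading component points along (1, 2) and one coordinate per row carries all the variance.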

So what we've done instead is create a SparkML task that reads in a feature matrix, then trains and scores the causal analytics model. The causal lift estimates for each user are then written out to BigQuery so that, in our frontend, a customer can filter for, say, users between the ages of 18 and 35, and within seconds we'll return the causal lift of viewing page X for this segment.
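A toy sketch of that last step, assuming the per-user lift estimates are already stored and that a segment's lift is the mean over matching users. The averaging rule, the row schema, and the name `segment_lift` are assumptions, since the comment does not say how per-user estimates are aggregated.

```python
def segment_lift(rows, predicate):
    """Mean of the stored per-user 'lift' over users matching predicate.

    Returns None when no user matches. Averaging is an assumption here.
    """
    selected = [r["lift"] for r in rows if predicate(r)]
    return sum(selected) / len(selected) if selected else None

# Hypothetical precomputed per-user estimates
rows = [
    {"user_id": "u1", "age": 25, "lift": 0.10},
    {"user_id": "u2", "age": 30, "lift": 0.20},
    {"user_id": "u3", "age": 50, "lift": 0.40},
]
young = segment_lift(rows, lambda r: 18 <= r["age"] <= 35)
nobody = segment_lift(rows, lambda r: r["age"] > 90)
```

Because the expensive modeling runs ahead of time, the interactive query reduces to a filter and an aggregate, which fits the "within seconds" response described above.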
