Hacker News
by bertil 2495 days ago
I’ve noticed two questions on Twitter:

- Do you use a causal graph? Would it make sense?

- Spark seems like overkill for what you yourself describe as a regression: is there something more intensive here that we could be missing?

2 comments

Our analysis runs over our users’ customer data (usually collected through either a tag manager or a CDP such as Segment), which can reach a few petabytes for some of our larger customers. We use Spark to quickly transform this massive amount of raw data into an ML-ready format. You’re correct that the regression itself does not need to be done inside Spark.
We didn’t explore causal graphs because doing so would require manually creating a causal graph for each relationship that you wish to explore. Our goal was to create an automated approach that could provide an estimate of the treatment effect for any page/event within your app.
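
To make the two-stage shape concrete, here is a minimal sketch of that kind of pipeline. It is not our production code. Plain Python stands in for the Spark transformation, and every name in it (`build_features`, `ols`, `estimate_effect`, the event names) is hypothetical. It collapses raw event rows into one row per user, then regresses the outcome on a treatment indicator plus one covariate. The coefficient only reads as a treatment effect if there is no unmeasured confounding.

```python
from collections import defaultdict


def build_features(events, treatment_event, outcome_event):
    """Collapse raw (user_id, event_name) rows into one feature row per user.

    This is the step that would run in Spark at scale; here it is a plain loop.
    """
    rows = defaultdict(lambda: {"treated": 0.0, "converted": 0.0, "other": 0.0})
    for user_id, name in events:
        row = rows[user_id]
        if name == treatment_event:
            row["treated"] = 1.0
        elif name == outcome_event:
            row["converted"] = 1.0
        else:
            row["other"] += 1.0  # crude activity covariate (an assumption)
    return rows


def ols(X, y):
    """Ordinary least squares via the normal equations and Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


def estimate_effect(events, treatment_event, outcome_event):
    """Return the regression coefficient on the treatment indicator."""
    rows = build_features(events, treatment_event, outcome_event)
    X = [[1.0, r["treated"], r["other"]] for r in rows.values()]
    y = [r["converted"] for r in rows.values()]
    return ols(X, y)[1]
```

Because the feature step is parameterized by event name, the same call can be repeated for any page or event (for example `estimate_effect(events, "pricing_page", "purchase")`) without hand-drawing a graph for each one. Only the small per-user table reaches the regression, which is why that part does not need a cluster.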