Our analysis runs over our users’ customer data (usually collected through a tag manager or a CDP such as Segment), which can reach a few petabytes for some of our larger customers. We use Spark to quickly transform this massive amount of raw data into an ML-ready format. You’re correct that the regression itself does not need to run inside Spark.
We didn’t explore causal graphs because that would require manually building a graph for each relationship you want to examine. Our goal was to create an automated approach that could provide an estimate of the treatment effect for any page/event within your app.
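
As a rough illustration of what "automated" can mean here, the sketch below loops over every event found in a small per-user table and produces one effect estimate per event, with no hand-built graph. It is a toy under stated assumptions, not the method described above: it swaps in a stratified difference in means in place of a regression, and the row layout, the `plan` confounder, the `converted` outcome, and the function names are all hypothetical. The in-memory list stands in for the ML-ready table that the Spark step would produce.

```python
from collections import defaultdict

# Hypothetical ML-ready table: one row per user, with the set of events
# the user triggered, one assumed confounder ("plan"), and a binary outcome.
rows = [
    {"plan": "free", "events": {"pricing_page"}, "converted": 1},
    {"plan": "free", "events": {"pricing_page"}, "converted": 0},
    {"plan": "free", "events": set(), "converted": 0},
    {"plan": "free", "events": {"docs_page"}, "converted": 0},
    {"plan": "paid", "events": {"pricing_page", "docs_page"}, "converted": 1},
    {"plan": "paid", "events": {"docs_page"}, "converted": 1},
    {"plan": "paid", "events": set(), "converted": 0},
    {"plan": "paid", "events": {"pricing_page"}, "converted": 1},
]


def stratified_effect(rows, event, outcome="converted", stratum="plan"):
    """Difference in mean outcome between users who did and did not trigger
    `event`, computed within each stratum and weighted by stratum size."""
    groups = defaultdict(lambda: {True: [], False: []})
    for row in rows:
        treated = event in row["events"]
        groups[row[stratum]][treated].append(row[outcome])

    weighted_sum, total = 0.0, 0
    for group in groups.values():
        treated, control = group[True], group[False]
        if not treated or not control:
            continue  # no overlap in this stratum, so it tells us nothing
        n = len(treated) + len(control)
        diff = sum(treated) / len(treated) - sum(control) / len(control)
        weighted_sum += n * diff
        total += n
    return weighted_sum / total if total else None


def estimate_all(rows, **kwargs):
    """Run the same estimator for every event seen in the data."""
    events = sorted({e for row in rows for e in row["events"]})
    return {e: stratified_effect(rows, e, **kwargs) for e in events}


effects = estimate_all(rows)
for event, effect in effects.items():
    print(event, effect)
```

Stratifying on a single covariate only removes confounding that runs through that covariate, so a real pipeline would need a richer adjustment set; the point of the sketch is only that the same estimator can be applied to every page/event without per-relationship manual work.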