Hacker News new | ask | show | jobs
by otterk10 2499 days ago
Our analysis runs over our user’s customer data (usually collected through either a tag manager or a CDP such as Segment), which is a few petabytes of data for some of our larger customers. The reason for using Spark is to quickly transform this massive amount of raw data into a ML-ready format. You’re correct that the regression itself does not need to be done inside of Spark.