| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by otterk10 2499 days ago
	Our analysis runs over our user’s customer data (usually collected through either a tag manager or a CDP such as Segment), which is a few petabytes of data for some of our larger customers. The reason for using Spark is to quickly transform this massive amount of raw data into a ML-ready format. You’re correct that the regression itself does not need to be done inside of Spark.