

by boxcarr 2338 days ago

I'm the person who did the original prototype in Python/Pandas. The purpose of the prototype was to prove that with the right data presentation, you could process more data and cut processing and storage by over two orders of magnitude. I picked search rankings given the size of the data and the struggle the legacy system had in terms of showing more than a limited amount of data.

Previously, the MySQL solution had many rows for each ranking spanning several tables with all kinds of other data not related to showing the aggregate statistics that were relevant to Moz's users.

The solution was to process the raw data in batch and apply categorization so that every value had an integer representation. These changes led to a very compact representation that could quickly be loaded into memory and then filtered/aggregated.
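To make the categorization step concrete, here is a minimal sketch in Pandas. It is not the original prototype code: the column names, sample rows, and the `encode_column` helper are all hypothetical, chosen only to show how string values can be replaced by integer codes and then filtered/aggregated.

```python
import pandas as pd

# Hypothetical raw ranking rows; the columns and values are illustrative only.
raw = pd.DataFrame({
    "keyword": ["red shoes", "blue shoes", "red shoes", "blue shoes"],
    "engine": ["google", "google", "bing", "bing"],
    "rank": [3, 12, 5, 40],
})


def encode_column(series):
    """Map each distinct value to a small integer code.

    Returns the integer codes plus the lookup list needed to decode them.
    (Hypothetical helper, not part of Pandas.)
    """
    cat = series.astype("category")
    codes = cat.cat.codes.astype("int32")
    lookup = list(cat.cat.categories)
    return codes, lookup


keyword_codes, keyword_lookup = encode_column(raw["keyword"])
engine_codes, engine_lookup = encode_column(raw["engine"])

# The compact table holds integers only; the lookups are stored separately.
compact = pd.DataFrame({
    "keyword": keyword_codes,
    "engine": engine_codes,
    "rank": raw["rank"].astype("int32"),
})

# Filter and aggregate on the integer table: mean rank per engine,
# counting only rankings in the top 20.
top20 = compact[compact["rank"] <= 20]
mean_rank = top20.groupby("engine")["rank"].mean()
```

The lookup lists only need to be consulted when results are shown to a user, so the hot path works entirely on small integer columns.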

Storing the results as a CSV wasn't important. It just turned out that having static data allowed me to effortlessly scale out serving, since the data was only updated once a day or once a week, and it was append-only.

How big of a difference did it make? All of the CSVs, individually compressed, came to less than 20GB. The production system served all user data off of a ~60 node MySQL cluster, with rankings being the most costly in terms of processing and 2nd in terms of disk usage (from what I remember).

Also, Pandas was blazing fast at loading several megabytes of CSV data (< 80ms at the time). If I had to do it again today, I'd probably use Apache Parquet instead.
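As a rough illustration of the load-then-aggregate pattern, the sketch below reads a small compressed CSV of integer-coded rows into memory and aggregates it. The gzip format, the column names, and the `load_rankings` helper are assumptions made for the example; the comment does not say which compression or schema was used.

```python
import gzip
import io

import pandas as pd

# Hypothetical integer-coded rows, as a batch job might have written them.
csv_text = (
    "day,keyword,engine,rank\n"
    "0,1,0,3\n"
    "0,0,0,12\n"
    "1,1,0,4\n"
    "1,0,0,9\n"
)

# Stand-in for one compressed file on disk (gzip is an assumption).
compressed = gzip.compress(csv_text.encode("utf-8"))

# Explicit dtypes keep every column as a compact 32-bit integer.
DTYPES = {"day": "int32", "keyword": "int32", "engine": "int32", "rank": "int32"}


def load_rankings(data):
    """Load one gzip-compressed CSV (given as bytes) into a DataFrame."""
    return pd.read_csv(io.BytesIO(data), compression="gzip", dtype=DTYPES)


df = load_rankings(compressed)

# Example aggregate: the best (lowest) rank seen on each day.
best_rank_per_day = df.groupby("day")["rank"].min()
```

With real files, `pd.read_csv` can also be given a path directly; the in-memory buffer is used here only to keep the sketch self-contained.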

The most important insight that carried through the various solutions was to pre-process the data so that it was easily consumable for the task at hand. The languages (Python/Elixir) didn't make a difference, in my opinion. That said, Pandas is fantastic; it made working with that data in-memory very easy, at least for this prototype.