| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by x0x0 3966 days ago

no, the benefit of nosql, at least for data science, is scalability. ie what do you do when you can't fit the data on a single machine. This works great at a former employer, who really did have pb scale datasets. The vast vast majority of companies do not have pb scale datasets. Most don't have tb datasets.

as for what do you do, postgres / mysql; pandas /R; or roll your own code depending on precisely what you need. But you can rack a pretty beefy box with 256g ram in it, 2 xeons, and a ton of ssd + spindle disk for $10k. Nothing that nosql or hadoop or spark do can't be done easier, written way faster, executed faster, and kept running more easily on a single box or even better in a single process.

For example: at my current gig, I work on 20-40g raw datasets. Ingest to pandas and externalize user agent strings drops it to 5g or so. That process takes 30 to 60 minutes, but I do it once, cache the results, and update incrementally.