by mistermann
3918 days ago
> Then when you see the actual datasets, they're maybe a couple hundred gigs completely denormalized. Yet you still have to use hadoop/hive/spark to access them, with all the inefficiencies, complexity, and slowness those bring.

I was always under the impression that one of the benefits of NoSQL was its speed, but while watching a webcast the other day that queried a very small dataset, I was shocked at how slow it was. This was in contrast to another demo where a different query was mind-bogglingly fast compared to a traditional SQL platform. (Yes, I know the particulars matter here and it's not that good of a question without that specificity, but any light you could shine on this would be appreciated.) For data of "a couple hundred gigs", what platform would you say is more appropriate?
As for what to do: Postgres / MySQL, pandas / R, or roll your own code, depending on precisely what you need. But you can rack a pretty beefy box with 256 GB of RAM, 2 Xeons, and a ton of SSD + spindle disk for $10k. There's nothing NoSQL, Hadoop, or Spark do that can't be done more easily, written way faster, executed faster, and kept running with less effort on a single box or, even better, in a single process.
For example: at my current gig, I work on 20-40 GB raw datasets. Ingesting into pandas and externalizing the user agent strings drops that to 5 GB or so. That process takes 30 to 60 minutes, but I do it once, cache the results, and update incrementally.
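
To make the "externalize user agent strings" step concrete, here's a rough sketch of one way it could look in pandas. This is my guess at the idea, not the actual pipeline: I'm assuming "externalize" means storing each distinct string once in a lookup table and keeping only a small integer ID per row. The helper name `externalize_column`, the column names, and the sample rows are all made up for illustration.

```python
import io

import pandas as pd


def externalize_column(df, column):
    """Swap a repetitive string column for integer IDs plus a lookup table.

    Hypothetical helper: returns (compact_frame, lookup_frame).
    Missing values get ID -1, which is how pd.factorize marks them.
    """
    codes, uniques = pd.factorize(df[column])
    # One row per distinct string; the row position is the ID.
    lookup = pd.DataFrame({column: list(uniques)})
    lookup.index.name = column + "_id"
    # Drop the wide string column and keep a 4-byte ID instead.
    compact = df.drop(columns=[column])
    compact[column + "_id"] = codes.astype("int32")
    return compact, lookup


# Tiny made-up sample standing in for a raw log file.
raw_csv = io.StringIO(
    "ts,user_agent\n"
    "1,Mozilla/5.0 (X11; Linux x86_64)\n"
    "2,curl/7.43.0\n"
    "3,Mozilla/5.0 (X11; Linux x86_64)\n"
    "4,Mozilla/5.0 (X11; Linux x86_64)\n"
)
raw = pd.read_csv(raw_csv)
compact, user_agents = externalize_column(raw, "user_agent")

# The original strings can be recovered by looking the IDs up again.
restored = user_agents["user_agent"].iloc[compact["user_agent_id"]].tolist()
```

The savings come from the repetition: a few thousand distinct user agents across many millions of rows means the long strings are stored once instead of per row. For the "do it once, cache the results" part, something like `compact.to_pickle(...)` and `user_agents.to_pickle(...)` would do, so later runs can skip the slow ingest.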