| HN Mirror


by x0x0 3966 days ago

That's bog standard -- every company uses hadoop. Then when you see the actual datasets, they're maybe a couple hundred gigs completely denormalized. Yet you still have to use hadoop/hive/spark to access them, with all the inefficiencies, complexity, and slowness those bring.

One of the things Trey skipped (he got only the first two), and one that is very annoying, is that the big breaks in the data science field are data scientist / analysis; data scientist / builder; and data engineer / ETL. Data scientists' work sits on top of a giant batch of data engineering, and companies often (imo intentionally) try to hire data scientists by dangling interesting analysis or implementation work, but when you dig deep enough or, worse, accept the job offer, it's really 80%+ data engineering. (And they get pissy when you quit two months in after discovering this, both because it's not the work you want to do and because relationships founded on lies tend not to work out well for employees.)

The other very difficult thing you get is project tests; it's hard to test something deeply in 5 hours. Even when companies claim to want to test statistics knowledge, the tests almost always turn out to be dominated by data ingestion/cleaning work. Or they're simply too much work. E.g., Stitchfix wanted me to spend 10+ hours implementing an analysis after I had spoken only to a recruiter, without having spoken to even one of their data scientists, because they were "too busy". The recruiter was grumpy when I stopped responding to email.

1 comments

mistermann 3966 days ago

> Then when you see the actual datasets, they're maybe a couple hundred gigs completely denormalized. Yet you still have to use hadoop/hive/spark to access them, with all the inefficiencies, complexity, and slowness those bring.

I was always under the impression that one of the benefits of NoSQL was its speed, but watching a webcast the other day of a query against a very small dataset, I was shocked at how slow it was. This was in contrast to another demo where a different query was mind-bogglingly fast compared to a traditional SQL platform. (Yes, I know the particulars matter here and it's not that good of a question without that specificity, but any light you could shine on this would be appreciated.)

For data of "a couple hundred gigs", what platform would you say is more appropriate?


x0x0 3965 days ago

No, the benefit of nosql, at least for data science, is scalability, i.e. what you do when you can't fit the data on a single machine. This worked great at a former employer, who really did have PB-scale datasets. The vast, vast majority of companies do not have PB-scale datasets. Most don't even have TB-scale datasets.

As for what you do: postgres / mysql; pandas / R; or roll your own code, depending on precisely what you need. You can rack a pretty beefy box with 256 GB of RAM, 2 Xeons, and a ton of SSD + spindle disk for $10k. There's nothing nosql or hadoop or spark do that can't be done more easily, written way faster, executed faster, and kept running more easily on a single box or, even better, in a single process.

For example: at my current gig, I work on 20-40 GB raw datasets. Ingesting them into pandas and externalizing the user agent strings drops that to 5 GB or so. That process takes 30 to 60 minutes, but I do it once, cache the results, and update incrementally.
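A minimal sketch of what that kind of workflow could look like, not the poster's actual code. It assumes "externalizing" means swapping the repeated user agent strings for small integer ids plus a separate lookup list, and it assumes a pickle file as the cache. The names `externalize_column`, `update_cache`, `user_agent`, and the cache layout are all hypothetical.

```python
import os
import pickle

import pandas as pd


def externalize_column(df, column, lookup=None):
    """Replace a repetitive string column with int32 ids.

    Returns (compact_df, lookup), where lookup[i] is the string for id i.
    Passing the previous lookup keeps ids stable across incremental updates.
    Missing values get id -1.
    """
    known = list(lookup) if lookup is not None else []
    seen = set(known)
    # Only values not seen in earlier batches get new ids.
    new_values = [v for v in df[column].dropna().unique() if v not in seen]
    lookup = known + new_values
    codes = pd.Categorical(df[column], categories=lookup).codes
    compact = df.drop(columns=[column])
    compact[column + "_id"] = codes.astype("int32")
    return compact, lookup


def update_cache(cache_path, new_rows, column="user_agent"):
    """Append new raw rows to the cached compact table, creating it if absent."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            cached = pickle.load(fh)
        table, lookup = cached["table"], cached["lookup"]
    else:
        table, lookup = None, []
    compact, lookup = externalize_column(new_rows, column, lookup)
    if table is None:
        table = compact
    else:
        table = pd.concat([table, compact], ignore_index=True)
    with open(cache_path, "wb") as fh:
        pickle.dump({"table": table, "lookup": lookup}, fh)
    return table, lookup
```

The first call pays the full ingest cost; later calls only process the new rows and reuse the existing ids. The space saving comes from storing each distinct string once instead of once per row, so how much it helps depends on how repetitive the column is.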


bane 3966 days ago

Postgres or, depending on the particulars, just start rolling your own.
