by x0x0
3919 days ago
That's bog standard -- every company uses Hadoop. Then when you see the actual datasets, they're maybe a couple hundred gigs completely denormalized. Yet you still have to use Hadoop/Hive/Spark to access them, with all the inefficiencies, complexity, and slowness those bring.

One very annoying thing Trey skipped (he got the first two only) is that the big divisions in the data science field are data scientist/analysis, data scientist/builder, and data engineer/ETL. Data scientists' work sits on top of a giant batch of data engineering, and companies often (imo intentionally) try to hire data scientists by dangling interesting analysis or implementation work, but when you dig deep enough or, worse, accept the job offer, it's really 80%+ data engineering. (And they get pissy when you quit two months in after discovering this, both because that's not what you want to do and because relationships founded on lies tend not to work out well for employees.)

The other very difficult thing you get is project tests; it's hard to test something deeply in 5 hours. Even when companies claim to want to test statistics knowledge, the tests almost always turn out to be dominated by data ingestion/cleaning work. Or they're simply too much work, e.g. Stitchfix wanted me to spend 10+ hours implementing an analysis after just speaking to a recruiter, without my even having spoken to one of their data scientists because they were "too busy". The recruiter was grumpy when I stopped responding to email.
I was always under the impression that one of the benefits of NoSQL was its speed, but while watching a webcast the other day that queried a very small dataset, I was shocked at how slow it was. This was in contrast to another demo where a different query was mind-bogglingly fast compared to the same kind of query on a traditional SQL platform. (Yes, I know the particulars matter here and it's not that good of a question without that specificity, but any light you could shine on this would be appreciated.)
For data of "a couple hundred gigs", what platform would you say is more appropriate?