Hacker News
by dxbydt 4170 days ago
I think putting Big Data & ML in one Bucket is a Big Mistake, pun intended. From where I am (DS at an SV startup), I see a few discrete Big Buckets -

1. Offline Big Data - This is mostly the ETL crowd - Scalding, Cascading, Spark & associated novel startups, who provide technology to run MapReduce jobs on TBs & PBs of data. This isn't going away anytime soon. Investment banks, enterprises & financial institutions are the big customers, with risk analysis (VaR, CVaR) & large-scale Monte Carlo scenarios on diverse financial instruments being commonplace.

2. Online Big Data - Storm, Summingbird & friends - continually ingesting high-volume realtime data streams to provide realtime insights, which can be substantiated by #1 later, as and when those jobs run. E.g. say you ingest tweets in realtime via a Storm pipeline & produce a running time series of how many tweets came from which city. Meanwhile, you squirrel away these tweets in HDFS so the offline MR job runs later & gives you exact counts.

3. Small-data ML - The result of #1 is typically a dataset of modest size (a few MB to a few GB) that can be ingested into your favorite ML solution (too numerous to mention) for predictive analysis & BI purposes.

4. Soft "AI" - Using #2 + #3 in intelligent ad serving, traffic routing, realtime pricing to match inventory (e.g. there are several hotels in Las Vegas that reprice rooms based on the number of passengers on commercial flights arriving into Vegas, local weather (sunny, rainy etc.), industry convention dates & such - all the ML + AI done out of a tiny office in SF), electricity regulation (https://news.ycombinator.com/item?id=8280315) etc.

5. AI without the quotes - tiny startups using RNNs to predict time series, using CNNs for image captioning & other really nifty AI applications not currently commercially exploitable at scale but definitely primed for acqui-hire.
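
To make the #2-then-#1 handoff concrete, here is a rough Python sketch of the tweet example: an online path that keeps a running per-minute count by city as records arrive, and an offline path that recounts everything from the archived copy later. This is not Storm or MapReduce code. The `TWEETS` records, the `StreamingCityCounter` class, the `batch_exact_counts` function and the in-memory `archive` list (standing in for HDFS) are all made up for illustration, and in this toy version the two paths trivially agree, whereas a real streaming pipeline may drop, duplicate or delay records.

```python
from collections import Counter, defaultdict

# Hypothetical tweet records: (epoch_seconds, city). Not a real Twitter payload.
TWEETS = [
    (0, "SF"), (10, "NYC"), (59, "SF"),
    (61, "SF"), (75, "LA"), (119, "NYC"),
]


class StreamingCityCounter:
    """Online path: update per-minute counts as each tweet arrives."""

    def __init__(self, bucket_seconds=60):
        self.bucket_seconds = bucket_seconds
        self.series = defaultdict(Counter)  # bucket start -> city -> count
        self.archive = []                   # in-memory stand-in for the HDFS copy

    def ingest(self, timestamp, city):
        # Round the timestamp down to the start of its time bucket.
        bucket = timestamp - timestamp % self.bucket_seconds
        self.series[bucket][city] += 1
        # Squirrel the raw record away for the offline job.
        self.archive.append((timestamp, city))


def batch_exact_counts(archive):
    """Offline path: recount everything from the archived records."""
    return Counter(city for _, city in archive)


counter = StreamingCityCounter()
for ts, city in TWEETS:
    counter.ingest(ts, city)

# Running time series available immediately from the online path.
running = {bucket: dict(cities) for bucket, cities in sorted(counter.series.items())}
# Exact totals produced later by the offline path.
exact = batch_exact_counts(counter.archive)

print(running)
print(dict(exact))
```

The point of keeping both paths is that the running series is available right away, while the batch recount is the number you trust once it lands.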

1 comment

I agree. I don't mind grouping the different disciplines of big data together so much as I mind putting big data into the same category as machine learning.

The two definitely complement each other, but they are not the same.