|
I think putting Big Data & ML in one Bucket is a Big Mistake, pun intended.
From where I am (DS at an SV startup), I see a few discrete Big Buckets:

1. Offline Big Data - This is mostly the ETL crowd: Scalding, Cascading, Spark & associated novel startups, which provide technology to run MapReduce jobs on TBs & PBs of data. This isn't going away anytime soon. Investment banks, enterprises & financial institutions are the big customers, with risk analysis (VaR, CVaR) & large-scale Monte Carlo scenarios on diverse financial instruments being commonplace.

2. Online Big Data - Storm, Summingbird & friends: continually ingesting high-volume realtime data streams to provide realtime insights, which can be substantiated by #1 later, as and when those jobs run. For example, say you ingest tweets in realtime via a Storm pipeline & give me a running time series of how many tweets came from which city. Meanwhile, you squirrel away these tweets in HDFS so the offline MR job runs later & gives you exact counts.

3. Small-data ML - The result of #1 is typically a dataset of modest size (a few MB to a few GB) that can be ingested into your favorite ML solution (too numerous to mention) for predictive analysis & BI purposes.

4. Soft "AI" - Using #2 + #3 in intelligent ad serving, traffic routing, realtime pricing to match inventory, electricity regulation (https://news.ycombinator.com/item?id=8280315) etc. For example, there are several hotels in Las Vegas that reprice rooms based on the number of passengers on commercial flights arriving into Vegas, local weather (sunny, rainy etc.), industry convention dates & such, with all the ML + AI done out of a tiny office in SF.

5. AI without the quotes - Tiny startups using RNNs to predict time series, using CNNs for image captioning & other really nifty AI applications that are not currently commercially exploitable at scale but are definitely primed for acqui-hire. |
The two definitely complement each other, but they are not the same.
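
To make the #2-then-#1 handoff concrete, here is a toy Python sketch of the tweet example. It is not Storm or Hadoop code: the `OnlineCityCounter` class, the `batch_exact_counts` function, the tweet dict layout and the in-memory `archive` list (standing in for HDFS) are all made up for illustration. I'm also assuming the stream can redeliver a tweet, so the running counts may overcount until the offline pass deduplicates by id.

```python
from collections import Counter, defaultdict


class OnlineCityCounter:
    """Running per-minute tweet counts by city, updated one event at a time."""

    def __init__(self):
        # minute bucket -> Counter of city -> tweets seen so far
        self.counts = defaultdict(Counter)

    def ingest(self, tweet):
        minute = tweet["ts"] // 60  # ts is seconds since some epoch (assumed)
        self.counts[minute][tweet["city"]] += 1


def batch_exact_counts(archive):
    """Offline pass: drop redelivered tweets by id, then count per city."""
    seen = set()
    totals = Counter()
    for tweet in archive:
        if tweet["id"] in seen:
            continue
        seen.add(tweet["id"])
        totals[tweet["city"]] += 1
    return totals


def run_pipeline(stream):
    """Feed every event to the online counter and squirrel it away for later."""
    online = OnlineCityCounter()
    archive = []  # hypothetical stand-in for HDFS
    for tweet in stream:
        online.ingest(tweet)
        archive.append(tweet)
    return online, archive


stream = [
    {"id": 1, "ts": 0, "city": "SF"},
    {"id": 2, "ts": 10, "city": "NYC"},
    {"id": 2, "ts": 12, "city": "NYC"},  # redelivered duplicate
    {"id": 3, "ts": 65, "city": "SF"},
]

online, archive = run_pipeline(stream)
running = sum(online.counts.values(), Counter())  # realtime view, may overcount
exact = batch_exact_counts(archive)               # offline view, deduplicated
```

The realtime view is available immediately but counts the redelivered NYC tweet twice; the batch view over the archive corrects it whenever that job gets around to running.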