by threeseed
3983 days ago
I work exclusively in Fortune 500 size enterprises. The big EDW that you used to find powering everything has been broken up over the years into unintegrated silos, e.g. ERP, Web, Salesforce, Payroll. The big trend now is to reintegrate all this data and do analytics on it. That requires (a) major ETL work between completely different schemas, then (b) your data science/analytics work, all in semi-real time. The article is referring to this type of workload, since it is Spark's bread and butter. You land the data in HDFS, use Spark SQL to run ETL/analytics jobs, and then output the results in a single enterprise view for reporting, marketing etc. And yes, this is identical to what Twitter's analytics team would be doing. With cloud tools from Azure, IBM and Amazon, this sort of analytics is going to become much more commonplace. All using SQL the language but not SQL the database.
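To make the (a) ETL then (b) analytics shape concrete, here is a rough sketch rather than anything from a real deployment. The table names, columns and the `'C-'` key prefix are all made up for illustration, and `sqlite3` is used only as a stand-in so the snippet runs without a cluster; in the setup described above, the same kind of SQL would run through Spark SQL over data landed in HDFS.

```python
import sqlite3

# sqlite3 is only a stand-in so this sketch runs anywhere; in the workflow
# described above the SQL would be executed by Spark SQL over data in HDFS.
conn = sqlite3.connect(":memory:")

# Two hypothetical silos that describe the same customers with different schemas.
conn.executescript("""
CREATE TABLE erp_orders (cust_no INTEGER, amount_cents INTEGER);
CREATE TABLE crm_accounts (account_id TEXT, account_name TEXT);
INSERT INTO erp_orders VALUES (1, 1250), (1, 750), (2, 3000);
INSERT INTO crm_accounts VALUES ('C-1', 'Acme'), ('C-2', 'Globex');
""")

# (a) ETL: reconcile the two customer keys and convert cents to currency units.
# (b) Analytics: aggregate spend per account into a single combined view.
enterprise_view = conn.execute("""
SELECT a.account_name,
       SUM(o.amount_cents) / 100.0 AS total_spend
FROM erp_orders AS o
JOIN crm_accounts AS a
  ON a.account_id = 'C-' || o.cust_no
GROUP BY a.account_name
ORDER BY a.account_name
""").fetchall()

print(enterprise_view)
```

The point is only that the reconciliation between schemas and the aggregation are both expressed in SQL the language, with no relational warehouse required to hold the result.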
So yes, even we are building out a pretty beefy internal Hadoop cluster, so I would never say that it will be all-relational-all-the-time. But my point was more that there will be copious amounts of SSAS cubes and Oracle warehouses for the foreseeable future. They work great for their use cases and they have well known problems with well known solutions. Doing what Twitter's team is doing when you aren't Twitter might not be the best idea for everyone, after all.
In our case, we use Teradata for our work and it's quite capable of handling very large workloads, and thus we currently have no plans to spin it down in favor of the new hotness. (Even though the new Hadoop cluster positively dwarfs our TD appliance.) I'd say we have a mixture of both on the horizon, if only because our DBAs are less than cooperative about Java UDFs, so Hadoop is the easiest way for us to do complex processing against our fairly large data set.