| HN Mirror


by threeseed 3983 days ago

I work exclusively in Fortune 500 size enterprises.

And the big EDW that you used to find powering everything has been broken up over the years into unintegrated silos, e.g. ERP, Web, Salesforce, Payroll etc. The big trend now is to reintegrate all this data and do analytics on it. This requires (a) major ETL work between completely different schemas, then (b) your data science/analytics work, all in semi-real time.

This article is referring to this type of workload, since this is Spark's bread and butter. You land the data in HDFS, use Spark SQL to run ETL/analytics jobs, and then output the results in a single enterprise view for reporting, marketing etc. And yes, this is identical to what Twitter's analytics team would be doing.

With cloud tools from Azure, IBM and Amazon, this sort of analytics is going to become much more commonplace. All using SQL the language but not SQL the database.
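To make the "ETL between completely different schemas, then a single enterprise view" idea concrete, here is a minimal sketch. It uses Python's built-in sqlite3 purely as a stand-in SQL engine so it runs anywhere; on a real cluster the same kind of SELECT/UNION statement would be submitted through Spark SQL against data landed in HDFS. The table names, columns and sample rows (`sf_leads`, `erp_customers`, `customer_view`) are all hypothetical and are not taken from the article or the comment.

```python
import sqlite3


def build_enterprise_view(conn):
    """Load two differently shaped sources and expose one unified view.

    Assumption: both tables and their columns are made up for illustration.
    """
    conn.executescript(
        """
        -- Source 1: CRM-style leads, one name column, mixed-case email
        CREATE TABLE sf_leads (lead_id TEXT, full_name TEXT, email TEXT);
        INSERT INTO sf_leads VALUES ('L-1', 'Ada Lovelace', 'ADA@example.com');

        -- Source 2: ERP-style customers, numeric key, split name columns
        CREATE TABLE erp_customers (
            cust_no INTEGER, first TEXT, last TEXT, contact_email TEXT
        );
        INSERT INTO erp_customers VALUES (42, 'Grace', 'Hopper', 'grace@example.com');

        -- The "ETL" step: map both schemas onto one common shape
        CREATE VIEW customer_view AS
            SELECT 'salesforce' AS source,
                   lead_id AS source_id,
                   full_name AS name,
                   lower(email) AS email
            FROM sf_leads
            UNION ALL
            SELECT 'erp',
                   CAST(cust_no AS TEXT),
                   first || ' ' || last,
                   lower(contact_email)
            FROM erp_customers;
        """
    )
    query = "SELECT source, source_id, name, email FROM customer_view ORDER BY source"
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
rows = build_enterprise_view(conn)
for row in rows:
    print(row)
```

The point of the sketch is only the shape of the work: normalise keys, names and emails per source, then union the results into one view that reporting tools can query.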

3 comments

angrybits 3983 days ago

The enterprise I work for won't touch cloud with a 10 foot pole, and I know this because we literally got told to quit asking about it. :)

So yes, even we are building out a pretty beefy internal Hadoop cluster, so I would never say that it will be all-relational-all-the-time. But my point was more that there will be copious amounts of SSAS cubes and Oracle warehouses for the foreseeable future. They work great for their use cases and they have well known problems with well known solutions. Doing what Twitter's team is doing when you aren't Twitter might not be the best idea for everyone, after all.

In our case, we use Teradata for our work and it's quite capable of handling very large workloads, and thus we currently have no plans to spin it down in favor of the new hotness. (Even though the new Hadoop cluster positively dwarfs our TD appliance.) I'd say we have a mixture of both on the horizon, if only because our DBAs are less than cooperative about Java UDFs, so Hadoop is the easiest way for us to do complex processing against our fairly large data set.


ak39 3983 days ago

For EDW, yes. You might see smaller federated data marts or even separately managed relational dbs all over the place. But for OLTP systems for the vast majority of enterprises out there, the vertical single instance big hunk database is still big dawg.


angrybits 3983 days ago

> separately managed relational dbs all over the place

On the BI side, this is overwhelmingly the outcome for large companies. The business units get silo'ed, they build their fiefdoms, a consolidation project gets kicked off and fails, rinse, repeat. Even if the consolidation succeeds, it takes extremely strong leadership to keep it from devolving right back to silos. The tech is not the cause of this problem, so I don't foresee it being the solution to it either.


vyrotek 3983 days ago

How real-time are the analytics with these implementations? When I think ETL, I think daily cron jobs. Have there been advances in this space which would let me instantaneously see a lead created in Salesforce in these new reports?


angrybits 3983 days ago

We have a POC running where we stream web hit data onto HDFS in near real time (several seconds of latency perhaps). There's no reason to think you couldn't do it with other streams of data as well.

edit: Not sure about Salesforce specifically, sorry if this is too far off topic.


jaegerpicker 3981 days ago

The company I work for is building a near-real-time setup (web real time, not real real time; a second or so delayed) using AWS SQS and Redshift with a custom message consumer. If you keep the message consumer as stateless as possible, it's super scalable and reliable.
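A rough sketch of what "as stateless as possible" can look like for a message consumer. This is not the commenter's implementation: an in-memory deque stands in for the SQS queue and a plain list stands in for the Redshift load, so the example runs without AWS. The message fields (`event_id`, `type`, `ts`) and the function names are hypothetical.

```python
import json
from collections import deque


def handle_message(body, sink):
    """Process one message using only its own contents.

    Nothing is remembered between calls, so any number of identical
    workers could run this side by side.
    """
    event = json.loads(body)
    sink.append((event["event_id"], event["type"], event["ts"]))


def drain(queue, sink, batch_size=10):
    """Handle up to batch_size messages, removing each only after success."""
    handled = 0
    while queue and handled < batch_size:
        body = queue[0]              # peek: message stays queued if handling raises
        handle_message(body, sink)
        queue.popleft()              # stand-in for deleting the message after success
        handled += 1
    return handled


# Stand-ins (assumptions): deque for the queue, list for the warehouse table
queue = deque(
    json.dumps({"event_id": i, "type": "page_view", "ts": 1000 + i})
    for i in range(3)
)
warehouse_rows = []
processed = drain(queue, warehouse_rows)
print(processed, warehouse_rows)
```

Because each message carries everything needed to process it, scaling out is a matter of starting more workers, and a crashed worker loses nothing that the queue does not still hold.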
