| HN Mirror


by sgt101 4198 days ago

>Secondly, the author seems to have conflated two different parts of the data science picture. Yes great analysts who do amazing work is important. But it relies on (a) having data available and (b) in the right format. For those of us doing significant volume ingestions it is not trivial to do this. Hadoop is painfully slow and overall data science end to end tooling is slow, fragmented and incomplete. Some of us do need vendors to be bold and coming up with new technologies/approaches.

I think you are doing Hadoop wrong, or confusing current technical reality with "Hadoop". Hadoop is very cheap, and it allows all the data to be in one place. This is huge for large-scale data science, because in the past we had to pull data across networks, fiddle, sample and chuck. The business case for a single enterprise data warehouse was difficult to make (because of the cost), and maintaining one when a CIO with vision did make the case was impossible: it took about 10 minutes for some genius to start running a tactical operational system on it, which was followed (in about 10 minutes more) by a howling call of rage from an MD about why his operational system was locked up due to someone doing stupid queries, which was followed by a lockdown on queries in the warehouse.

If your Hadoop cluster is slow then: 1) move to CDH5 and use Spark, use Impala, upgrade to 40GbE throughout and make sure that you have balance in your architecture; for god's sake do not be telling people Hadoop is slow if you are using AWS. 2) Brew your own cluster with GPUs and the various crazy infrastructures supporting said architecture (good luck). 3) Go talk to an FPGA vendor or a supercomputer vendor and upgun (but you must be rich); Exalytics or Yark might work for you.

>And the point about IBM is just stupid. Did you ever think that maybe Watson DID help them slow their sales losses ? Weird that a data scientist would make predictions based on inadequate data.

Every IBM rep I have met for the last 3 years has told me that Watson will deal with churn and provide better offer management. I have repeatedly tried to get POCs and always, always failed. Then we saw the Watson tools on Bluecloud, and all our suspicions of what Watson is and was were confirmed. Kudos to the Watson team: they spotted that Jeopardy questions can be rewritten as search queries, and that search responses can be rewritten as Jeopardy answers.

BTW, did anyone get far with DeepDive?

1 comments

pwang 4198 days ago

> If your hadoop cluster is slow...

You're right, there is a lot of misinformation and hope about Hadoop out there, and I think there is a lot of value in Hadoop as a cheap data integration archive. But I think the parent poster's point still stands. A Hadoop-based infrastructure currently has a lot of impedance mismatch for full end-to-end advanced analytics that leans on native, non-Java code for a bunch of stats, linear algebra, or graph stuff.

I would love to see a TCO analysis of Hadoop+analytics versus buying a more traditional "supercomputer" stack with InfiniBand or one of the nifty Cray/SGI NUMA systems. Current data warehouse and BI folks are fixated on cost per PB of storage, and Hadoop is very cheap based on that single metric. I suspect that if enough human factors and the accuracy/agility of modeling results are considered, the supercomputer-style stack may be quite cost effective. It's just that the "big iron" vendors are still in the middle of retooling their marketing for the BI/DW/ETL crowd. When they finally figure it out, it's going to be a bloodbath.

For instance, SGI UVs can give me 24TB-64TB of RAM in a single "system". I still have to make sure I do multithreading/multiprocessing well, but the interconnects are lower latency than 40GbE. https://www.sgi.com/products/servers/uv/

HP ProLiants now can fit 48-60 cores and 6TB in a single 4U system: http://www8.hp.com/us/en/products/servers/proliant-servers.h...

Buying a few of these scale-up systems is a LOT cheaper than hundreds of nodes of Hadoop sitting around maxing out I/O while their expensive Xeons have 10% CPU load. Especially given that you can hire anyone out of science/engineering grad school and they can program these scale-up systems, whereas writing a bunch of Java MR jobs for Hadoop is quite foreign to them.

sgt101 4198 days ago

I think that the disruptions are:
- Twill with everything (inc. unikernels) under YARN (or Mesos)
- The Machine (if it's real)
- Datacentre-scale integration (so things like 500 different processors in each U which are powered up by the fabric manager to efficiently meet the workload at hand)

I think any vendor who wants to compete with the Open-Source/commodity world will need to do as well as / better than the above to get anywhere!

Programming MR is all done. I wrote MR in Java from 2008 to 2012; never will I again, as it's RDDs, transformations and actions now, and it's dead easy (MR is too, but the API wasn't)!
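
For anyone who hasn't seen the "RDDs, transformations and actions" style, here is a rough pure-Python sketch of the idea. It is a toy, not Spark: `ToyRDD` and its snake_case methods are hypothetical names made up for illustration (PySpark's real methods are spelled `flatMap` and `reduceByKey`), and it runs in one process with no cluster. The point is only that transformations build up a lazy pipeline and nothing executes until an action such as `collect()` is called.

```python
class ToyRDD:
    """Hypothetical toy stand-in for an RDD: a lazy pipeline over an iterable."""

    def __init__(self, source):
        # source is a zero-argument callable that returns a fresh iterator
        self._source = source

    @classmethod
    def parallelize(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._source()))

    def flat_map(self, fn):
        return ToyRDD(lambda: (y for x in self._source() for y in fn(x)))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._source() if pred(x)))

    def reduce_by_key(self, fn):
        def run():
            acc = {}
            for key, value in self._source():
                acc[key] = fn(acc[key], value) if key in acc else value
            return iter(acc.items())
        return ToyRDD(run)

    # Actions: these are what actually trigger evaluation of the pipeline.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())


# Word count, the classic MR example, as a chain of transformations.
lines = ToyRDD.parallelize(["hadoop is cheap", "spark is fast", "hadoop is slow"])
counts = (
    lines.flat_map(str.split)               # line -> words
         .map(lambda word: (word, 1))       # word -> (word, 1)
         .reduce_by_key(lambda a, b: a + b) # sum the 1s per word
)

# Nothing has run so far; collect() is the action that executes the chain.
result = dict(counts.collect())
```

The same job as a hand-written Java MR program needs a mapper class, a reducer class and a driver, which is roughly the API pain being described above.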