
by semi-extrinsic 3565 days ago

Well, this is actually covered in the accompanying blogpost (link in comments below), and he makes a salient point:

"At the same time, it is worth understanding which of these features are boons, and which are the tail wagging the dog. We go to EC2 because it is too expensive to meet the hardware requirements of these systems locally, and fault-tolerance is only important because we have involved so many machines."

Implicitly: the features you mention are only fixes introduced to solve problems that were caused by the chosen approach in the first place.


aub3bhat 3565 days ago

"The features you mention are only fixes introduced to solve problems that were caused by the chosen approach in the first place."

The chosen approach is the only choice! There is a reason why smart people at thousands of companies use Hadoop. Fault tolerance and multi-user support are not mere externalities of the chosen approach but fundamental to performing data science in any organization.

Before you comment further, I highly encourage you to get some "real world" experience in data science by working at a large or even medium-sized company. You will realize that outside of trading engines, "faster" is typically the third or fourth most important concern. For data and computed results to be used across an organization, they need to be stored centrally; similarly, Hadoop allows you to centralize not only data but also computations. When you take this into account, it does not matter how "fast" command line tools are on your own laptop, since your speed is now determined by the slowest link, which is data transfer over the network.
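To put rough numbers on the "slowest link" argument, here is a minimal back-of-the-envelope sketch. The dataset size, link speed and local processing rate are all hypothetical values chosen for illustration, not measurements from anyone in this thread, and the function names are made up for the example.

```python
def transfer_seconds(size_gb: float, link_gbit_per_s: float) -> float:
    """Seconds to move size_gb gigabytes over a link rated in gigabits per second."""
    return size_gb * 8 / link_gbit_per_s


def end_to_end_seconds(size_gb: float, link_gbit_per_s: float, local_gb_per_s: float) -> float:
    """Fetch the data over the network first, then process it locally."""
    return transfer_seconds(size_gb, link_gbit_per_s) + size_gb / local_gb_per_s


# All three values below are assumptions for illustration only.
size_gb = 100.0        # hypothetical dataset pulled from central storage
link_gbit_per_s = 1.0  # hypothetical office network link
local_gb_per_s = 0.5   # hypothetical throughput of command line tools on a laptop

fetch = transfer_seconds(size_gb, link_gbit_per_s)
total = end_to_end_seconds(size_gb, link_gbit_per_s, local_gb_per_s)
print(f"fetch: {fetch:.0f} s, total: {total:.0f} s")  # fetch: 800 s, total: 1000 s
```

Under these assumed numbers the network fetch accounts for 800 of the 1000 seconds, which is the point being made: once the data lives centrally, speeding up the local tools barely moves the total.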


semi-extrinsic 3565 days ago

"Gartner, Inc.'s 2015 Hadoop Adoption Study has found that investment remains tentative in the face of sizable challenges around business value and skills.

Despite considerable hype and reported successes for early adopters, 54 percent of survey respondents report no plans to invest at this time, while only 18 percent have plans to invest in Hadoop over the next two years. Furthermore, the early adopters don't appear to be championing for substantial Hadoop adoption over the next 24 months; in fact, there are fewer who plan to begin in the next two years than already have."

So lots of big businesses are doing just fine without Hadoop and have no plans for beginning to use it. This seems very much at odds with your statement that "The chosen approach is the only choice!"

In fact I would hazard a guess that for businesses that aren't primarily driven by internet pages, big data is generally not a good value proposition, simply because their "big data sets" are very diverse, specialised and mainly used by certain non-overlapping subgroups of the company. Take a car manufacturer, for instance. They will have really big data sets coming out of CFD and FEA analysis by the engineers. Then they will have a lot of complex data for assembly line documentation. Other data sets from quality assurance and testing. Then they will have data sets created by the sales people, other data sets created by accountants, etc. In all of these cases they will have bespoke data management and analysis tools, and the engineers won't want to look at the raw data from the sales team, etc.


sgt101 3565 days ago

My experience echoes the OP of this thread: having data in one place, backed by a compute engine that can be scaled, is a huge boon. Enterprise structure, challenges and opportunities change really fast now. We have mergers, new businesses and new products, and the pressure to create evidence to support the business conversations that these generate is intense. A single data infrastructure cuts the time required to do this kind of thing from weeks to hours. I've had several engagements where the Hadoop team has produced answers in an afternoon that were then "confirmed" from the proprietary data warehouses days or weeks later, after query testing and firewall hell.

For us, Hadoop "done right" is the only game in town for this use case, because it's dirt cheap per TB and has a mass of tooling. It's true that we've underinvested, mostly because we've been able to get away with it, but we are running thousands of enterprise jobs a day through it, and without it we would sink like a stone.

Or spend £50m.


semi-extrinsic 3565 days ago

Is there anything in your Hadoop that's not "business evidence", financials, user acquisition etc?

My point is that there are many many business decisions driven by analysing non-financial big data sets that physically cannot be done with data crunched out in five hours. These may even require physical testing or new data collection to validate your data analysis.

Like I mentioned, anyone doing proper Engineering (as in, professional liability) will have the same level of confidence in a number coming out of your Hadoop system as they would in a number their colleague Charlie calculated on a napkin at the bar after two beers. Same goes for people in the pharma/biomolecular/chemical industries, oil and gas, mining etc etc.


nl 3564 days ago

What are you talking about?

I personally know people working in mining, oil/gas as well as automotive engineering (which you mentioned previously). All rely on Hadoop. I'm sure I could find you some in the other fields too.

Are you seriously thinking Hadoop isn't used outside web companies or something?

Teradata sells Hadoop now, because people are migrating data warehouses off their older systems. This isn't web stuff; it is everything the business owns.


sgt101 3565 days ago

One of the developments that we're after is radical improvements in data quality and standards of belief (provenance, verification, completeness).

A huge malady that has sometimes affected business is decisions made on the basis of spreadsheets of data that come from unknown sources, are contradicted left, right and centre, and are full of holes.

A single infrastructure helps us do this because we can establish KPIs on the data and control them. Because the data is coming to the centre, rather than a unit providing summaries or updates with delays, we know when data has gone missing and have often been able to do something about it. In the past it was gone, and by the time that was known there was no chance of recovery.

Additionally, we are able to cross-reference data sources and do our own sanity checks. We have found several huge issues by doing this: systems reporting garbage, and systems introducing systematic errors.

I totally agree, if you need to take new readings then you have to wait for the readings to come in before making a decision. This is the same no matter what data infrastructure you are using.

On the other hand there is no reason to view data coming out of Hadoop as any less good than data coming from any other system, apart from the assertion that Hadoop system X is not being well run, which is more of a diagnosis of something that needs fixing than anything else I think.

There are several reasons (outlined above) to believe that a well-run data lake can produce high-quality data. If an engineer ignored (for example) a calculation showing that a bridge was going to fail, because the data that hydrated it came out of my system, and instead waited a couple of days for data to arrive from the stress analysis group, metallurgy group and traffic analysis group, would they be acting professionally?

Having said all that, I do believe that there are issues with running Hadoop data lakes that are not well addressed and stand in the way of delivering value in many domains. Data audit, the ethical challenges of recombination and inference, and the security challenges generated by super-empowered analysts all need to be sorted. Additionally, we are only scratching the surface of processes and approaches to managing data quality and detecting data issues.


luckydata 3565 days ago

Yeah, that sounds fun, dozens of undocumented data silos without supervision that some poor bastard will have to troubleshoot as soon as the inevitable showstopper bug crops up.


manigandham 3565 days ago

Most medium and big enterprises have a working set of data of around 1-2 TB, which is enough to fit in memory on a single machine these days.
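As a quick sanity check on the sizing claim, a minimal sketch follows. The RAM capacities and the 25% headroom reserved for the OS and intermediate results are assumptions made up for this example, as is the helper name `fits_in_memory`.

```python
def fits_in_memory(working_set_tb: float, ram_tb: float, headroom: float = 0.25) -> bool:
    """True if the working set fits in RAM after reserving a headroom fraction.

    headroom is an assumed fraction of RAM kept free for the OS and
    intermediate results; 0.25 is an illustrative default, not a rule.
    """
    usable_tb = ram_tb * (1.0 - headroom)
    return working_set_tb <= usable_tb


# Hypothetical machines: a 2 TB working set against 4 TB and 2 TB of RAM.
print(fits_in_memory(2.0, 4.0))  # True: 3 TB usable
print(fits_in_memory(2.0, 2.0))  # False: only 1.5 TB usable
```

So the claim holds only if the machine has some margin above the raw working-set size; how much margin is needed depends on the workload.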
