| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by vlahmot 3565 days ago

I feel like every time something like this comes up people completely skip over the benefit of having as much of your data processing jobs in one ecosystem as possible.

Many of our jobs operate on low TBs and growing but even if the data for a job is super small I'll write it in Hadoop (Spark these days) so that the build, deployment, and schedluing of the job is handled for free by our curent system.

Sure spending more time listing files on S3 at startup than running the job is a waste but far less than the man hours to build and maintain a custom data transformation.

The main benefit of these tools is not the scale or processing speed though. The main benefits are the fault tolerance, failure recovery, elasticity, and the massive ecosystem of aggregations, data types and external integrations provided by the community.

4 comments

aub3bhat 3565 days ago

Having worked at a large company and extensively used their Hadoop Cluster, I could not agree more with you.

The author of the blogpost/article, completely misses the point. The goal with Hadoop is not minimizing the lower bound on time taken to finish the job but rather maximizing disk read throughput while supporting fault tolerance, failure recovery, elasticity, and the massive ecosystem of aggregations, data types and external integrations as you noted. Hadoop has enabled Hive, Presto and Spark.

The author completely forgets that the data needs to transferred in from some network storage and the results need to be written back! For any non-trivial organization ( > 5 users), you cannot expect all of them to SSH into a single machine. It would be an instant nightmare. This article is essentially saying "I can directly write to a file in a local file system faster than to a database cluster", hence the entire DB ecosystem is hyped!

Finally Hadoop is not a monolithic piece of software but an ecosystem of tools and storage engine. E.g. consider Presto, software developers at Facebook realized the exact problem outlined in the blogpost but instead of hacking bash scripts and command line tools, they built Presto. Which essentially performs similar functions on top of HDFS. Because of the way it works Presto is actually faster than "command line" tools suggested in this post.

https://prestodb.io/

geocar 3565 days ago

> you cannot expect all of them to SSH into a single machine

Why not?

I can do exactly this (KDB warning). Building one or two very beefy machines is 1000x faster, and a lot cheaper than a Hadoop setup.

> The author of the blogpost/article, completely … The author completely … [Here's my opinions about what Hadoop really is]

This is a very real data-volume with two realistic solutions, and thinking that every problem is a nail because you're so invested in your hammer is one of the things that causes people to wait for 26 minutes instead of expecting the problem to take 12 seconds.

And it gets worse: Terrabyte-ram machines are accessible to the kinds of companies that have petabytes of data and the business case for it, and managing the infrastructure and network bandwidth is bigger time sink than you think (or are willing to admit).

If I see value in Hadoop, you might think I'm splitting hairs, so let me be clear: I think Hadoop is a cancerous tumour that has led many smart people to do very very stupid things. It's slow, it's expensive, it's difficult to use, and investing in tooling is just throwing good money after bad.

arjie 3565 days ago

> Building one or two very beefy machines is 1000x faster, and a lot cheaper than a Hadoop setup.

You have a few petabytes of data and your working set is 50 TB. You put it on two machines. All your data is now on these SGI UV 3000s or whatever. You now need a bunch of experts because any machine failure is a critical data loss situation, and a throughput cliff situation. Took a fairly low-stakes situation (disk failure, let's say) and transformed it into the Mother of All Failures.

And then you've decided that next year your working set won't be over the max for the particular machine type you've decided on. What will you even do then? Sell this and get the new bigger machine? Hook them both up and run things in a distributed fashion?

And then there's the logistics of it all. You're going to use tooling to submit jobs on this machine, and there's got to be configurable process limits, job queues, easy scheduling and rescheduling, etc. etc.

I mean, I'm firmly in the camp that lots of problems are better solved on the giant 64 TB, 256 core machines, but you're selling an idea that has a lot of drawbacks.

qwertyuiop924 3565 days ago

And people with 64TB, 256 core machines don't have RAID arrays attached to their machine for this exact reason?

If it's "machines" plural, than you can do replication between the two. There's your fallover in case of complete failure.

yid 3563 days ago

> If it's "machines" plural, than you can do replication between the two.

This is the start of a scaling path that winds down Distributed Systems Avenue, and eventually leads to a place called Hadoop.

(Replication and consensus are remarkably difficult problems that Hadoop solves).

qwertyuiop924 3563 days ago

Fair 'nuff, but if you don't distribute compute, and you store the dataset locally on all the systems (not necessarily the results of mutations, just the datasets and the work that's being done), you'll still possibly reap massive perf gains over Hadoop in certain contexts.

faragon 3565 days ago

Both disk and CPU failures are recoverable on expensive hardware.

nickpsecurity 3565 days ago

"You have a few petabytes of data and your working set is 50 TB. You put it on two machines. All your data is now on these SGI UV 3000s or whatever. "

There's usually a combination of apps that work within the memory of the systems plus huge amount of external storage with a clustered filesystem, RAID, etc. Example supercomputer from SGI below since you brought them up that illustrates how they separate compute, storage, management and so on. Management software is available for most clusters to automate or make easy a lot of what you described in later paragraph. They use one. It was mostly a solved problem over a decade ago with sometimes one or two people running supercomputer centers at various universities.

http://www.nas.nasa.gov/hecc/resources/pleiades.html

nl 3564 days ago

Yes, but old-school MPI style supercomputer clusters are closer to Hadoop style clusters than standalone machines for the purpose of this discussion.

Both have mechanisms for doing distributed processing on data that is too big for a single machine.

The original argument was that command line tools etc are sufficient. In both these cases they aren't.

semi-extrinsic 3565 days ago

Well, this is actually covered in the accompanying blogpost (link in comments below), and he makes a salient point:

"At the same time, it is worth understanding which of these features are boons, and which are the tail wagging the dog. We go to EC2 because it is too expensive to meet the hardware requirements of these systems locally, and fault-tolerance is only important because we have involved so many machines."

Implicitly: the features you mention are only fixes introduced to solve problems that were caused by the chosen approach in the first place.

aub3bhat 3565 days ago

"The features you mention are only fixes introduced to solve problems that were caused by the chosen approach in the first place."

The chosen approach is the only choice! There is a reason why smart people at thousands of companies use Hadoop. Fault-tolerance and Multi-user support are not mere externalities of the chosen approach but fundamental to performing data science in any organization.

Before you further comment, I highly highly encourage you to get a "Real world" experience in data science by working at a large or even medium sized company. You will realize that outside of trading engines, "faster" is typically the third or fourth most important concern. For data and computed results to be used across organization, they need to stored centrally, similarly hadoop allows you to centralize not only data but also computations. When you take this into account, it does not matter how "Fast" command line tools are on your own laptop. Since now your speed, is determined by the slowest link, which is data transfer over the network.

semi-extrinsic 3565 days ago

"Gartner, Inc.'s 2015 Hadoop Adoption Study has found that investment remains tentative in the face of sizable challenges around business value and skills.

Despite considerable hype and reported successes for early adopters, 54 percent of survey respondents report no plans to invest at this time, while only 18 percent have plans to invest in Hadoop over the next two years. Furthermore, the early adopters don't appear to be championing for substantial Hadoop adoption over the next 24 months; in fact, there are fewer who plan to begin in the next two years than already have."

So lots of big businesses are doing just fine without Hadoop and have no plans for beginning to use it. This seems very much at odds with your statement that "The chosen approach is the only choice!"

In fact I would hazard a guess that for businesses that aren't primarily driven by internet pages, big data is generally not a good value proposition, simply because their "big data sets" are very diverse, specialised and mainly used by certain non-overlapping subgroups of the company. Take a car manufacturer, for instance. They will have really big data sets coming out of CFD and FEA analysis by the engineers. Then they will have a lot of complex data for assembly line documentation. Other data sets from quality assurance and testing. Then they will have data sets created by the sales people, other data sets created by accountants, etc. In all of these cases they will have bespoke data management and analysis tools, and the engineers won't want to look at the raw data from the sales team, etc.

sgt101 3565 days ago

My experience echos the OP of this thread; having data in one place backed by a compute engine that can be scaled is a huge boon. Enterprise structure, challenges and opportunities change really fast now, we have mergers new businesses, new products and the requirements to create evidence to support the business conversations that these generate is intense. A single data infrastructure cuts the time required to do this kind of thing from weeks to hours - I've had several engagements where the hadoop team has produced answers in an afternoon that were then later "confirmed" from the proprietary datawarehouses days or weeks later after query testing and firewall hell.

For us Hadoop "done right" is the only game in town for this usecase, because it's dirt cheap per TB and has a mass of tooling. It's true that we've underinvested, but mostly because we've been able to get away with it, but we are running 1000's of enterprise jobs a day through it and without it we would sink like a stone.

Or spend £50m.

semi-extrinsic 3565 days ago

Is there anything in your Hadoop that's not "business evidence", financials, user acquisition etc?

My point is that there are many many business decisions driven by analysing non-financial big data sets that physically cannot be done with data crunched out in five hours. These may even require physical testing or new data collection to validate your data analysis.

Like I mentioned, anyone doing proper Engineering (as in, professional liability) will have the same level of confidence in a number coming out of your Hadoop system as they would in a number their colleague Charlie calculated on a napkin at the bar after two beers. Same goes for people in the pharma/biomolecular/chemical industries, oil and gas, mining etc etc.

luckydata 3565 days ago

Yeah, that sounds fun, dozens of undocumented data silos without supervision that some poor bastard will have to troubleshoot as soon as the inevitable showstopper bug crops up.

manigandham 3565 days ago

Most medium and big enterprises have a working set of data around 1-2TB. Enough to fit in memory on a single machine these days.

pi-rat 3565 days ago

> .. cannot expect all of them to SSH into a single machine ..

That's pretty much how the Cray supercomputer worked at my old university. SSH to a single server containing compilers and tooling. Make sure any data you need is on the cluster's SAN. Run a few cli tools via SSH to schedule job, and bam - a few moments later your program is crunching data on several tens of thousands of cores.

qwertyuiop924 3565 days ago

But, as I pointed out in another comment, what about systems like Manta, which make transitioning from this sort of script to a full-on mapreduce cluster trivial?

Mind, I don't know the performance metrics for Manta vs Hadoop, but it's something to consider...

jcbeard 3565 days ago

Totally agree. It'd be relatively trivial to automate converting this script into a distributed application. Haven't checked Manta out, but I will. For ultimate performance though right now you could go for something like OpenMP + MPI which gets you scalability/fault-tolerance. In a few months you'll also be able to use something like RaftLib as a dataflow/stream processing API for distributed computation (almost ready to roll out the distributed back-end). MPI though has decades of research in HPC to make it the most robust distributed compute platform in existence (though not the most easy to use). You think your big data problems are big...nah, supercomputers were doing todays big data back in the late 90's. Just a totally different crowd with slightly different solutions. MPI is hard to use, Spark/Storm is much easier...but much slower.

aub3bhat 3565 days ago

From my experience organizations have adopted, Hive/Presto/Spark on top of Hadoop. Which actually solves a whole bunch of problems that "script" approach would not. With several added benefits. Executing scripts (cat, grep, uniq, sort) do not provide similar, benefits, while they might be faster. A dedicated solution such as Presto by Facebook will provide similar if not even faster results.

https://prestodb.io/

qwertyuiop924 3565 days ago

Ah, so it doesn't solve data storage, and runs SQL queries, which are less capable than UNIX commmands. If your data's stuck inside 15 SQL DBs, than that'd make sense, but a lot of data is just stored in flat files. And you know what's really good at analyzing flat files? Unix commands.

aub3bhat 3565 days ago

Did you even read it? Presto reads directly from HDFS, which is as close to distributed "flat files" as you can get. As far as "SQL being less capable than UNIX commands", you have got to be kidding me. SQL allows type checking, conversion, joins all of which are difficult if not impossible with grep | uniq | sort etc.

qwertyuiop924 3565 days ago

I read it.

>Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.

That doesn't sound like HDFS to me. I mean, I assume it can read from HDFS, but Presto is backend agnostic. You could probably write code to run it on Manta. That would be neat for people who like Presto, I guess.

Type checking and conversions, no, and table joins only matter when you're handling relational data.

Also, how many formats can Presto handle? Unix utilities can handle just about any tabular data, and you can run them against non-tabular data in a pinch (although nobody reccomends it). I doubt Presto is that versitile.

jamiesonbecker 3565 days ago

> For any non-trivial organization ( > 5 users), you cannot expect all of them to SSH into a single machine.

That's exactly the use case we built Userify for (https://userify.com)

ende 3565 days ago

This is a very good point. I think many people are so caught up in bashing the hype train around big data and data science that they just casually dismiss these incredibly valid points. It's not necessarily about how big your data is right now, but how big your data will be in the very near future. Retooling then is often a lot more difficult than just tooling properly up front, so even if some Spark operations might seem to add unnecessary overhead right now the lower transition cost down the road is often worth it.

xorgar831 3565 days ago

I think the point is if you really have big data then it makes sense, but many shops add huge cost and complexity to projects where simpler tools would be more than adequate.

luckydata 3565 days ago

It's the tooling, not the size of the data. Using "big data ecosystem" tools allows you to use all kinds of useful things like Airflow for pipeline processing, Presto to query the data, Spark for enrichment and machine learning etc... all of that without moving the data, which simplifies greatly metadata management which has to be done if you're serious about things like data provenance and quality.

xorgar831 3565 days ago

A SQL db + Tableau are vastly more powerful and mature than those tools, they just can't do "big data", that's all.

luckydata 3563 days ago

the data preparation work involved in doing anything trivial with that setup is mind boggling. I'm going to assume that this very cavalier assumption comes from a place of relatively little knowledge. Do some of this and you'll realize very soon you'd rather shoot yourself than doing complex data prep on constantly evolving schema using schema-on-write tools.

qwertyuiop924 3565 days ago

...and how often do you need to do all that to a dataset?

ende 3565 days ago

That's definitely true too. Being able to accurately assess whether that need will ever exist (or not) early on is invaluable.

elcritch 3565 days ago

These are valid points, and I agree many underestimate the cost of retooling and infrastructure. However, I am working on a team of smart engineers, but shell scripting is new to them, much less learning a full Hadoop / spark setup and associated tools. Luckily, you can often have your cake and eat it too: https://apidocs.joyent.com/manta/example-line-count-by-exten... Super useful system so far, and my goal is to allow our team to learn some basic scripting techniques and then run them on our internal cloud using almost identical tooling. Plus things like simple Python scripts are really easy to teach, and with this infrastructure it can scale quickly!

jacques_chester 3565 days ago

Day 2 problems don't come to mind during days 0 and 1.

shitgoose 3565 days ago

Sometimes they don't come at all.

shitgoose 3565 days ago

That was quote from Austin Powers. And a reference to the fact that Day 2 requirements almost never come.

jcrites 3565 days ago

> [...] the benefit of having as much of your data processing jobs in one ecosystem as possible. [...] The main benefits are the fault tolerance, failure recovery, elasticity, and the massive ecosystem of aggregations, data types and external integrations provided by the community.

Yep! Elasticity is a pretty nice benefit.

Sure, if you're processing a few gigabytes of data, then you could do that with shell scripts on a single machine. However, if you want to build a system that you can "set and forget", that will continue to run over time as data sizes grow arbitrarily, and that -- as you say -- can be fault tolerant, then distributed systems are nice for that purpose. The same job that handles the few gigabytes of data can scale to petabytes if needed. The same techniques that handle gigabytes scale to petabytes.

A job running on a single machine with shell scripts will eventually reach a limit where the data size exceeds what it can handle reasonably. I've seen this happen repeatedly first hand, to the extent that I'd be reluctant to use this approach in production unless I needed something really quick-n-dirty where scaling isn't a concern at all. Another problem with these single-machine solutions is their reliability. If it's for production use, you really want seamless, no-humans-involved failover, which isn't as straightforward to achieve with the single-machine approach unless you deploy somewhat specialized technology (it ends up being something like primary/standby with network attached storage).

Plus, in an environment where you have one job processing GiBs of data, you tend to have more. While any single solo job handling GiBs of data could be done locally, once you have a lot of them, accessed by many different people at a company and under different workflows, the value of distributed data infrastructure starts to make more sense.

Neat article though. Always good to have multiple techniques up your sleeve, to use the right one for the problem at hand.

micro_softy 3565 days ago

What happens if a consultant approaches a company that uses Hadoop and offers them "custom data transformation" solutions for their most frequent processing jobs at lower cost and that beat Hadoop's processing times 100-fold?

The company saves money by choosing the lower cost and time by not having to wait for slow Hadoop processing.

It seems like there is competitive advantage to be gained and money to be made by taking advantage of Hadoop's inefficiencies.

But then things are not always what they seem.

shitgoose 3565 days ago

This assumes that company will be making rational choice, which is rarely the case. The choice is usually made based on "no one got fired by using X".

jupiter90000 3565 days ago

Or/and "my resume will look good because big data and Hadoop and Spark and ..."