Hacker News
by edw 3410 days ago
Back in '10, I needed a three- or four-node Hadoop cluster just to match the performance I was getting from a spare Mac mini in development mode when I was doing a lot of work in Cascalog, which is based on Cascading.

Most problems are not Big Data problems. The size a problem must be before it qualifies as a Big-Data problem grows larger every day with the availability of machines with ever-more cores and memory. `Sed`, `awk`, `grep`, `sort`, `join`, and so forth are some of the least appreciated tools in the Unix toolbox.
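To illustrate the kind of single-machine processing those tools do, here is a minimal Python sketch of a streaming group-and-count, roughly what `sort | uniq -c` would produce. The function name `count_first_field` and the sample lines are made up for illustration; nothing here comes from the original comment.

```python
from collections import Counter
from typing import Iterable


def count_first_field(lines: Iterable[str]) -> Counter:
    """Count occurrences of the first whitespace-separated field, one line at a time."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


# Hypothetical sample input; `lines` could just as well be an open file handle,
# in which case memory use depends on the number of distinct keys, not file size.
sample = ["GET /index.html", "POST /login", "GET /about", "", "GET /index.html"]
top = count_first_field(sample).most_common()
print(top)
```

Because the input is consumed line by line, the same function works unchanged on a file far larger than RAM, as long as the set of distinct keys stays small.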

People want to think they have Big Data problems but they probably just have plain old normal-data problems. I have had to unwind the ridiculous, heavy-weight, Big Data solutions to normal-data problems that "kids today" love.

If you don't work for Netflix or Google or Facebook or insert maybe a hundred other companies here, you probably do not have a Big Data problem.

8 comments

I'm mostly convinced that most companies interested in Big Data stuff are not as interested in the scale of the problem as in creating "data lakes" to unite the thousands of different forms of data that exist in their organization under a federated, centralized database of some sort. But what most of us with experience in either enterprise companies or machine learning know is that data quality is the primary problem, and almost nobody can actually solve it without brute-force human eyeballs, which simply won't scale with the amount of data pouring in. So now there's interest in machine learning, primarily as a way to do that work instead of people.
At the risk of sounding cynical, there are also companies out there that want to _appear_ to be interested in all the things you just mentioned, so they'll hire a few people to do their [wave hands] data science, machine learning "thing" and those few people then go down the rabbit hole, untethered from reality. The C-suite people will then have their message they can deliver externally—and internally—about their commitment to [wave hands again] all that stuff. The tragedy when this happens is that sometimes that small team of people is not aware what their function at the company is—being the human props for the sales and marketing teams. The alternative, I suppose, is far more depressing—that the "data scientists" are fully aware of their role in the grand scheme of things.
Woah, that was exactly my previous position.

Upside: I had one of the highest-paying positions among the technical people. I also got to play with expensive stuff.

Downside: it was soul crushing, I was delivering no value whatsoever and had a really hard time looking my colleagues in the eye, as they were making a third of my salary (at best).

I got out, joined a new company with that in mind and now have a very exciting job. They do have a dedicated R&D, Data Science team which is a shit show: absolutely brilliant people completely wasted as they can't build anything for lack of programming/technology experience, in an environment where their theoretical skills are mostly useless. I'm genuinely sad for them.

[edit: one of the companies still has a Hadoop/Spark cluster for their whopping 500 MB of data]

You really just described me. I have to do such a showcase project for minimum pay to complete my MSc thesis, and produce all the proofs so that the C-suite guys can use them in their presentations to sell what I made for lots of money. (I don't even know whether they'll charge their customers in the hundred-thousands or the millions range.)

However, I'm really glad to find out about SnappyData.io, that's gonna save me a lot of time waiting. It would truly be my perfect dream if they allowed running any programming language inside an environment like Jupyter.org or BeakerNotebook.com, but with Pandoc.org Markdown. That way I could essentially work full time programming, while also documenting it and exporting my documentation to a good-looking LaTeX thesis.

SnappyData has a Zeppelin interpreter and the code is open source. So if an interpreter for Jupyter is something that can be added easily, I am sure someone from the community would find it an interesting project to undertake. Agree that it would be useful.
This is so true. For what it's worth, there are probably a lot of people who would give anything to be a human prop earning a salary significantly above $100k (even far outside of SF). For someone who's actually interested in data science, this would be miserable.
This is excruciatingly accurate.
+1
>If you don't work for Netflix or Google or Facebook or insert maybe a hundred other companies here, you probably do not have a Big Data problem.

I disagree here. I have worked on multiple scenarios that warranted big data solutions, and such solutions were not feasible before Apache Spark and the like were available. Even our current startup (www.aihello.com) has 8.7 million products, and calculating LDA + cosine similarity reaches trillions of matrices, which is simply not feasible with traditional tools.
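For a rough sense of that scale, the number of distinct product pairs in an all-pairs similarity can be worked out directly. This sketch assumes every product is compared with every other product exactly once, which the comment does not actually state; the function name is illustrative.

```python
def unordered_pairs(n: int) -> int:
    """Number of distinct unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


products = 8_700_000  # figure quoted in the comment
pairs = unordered_pairs(products)
print(f"{pairs:,} pairs (~{pairs / 1e12:.1f} trillion)")
```

Under that assumption the count lands in the tens of trillions, which is consistent with the "trillions" figure above. Whether a brute-force all-pairs pass is really required (rather than some approximate nearest-neighbour scheme) is a separate question the comment leaves open.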

Telstra/Sensis, the telecom company in Australia that I consulted for, went from month-delayed reporting to near-real-time reporting thanks to Apache Spark.

Also keep in mind that the scale of data is growing exponentially for all of us, since storage is getting cheaper and big data analysis is proving to be a game changer in many scenarios.

Being able to do things like churn prediction and net promoter score in real time was one of the motivations for creating SnappyData. You get the ability to mutate data (think KPI maintenance in memory without having to jump across products) and do joins etc. on streams, which makes things a lot simpler.
Amen. Also said as: too big for Excel is not big data. See also https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
Amen right back at ya! (I love the O'Reilly book cover.) I highly recommend people read your blog post. And there's also this classic:

https://aadrake.com/command-line-tools-can-be-235x-faster-th...

I'll also take this opportunity to plug Make and Drake for manipulating data in a replicable way:

https://bost.ocks.org/mike/make/

https://github.com/Factual/drake

If you're processing data using tools that cannot trace their ancestry directly to some time before 1985, you're probably wasting your own and your colleagues' time.

Just for clarification, I'm not the original blogger. +10 for the other link and for using Make! I don't know Drake, however.
The picture has now gotten a little fuzzier, as this blog post conflates MapReduce and YARN and calls them both Hadoop. The Scala pseudocode is just about exactly what you'd use with Spark, which runs on YARN.
I think his point is that bloated, over-engineered Big Data systems—whether batch or streaming—are overkill for the vast majority of problems.
There are just many points that don't really apply to stuff like Spark or Tez that runs on YARN:

ex: Hadoop << SQL, Python Scripts

I completely agree with

MapReduce << SQL, Python Scripts

I do a lot of my processing in Spark SQL and through RDD transformations, as opposed to MapReduce's limiting, slow KV-style processing.

Thanks for the link. The replies in it were hilariously obvious.
There used to be a little web site where you'd fill out a little form that asked you how much data you had, and it would provide a list of commercially available hardware that could be bought or configured to handle it on a single machine. I think it would even give you a link to someplace you could order that piece of hardware.

This reminds me of the time, way back when, that a coworker told me about how our customer was filling a rack with a terabyte of hard drives. My eyes bulged a little bit to think of it. Now I chuckle to think that the laptop I had two laptops ago had a terabyte drive in it.

https://twitter.com/garybernhardt/status/600783770925420546

> Consulting service: you bring your big data problems to me, I say "your data set fits in RAM", you pay me $10,000 for saving you $500,000.

Considering that a plain old 4U server (https://www.supermicro.com/products/system/4U/8048/SYS-8048B...), not some fancy, super-expensive NUMA machine, can take up to 12TB of memory, this quip and the parent comment have real merit.

6TB is not even horrible at 57,504 dollars (https://memory.net/product/s26361-f3843-e618-fujitsu-1x-64gb...). That's about 48 engineering days if your engineer-related expenses are 150 dollars an hour (and they are likely more).
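The arithmetic behind that comparison can be checked in a few lines. This is a sketch under stated assumptions: an eight-hour engineering day and 1 TB = 1024 GB are my assumptions, while the price, the 64 GB module size, and the hourly rate are the figures quoted above.

```python
ram_cost_usd = 57_504      # quoted price for 6 TB of RAM
module_gb = 64             # module size named in the linked product URL
engineer_rate_usd = 150    # quoted hourly cost of an engineer
hours_per_day = 8          # assumption: an eight-hour engineering day

modules = 6 * 1024 // module_gb               # 64 GB modules needed for 6 TB
price_per_module = ram_cost_usd / modules     # implied price of one module
engineer_days = ram_cost_usd / engineer_rate_usd / hours_per_day

print(modules, price_per_module, round(engineer_days, 1))
```

The result rounds to the "about 48 engineering days" given in the comment.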

Note: https://www.sgi.com/products/servers/uv/uv_300_30ex.html

> SGI UV 300 now scales up to 64 CPU sockets and 64TB of cache-coherent shared memory in a single system.

This is the current limit of Linux hardware memory support so going above it is tricky. But still, 64TB.

One nice thing about Hadoop is you get free distributed apps. On my current project, we only read about 50 TB of data per run across 70ish fraud models. Some read 1.5 TB, others only 20 GB. On a single system, that kind of data reading would require some smart I/O partitioning across the various models (several read the 1.5 TB data, all read the same 20 GB [after 20 GB you're looking at history that expands to the 1.5 TB]). With Hadoop, even just MapReduce and Cascading, you can spin all of that work out to multiple computers. Since they have the data copied over multiple drives on those multiple computers, the I/O and general scheduling are handled for us. In the end, it makes everything simpler. If something fails due to network hiccups or disk failures, Hadoop moves the job and starts it again.
I think this is only half the story.

There are use cases other than mere size that can necessitate "big data" solutions, e.g. timeliness, resiliency, maintainability...

If you are building production data processing systems that have constraints on data size, latency, resiliency, scheduling, dependency management, etc., you might be better off with a "big data" system. Even if the data could all fit on a beefy box. This was a painful lesson for me to learn.

Hm. Size is mere size. Latency will never improve with a "big data" solution over one machine with in-RAM data. Dependency management? You're going to declare it once and impose it everywhere anyway. Scheduling? Again, one machine with in-RAM data will always win.

That leaves resiliency and etc. I can't answer etc., but how is resilience helped with a big data solution? That seems like Lamport's distributed system: more machines, but you need k-of-n, k>1. Better to just mirror to two machines with the data in RAM.

Latency does improve if you download your data in parallel across a cluster. Or if you're running many iterations of an algorithm over GBs of data thousands of times, and each iteration is independent--you can save hours or days by performing them in parallel on a cluster.

If your scheduling involves running jobs that must wait on dependencies or events for a long time (hours, days), a hardware failure or some other anomaly can be catastrophic, whereas a "big data" framework can recover without your even knowing about it.

At the end of the day it just comes down to use cases. There are a LOT of other use cases that "big data" platforms address other than being able to fit data in RAM. Sometimes flying by the seat of your pants on one host doesn't cut it for business-critical processing.

> how is resilience helped with a big data solution?

The "R" in Spark's RDD abstraction is for "Resilient". Node failures and replication failures can be recovered without you even knowing it.
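As a rough illustration of what that resilience means, the toy sketch below mimics lineage-based recovery in plain Python: each partition remembers how it was derived, so a lost partition can be recomputed from its source rather than restored by hand. This is not Spark's API; `ToyPartition`, `get`, and `lose` are hypothetical names, and everything runs in a single process.

```python
from typing import Callable, List, Optional


class ToyPartition:
    """A partition that keeps its lineage: source data plus the transformation."""

    def __init__(self, source: List[int], transform: Callable[[int], int]):
        self._source = source          # stands in for durable input (e.g. a file block)
        self._transform = transform    # lineage: how to rebuild this partition
        self._cached: Optional[List[int]] = None
        self.recomputations = 0

    def get(self) -> List[int]:
        # Recompute from lineage whenever the materialized result is missing.
        if self._cached is None:
            self._cached = [self._transform(x) for x in self._source]
            self.recomputations += 1
        return self._cached

    def lose(self) -> None:
        # Simulate the node holding the materialized result failing.
        self._cached = None


partitions = [ToyPartition(chunk, lambda x: x * x) for chunk in ([1, 2], [3, 4])]
before = [p.get() for p in partitions]
partitions[1].lose()                     # simulated node failure
after = [p.get() for p in partitions]    # the lost partition is rebuilt on demand
print(before == after, partitions[1].recomputations)
```

The caller of `get` never sees the failure; only the recomputation counter reveals that one partition was rebuilt. A real framework adds scheduling, data placement, and replication of the source data on top of this idea.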

Sure, you can write all this stuff from scratch every time you encounter them (mirror data on hosts, run embarrassingly-parallel algorithms across a fleet of hosts, write your own DB-backed scheduling system, etc.), but all these are solved problems in these big data frameworks. You'll be wasting tons of time reinventing the wheel. I've been there.

Is it even possible to get 10TB of RAM on a single commodity server?
The Super Micro SuperServer 8048B-TR4FT lists that it supports up to 12TB DDR4 ECC RAM (which could have 4xE7-8890v4 for 96 cores / 192 threads). Close to a commodity server, but probably doesn't quite count. Taking a wild guess on the price - $250k-$350k?

The SuperServer 7088B-TR4FT lists that it supports 24TB DDR4 ECC RAM (with 8xE7-8890v4 for 192 cores / 384 threads).

This opinion is stated frequently. What size of data is big data, in your opinion?
My rule: If it can fit on a single hard drive, it's not big data.
There are WAY more than 100 companies that meet that requirement.