by bhntr3 3477 days ago
Seems like a pretty typical set of problems. Dependency conflicts hard. Schema evolution hard. Upgrades hard.

The big data space still feels like an overengineered, fractured, buggy mess to me. I was hoping Spark would simplify the user experience, but it's as much of a clusterf*ck as anything else.

How hard can fast, reliable distributed computation and storage for petabytes of data be? He said ironically.

2 comments

Trivial. Just put everything in BigQuery. Use Looker for visualization. Job done.

Any single other component you may try to add will increase the complexity factorially. Better stick to the basics ;)

IMO one major problem is integration between different projects. Like you said, it's a hard problem, and any solution typically depends on many, many different open source projects because of the scope of the challenges. All of those projects move forward without much coordination between the teams because they're open source. Then we end up in this fun, fun clusterfuck.
There is some hope: Apache Arrow is (in my opinion) a step in the right direction. A common in-memory data layer for storage and data analysis systems? Yes please. It's important to start thinking about how all these big data storage/analytics tools can bridge the gaps between themselves, and hopefully projects like Apache Arrow will help... as long as there is adoption.
A common in-memory columnar data layer would make a lot of sense because a) columnar is generally better for analytics, and b) converting from one columnar format to another can, in principle, be done without decompression, because columnar data is typically compressed using standard algorithms (dictionary encoding, run-length encoding (RLE), etc.). I wrote a few suggestions for such an open-source data layer here: http://bi-review.blogspot.ca/2015/06/the-world-needs-open-so...
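To make point b) concrete, here is a minimal sketch in plain Python, not taken from Arrow or any real format. It assumes a column stored as a dictionary plus run-length-encoded codes, and a hypothetical target format that wants its dictionary sorted. All function names are made up for illustration. The conversion only touches the dictionary and the run list, never the expanded rows.

```python
from itertools import groupby


def dictionary_encode(values):
    """Store each distinct value once; rows become small integer codes."""
    dictionary, index, codes = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, sum(1 for _ in run)) for code, run in groupby(codes)]


def run_length_decode(runs):
    """Expand (code, run_length) pairs back into one code per row."""
    return [code for code, n in runs for _ in range(n)]


def remap_dictionary(dictionary, runs, new_dictionary):
    """Re-target runs to another dictionary's code space.

    Work is proportional to the number of runs, not the number of rows,
    so the column is never decompressed.
    """
    position = {v: i for i, v in enumerate(new_dictionary)}
    return [(position[dictionary[code]], n) for code, n in runs]


column = ["US", "US", "US", "CA", "CA", "US", "DE", "DE"]
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)

# Hypothetical target format: same encoding, but a sorted dictionary.
target_dictionary = sorted(dictionary)
target_runs = remap_dictionary(dictionary, runs, target_dictionary)

# Only for checking: decode the converted column and compare with the input.
restored = [target_dictionary[c] for c in run_length_decode(target_runs)]
```

Real formats differ in bit packing, null handling, and page layout, so an actual converter would be messier than this, but the idea of remapping codes instead of rows is the same.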
Have you seen the Apache Arrow project? https://arrow.apache.org/
Good read!
I've looked at the code and messed around with Arrow. It seems like a performance optimization that solves a small sliver of the problem. It could help with the Parquet/Thrift version issues they mentioned. But I don't see any guarantee it won't introduce its own version and compatibility problems. If the initial implementations are as buggy as those described in TFA, it could actually be a lot worse.

In general, I've learned to be skeptical of any new big data solution. Hadoop and Hive are clumsy but, as someone on my team said, "they've found and fixed the tens of thousands of bugs".

It seems to take five years before any significant new solution is stable and reliable enough to be used on large, complex workloads.

Which makes me really uncertain how we get out of this situation. Maybe something like Arrow is a silver bullet that fixes everything with minimal complexity and thus few bugs. But I'm skeptical.