by joaodlf 3479 days ago
There is some hope: Apache Arrow is (in my opinion) a step in the right direction. A common in-memory data layer for storage and data analysis systems? Yes please. It's important to start thinking about how all these big data storage/analytics tools can bridge the gaps between them. Hopefully projects like Apache Arrow will help, as long as there is adoption.
2 comments

A common in-memory columnar data layer would make a lot of sense because a) columnar is generally better for analytics, and b) converting from one columnar format to another can theoretically be done without decompression, because columnar data is typically compressed using standard algorithms (vocabulary/dictionary compression, RLE, etc). I wrote up a few suggestions for such an open-source data layer here: http://bi-review.blogspot.ca/2015/06/the-world-needs-open-so...
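For anyone unfamiliar with the two schemes named above, here is a minimal pure-Python sketch of vocabulary (dictionary) encoding and run-length encoding on a single column. It also shows the idea behind point b): re-expressing the compressed column against a different dictionary order without expanding it back to raw values. The function names, the sample column, and the "sorted dictionary" target layout are all illustrative assumptions, not anything taken from Arrow or another real format.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an integer code into a table of distinct values."""
    dictionary, index, codes = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, sum(1 for _ in run)) for code, run in groupby(codes)]


def run_length_decode(runs):
    """Expand (code, run_length) pairs back into a flat list of codes."""
    return [code for code, n in runs for _ in range(n)]


def dictionary_decode(dictionary, codes):
    """Look each code up in the dictionary to recover the original values."""
    return [dictionary[c] for c in codes]


def remap_dictionary(dictionary, runs):
    """Rewrite the runs against a sorted dictionary without expanding them.

    Hypothetical stand-in for converting between two columnar layouts that
    share the same compression scheme but order their dictionaries differently.
    """
    new_dictionary = sorted(dictionary)
    new_code = {old: new_dictionary.index(v) for old, v in enumerate(dictionary)}
    return new_dictionary, [(new_code[c], n) for c, n in runs]


column = ["CA", "CA", "CA", "US", "US", "CA", "BR", "BR"]
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)

new_dictionary, new_runs = remap_dictionary(dictionary, runs)
roundtrip = dictionary_decode(new_dictionary, run_length_decode(new_runs))
print(dictionary, runs)
print(new_dictionary, new_runs)
print(roundtrip == column)
```

The remapping step only touches the small dictionary and the run list, never the eight original values, which is roughly why a shared columnar layout could make format-to-format conversion cheap.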
Have you seen the Apache Arrow project? https://arrow.apache.org/
Good read!
I've looked at the code and messed around with Arrow. It seems like a performance optimization that solves a small sliver of the problem. It could help with the Parquet/Thrift version issues they mentioned, but I don't see any guarantee that it won't introduce its own version and compatibility problems. If the initial implementations are as buggy as those described in TFA, it could actually be a lot worse.

In general, I've learned to be skeptical of any new big data solution. Hadoop and Hive are clumsy, but as someone on my team said, "they've found and fixed the tens of thousands of bugs".

It seems to take five years before any significant new solution is stable and reliable enough to be used on large, complex workloads.

All of which makes me really uncertain how we get out of this situation. Maybe something like Arrow is a silver bullet that fixes everything with minimal complexity and thus few bugs. But I'm skeptical.