	by joaodlf 3479 days ago
	There is some hope: Apache Arrow is (in my opinion) a step in the right direction. A common in-memory data layer for storage and data analysis systems? Yes please. It's important to start thinking about how all these big data storage and analytics tools can bridge the gaps between them. Hopefully projects like Apache Arrow will help, as long as there is adoption.

2 comments

dgudkov 3478 days ago

A common in-memory columnar data layer would make a lot of sense because a) columnar is generally better for analytics, and b) converting from one columnar format to another can theoretically be done without decompression, because columnar data is typically compressed using standard algorithms (vocabulary/dictionary compression, RLE, etc.). I wrote a few suggestions for such an open-source data layer here: http://bi-review.blogspot.ca/2015/06/the-world-needs-open-so...
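To make point b) concrete, here is a minimal sketch of the two encodings mentioned above, in plain Python. It is an illustration under assumptions, not how Arrow or any particular format implements them: the function names and the sample column are hypothetical. The idea it shows is that once a column is dictionary-encoded, it can be re-encoded as runs over the integer codes while reusing the same dictionary, so the original string values never have to be materialized again.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an integer code into a list of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    index = {}        # value -> code
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, sum(1 for _ in run)) for code, run in groupby(codes)]


def run_length_decode(runs):
    """Expand (code, run_length) pairs back into a flat list of codes."""
    return [code for code, length in runs for _ in range(length)]


# Hypothetical sample column.
column = ["US", "US", "US", "CA", "CA", "US"]

# "Format A": dictionary plus one code per row.
dictionary, codes = dictionary_encode(column)

# "Format B": same dictionary, but run-length-encoded codes.
# The conversion only touches integers, never the original strings.
runs = run_length_encode(codes)
```

Real formats add bit-packing, null handling, and chunking on top of this, so whether a given pair of formats can be converted this cheaply depends on how closely their encodings line up.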

infinite8s 3471 days ago

Have you seen the Apache Arrow project? https://arrow.apache.org/

joaodlf 3478 days ago

Good read!

bhntr3 3479 days ago

I've looked at the code and messed around with Arrow. It seems like a performance optimization that solves a small sliver of the problem. It could help with the Parquet/Thrift version issues they mentioned, but I don't see any guarantee that it won't introduce its own version and compatibility problems. If the initial implementations are as buggy as those described in TFA, it could actually be a lot worse.

In general, I've learned to be skeptical of any new big data solution. Hadoop and Hive are clumsy, but as someone on my team said, "they've found and fixed the tens of thousands of bugs".

It seems to take five years before any significant new solution is stable and reliable enough to be used on large, complex workloads.

That makes me really uncertain about how we get out of this situation. Maybe something like Arrow is a silver bullet that fixes everything with minimal complexity and thus few bugs. But I'm skeptical.
