by d_burfoot 1261 days ago
Can someone comment on the code quality of Arrow vs other Apache data engineering tools?

I have been burned so many times by amateur-hour software engineering failures from the Apache world that it's very hard for me to ever willingly adopt anything from that brand again. Just put it in gzipped JSON or TSV and hey, if there's a performance penalty, it's better to pay a bit more for cloud compute than to hate your job because of some nonsense dependency issue caused by an org.apache library failing to follow proper versioning guidelines.

2 comments

Arrow the format is pretty good. There are occasional quirks (the null bitmap has 1 = non-null, etc.) but it's no big deal.
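
To make the bitmap quirk concrete, here is a minimal sketch in plain Python (not the Arrow API). It assumes the layout as I understand it from the format: one bit per slot, least-significant bit first within each byte, and a set bit meaning the slot is valid. The helper names pack_validity and is_valid are made up for illustration.

```python
def pack_validity(values):
    """Pack None/non-None flags into an Arrow-style validity bitmap.

    Assumption: LSB-first bit order, and a set bit means "valid" (non-null).
    """
    bitmap = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # 1 = non-null, not "is null"
    return bytes(bitmap)


def is_valid(bitmap, i):
    """Return True if slot i is non-null according to the bitmap."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


values = [10, None, 30, 40, None]
bitmap = pack_validity(values)
nulls = [i for i in range(len(values)) if not is_valid(bitmap, i)]
```

So if you come from a system where the mask marks nulls, you have to invert it when reading or writing these buffers.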

From my experience, Arrow the C++ implementation is pretty solid too, though I don't like it (a matter of taste). I just don't like their "force std::shared_ptr over Array, Table, Schema and basically everything" approach: why not use an intrusive ref count if the object can only be held by shared_ptr anyway? There are also a lot of const std::shared_ptr<Array>& arguments on functions where it's not obvious whether they take ownership. And there's the immutable Array + ArrayBuilder split (versus COW, or switching between mutable uniquely owned and immutable shared, as in ClickHouse and friends), so if you have to fill the data out of order you are forced to buffer it on your side.
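
A rough sketch of what "buffer on your side" looks like, in plain Python. AppendOnlyBuilder is a hypothetical stand-in for a builder that only supports appending and then freezes into an immutable result; it is not the Arrow API, and build_from_unordered is likewise just an illustrative name.

```python
class AppendOnlyBuilder:
    """Hypothetical append-only builder (a stand-in, not the Arrow API)."""

    def __init__(self):
        self._values = []

    def append(self, value):
        # Values can only be added at the end, in final order.
        self._values.append(value)

    def finish(self):
        # Freeze into an immutable result, like an immutable array.
        return tuple(self._values)


def build_from_unordered(pairs, length):
    """Build an immutable array from (index, value) pairs in any order."""
    # The builder cannot write to slot i directly, so the caller has to
    # stage everything in its own mutable buffer first.
    staging = [None] * length
    for index, value in pairs:
        staging[index] = value
    # Only once the order is known can the values be replayed in sequence.
    builder = AppendOnlyBuilder()
    for value in staging:
        builder.append(value)
    return builder.finish()


result = build_from_unordered([(2, "c"), (0, "a"), (1, "b")], 3)
```

The staging list is the extra copy I'm complaining about: with a mutable, uniquely owned array you could write into the slots directly and then freeze it.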

Do note that compute engines (e.g. Velox) may still need to implement their own (Arrow-compatible) array types, as there aren't many fancy encodings in Arrow the format.

Arrow (and the ecosystem around it that I've looked into, namely DataFusion) seems really solid and well-engineered to me.