by monstrado 4418 days ago

Yes, the HBase scanners in Impala are not very fast, and we know that. This area needs work to maximize parallelism, but right now several items on the Impala roadmap take priority: disk-based aggregations/joins, window functions, nested data, and ORDER BY without LIMIT, to name a few.

As for Parquet, that file format is not designed for streaming. As you mentioned, it's meant for converting large datasets that you plan to run analytics on. Queries against data in Parquet are _fast_, like really fast... I've seen queries go from 200 seconds down to 5 seconds just by converting the dataset from text to Parquet.
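To give a rough feel for why that conversion helps, here is a toy Python sketch. It is not the Parquet format and not anything from Impala; the data, the column names, and both helper functions are made up for illustration. It assumes only that Parquet stores data column by column, so a query that needs one column can skip the rest, whereas delimited text forces a scan of every full line.

```python
# Toy model only: this is NOT the Parquet format or any Impala API.
# All names and data below are hypothetical.
rows = [(i, f"user{i}", i * 1.5) for i in range(1000)]

# Row-oriented text layout: one delimited line per record.
text_lines = [f"{rid},{name},{amount}" for rid, name, amount in rows]

# Column-oriented layout: one contiguous list of values per column.
columns = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}


def sum_amount_from_text(lines):
    """Sum 'amount' by parsing every full line; returns (total, chars_read)."""
    total = 0.0
    chars_read = 0
    for line in lines:
        chars_read += len(line)  # the whole line must be read
        total += float(line.split(",")[2])
    return total, chars_read


def sum_amount_from_columns(cols):
    """Sum 'amount' by touching only that column; returns (total, chars_read)."""
    values = cols["amount"]
    chars_read = sum(len(str(v)) for v in values)  # only this column is read
    return sum(values), chars_read


text_total, text_chars = sum_amount_from_text(text_lines)
col_total, col_chars = sum_amount_from_columns(columns)
print(text_total == col_total, col_chars < text_chars)
```

Both paths compute the same total, but the columnar path touches far fewer characters. Real Parquet adds encoding and compression on top of this, so treat the sketch as intuition rather than a benchmark.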

Concurrency in Impala is actually pretty good and has been a design goal from the beginning. I wouldn't compare Impala to Vertica or other analytical databases just yet, since there's still a lot of room for improvement, but concurrency in Impala is much better than in the other SQL-on-Hadoop engines (Hive, Presto, etc.), and we've demonstrated that in our latest round of benchmarks.

BTW #1 - As I mentioned, HBase support in Impala is pretty minimal at the moment, but it still works fine for ad-hoc queries over small key spaces.

BTW #2 - hehe! I'll let our marketing team know :P

BTW #3 - I'm not sure what you mean here, but I'll ask around to see if someone knows.

Thanks for your kind words! Sorry for the late response, I just saw that I had a response :)