| HN Mirror



	by monstrado 4425 days ago
	You pay a significant resource penalty when using SerDes, and since performance is one of the Impala team's biggest priorities, we decided to leave SerDe support out for now. A very common workaround is to use Hive to generate Parquet data from your custom data (using SerDes), and then use Impala to query the Parquet data. I disagree with your statement about not taking vendor benchmarks seriously. As the article mentions, we made an effort to make these queries run as efficiently as possible, even going so far as rewriting queries on competing engines to make them run faster. In fact, Databricks' engineers assisted us in making the Shark benchmarks as good as they could possibly get. The benchmark that I linked is very thorough, and even supplies the exact queries and scripts we used to perform the tests so you can run them yourself.
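
	A rough sketch of the Hive-then-Impala workaround described above, for illustration only. The table names and the helper function are hypothetical, and the `STORED AS PARQUET` shorthand assumes a Hive version that supports it (0.13 or later); older versions need the Parquet SerDe and input/output formats spelled out. The helper only builds the statements as strings, so you would still run them through your own Hive and Impala clients.

```python
def parquet_workaround_statements(source_table: str, parquet_table: str) -> dict:
    """Build the statements for the Hive-then-Impala workaround.

    source_table:  hypothetical Hive table backed by a custom SerDe
    parquet_table: hypothetical name for the Parquet copy Impala will query
    """
    # Step 1 (run in Hive): Hive reads the custom data through its SerDe
    # and rewrites it as Parquet. Assumes Hive 0.13+ for this shorthand.
    hive_ctas = (
        f"CREATE TABLE {parquet_table} STORED AS PARQUET "
        f"AS SELECT * FROM {source_table}"
    )

    # Step 2 (run in Impala): make Impala pick up the table created
    # outside of it, then query the Parquet data directly.
    impala_refresh = f"INVALIDATE METADATA {parquet_table}"
    impala_query = f"SELECT COUNT(*) FROM {parquet_table}"

    return {"hive": [hive_ctas], "impala": [impala_refresh, impala_query]}


if __name__ == "__main__":
    stmts = parquet_workaround_statements("raw_events", "events_parquet")
    for engine, statements in stmts.items():
        for statement in statements:
            print(f"[{engine}] {statement};")
```

	The point of the split is that the SerDe cost is paid once, in Hive, at conversion time rather than on every Impala query.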