by gtrubetskoy 3743 days ago
Actually, you might not want to choose a database at all, and instead focus on deciding on the data format, such as Parquet (http://parquet.io) or Avro (https://avro.apache.org/). Many tools, such as Hive, Impala and Spark, support these formats natively.

You will also need to think about the schema, partitioning, compression and other parameters, and those are not trivial decisions.
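
To make the partitioning and compression choices a bit more concrete, here is a rough sketch of a Hive-style `key=value` directory layout. It is only an illustration: it uses gzip-compressed CSV from the Python standard library as a stand-in for Parquet or Avro, and the function names, the `part-0.csv.gz` file name and the sample rows are all made up for the example.

```python
import csv
import gzip
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitioned(rows, root, partition_key):
    """Write rows (dicts) as gzip-compressed CSV, one directory per partition value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_key]].append(row)

    written = []
    for value, group in sorted(groups.items()):
        # Hive-style layout: <root>/<key>=<value>/part-0.csv.gz
        part_dir = Path(root) / f"{partition_key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "part-0.csv.gz"
        # The partition column is encoded in the directory name, so drop it from the file.
        fields = [k for k in group[0] if k != partition_key]
        with gzip.open(path, "wt", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)
        written.append(path)
    return written


def read_partition(root, partition_key, value):
    """Read only the single partition a query needs, skipping all other files."""
    path = Path(root) / f"{partition_key}={value}" / "part-0.csv.gz"
    with gzip.open(path, "rt", newline="") as f:
        return list(csv.DictReader(f))


# Hypothetical sample data, partitioned by date.
events = [
    {"dt": "2015-01-01", "user": "a", "clicks": 3},
    {"dt": "2015-01-01", "user": "b", "clicks": 1},
    {"dt": "2015-01-02", "user": "a", "clicks": 7},
]
root = tempfile.mkdtemp()
paths = write_partitioned(events, root, "dt")
jan2 = read_partition(root, "dt", "2015-01-02")
```

Picking `dt` as the partition key means a query for one day touches one directory; a key with too many distinct values would instead produce lots of tiny files, which is part of why the choice is not trivial.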

1 comment

The data format is important: ORC and Parquet are substantially faster than Text or Sequence files.

But the query engines are far more important in terms of performance. Just spend any time with SparkSQL and then Hive and you'll know what I mean.