by dangoldin
3479 days ago
We (adtech) use a very similar approach. We consume a ton of data through Kafka and use Secor to store it on S3 as Parquet files. We then use Spark for both aggregations and ad-hoc analyses. One thing that sounds very interesting, and worked surprisingly well when I played around with it, is Amazon's Athena (https://aws.amazon.com/athena/), which lets you query Parquet data directly without relying on Spark, which can get expensive quickly. I wouldn't trust it for production use cases just yet, and it ties you further into the AWS ecosystem, but it might be worth exploring as a simple way to run basic queries on top of Parquet data. I suspect it's simply a managed service on top of Apache Drill (https://drill.apache.org/).
|
Since S3 listing is so awful, and given the huge number of partitions we needed, we had to write a custom connector that was aware of the file structure on S3 instead of relying on the Hive metastore, which has lots of limitations, so I'm a little wary of Athena. Create table as select is amazing too: you write SQL to generate temporary Parquet/ORC files back to S3 to query later. I hope Athena will support this if it doesn't already.
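
To make the "aware of the file structure" idea concrete, here is a rough Python sketch of how a reader could derive partition prefixes from a known layout instead of listing the bucket. This is not the connector described above. The bucket name, the `dt=YYYY-MM-DD/hour=HH` layout, and the `partition_prefixes` helper are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterator


def partition_prefixes(base: str, start: datetime, end: datetime) -> Iterator[str]:
    """Yield one S3 prefix per hour in [start, end), derived from the layout alone.

    Assumes a hypothetical layout of <base>/dt=YYYY-MM-DD/hour=HH/.
    No S3 calls are made; the prefixes are computed, not discovered.
    """
    # Truncate to the hour so a partial first hour still maps to its partition.
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        yield f"{base.rstrip('/')}/dt={current:%Y-%m-%d}/hour={current:%H}/"
        current += timedelta(hours=1)


# Hypothetical bucket and time range, for illustration only.
prefixes = list(partition_prefixes(
    "s3://example-bucket/events",
    datetime(2017, 1, 1, 22, 30),
    datetime(2017, 1, 2, 1, 0),
))
```

A query engine could then read only those prefixes, which avoids walking a large partition tree through S3 list calls or a metastore lookup. The trade-off is that the layout is baked into the reader, so any change to the directory scheme needs a code change.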