by dangoldin 3479 days ago
We (adtech) use a very similar approach. We're consuming a ton of data through Kafka and then using Secor to store it on S3 as Parquet files. We then use Spark for both aggregations and ad-hoc analyses.

One thing that sounds very interesting, and worked surprisingly well when I played around with it, is Amazon's Athena (https://aws.amazon.com/athena/). It lets you query Parquet data directly without relying on Spark, which can get expensive quickly. I wouldn't trust it for production use cases just yet, and it ties you further into the AWS ecosystem, but it might be worth exploring as a simple way to run basic queries on top of Parquet data. I suspect it's simply a managed service on top of Apache Drill (https://drill.apache.org/).
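
For a rough idea of what "querying Parquet directly" involves, here's a minimal sketch that builds the Hive-style DDL for an external Parquet table on S3. The database, table, column, and bucket names are all made up for illustration, and the code only produces the statement text; you would still have to submit it through the Athena console or API.

```python
def parquet_table_ddl(database, table, columns, partitions, location):
    """Build a Hive-style CREATE EXTERNAL TABLE statement for Parquet on S3.

    Every name passed in is hypothetical; nothing here talks to AWS.
    """
    cols = ",\n  ".join(f"`{name}` {typ}" for name, typ in columns)
    parts = ", ".join(f"`{name}` {typ}" for name, typ in partitions)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {cols}\n"
        f")\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Hypothetical schema and bucket, for illustration only.
ddl = parquet_table_ddl(
    database="adtech",
    table="impressions",
    columns=[("user_id", "string"), ("bid_price", "double")],
    partitions=[("dt", "string")],
    location="s3://example-bucket/impressions/",
)
print(ddl)
```

Once a table like that exists, ad-hoc queries are plain SQL against it, with the partition column in the WHERE clause to limit how much data gets scanned.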

1 comment

Not Drill, it's on top of Presto. Presto is quite good, but the open source S3 support is definitely second class because FB doesn't use it; hopefully AWS is contributing their connector back. Likewise, FB uses ORC, whereas Parquet is the format with more support outside FB.

Since S3 listing is so awful, and given the huge number of partitions we needed, we had to write a custom connector that was aware of the file structure on S3 instead of relying on the Hive metastore, which has lots of limitations. So I'm a little wary of Athena. CREATE TABLE AS SELECT is amazing too: you write SQL to generate temporary Parquet/ORC files back to S3 to query later. I hope Athena will support this if it doesn't already.
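
To make the "aware of the file structure" part concrete, here's a minimal sketch of the idea in Python. It is not the actual connector, just an illustration: if you know the key layout up front, you can compute the prefixes a query needs from its date range instead of listing the bucket or asking the metastore for every partition. The `dt=YYYY-MM-DD` layout, the bucket name, and the function name are all assumptions.

```python
from datetime import date, timedelta


def partition_prefixes(base, start, end):
    """Return one S3 key prefix per day in [start, end], inclusive.

    Assumes a hypothetical layout of <base>/dt=YYYY-MM-DD/ ; the real
    layout depends on how the writer is configured.
    """
    days = (end - start).days
    return [
        f"{base.rstrip('/')}/dt={(start + timedelta(days=i)).isoformat()}/"
        for i in range(days + 1)
    ]


# Hypothetical bucket and date range, for illustration only.
prefixes = partition_prefixes(
    "s3://example-bucket/events", date(2017, 1, 30), date(2017, 2, 1)
)
for prefix in prefixes:
    print(prefix)
```

Each computed prefix can then be read directly, so the number of list calls scales with the partitions a query touches rather than with the total number of partitions in the table.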