by haohui 3173 days ago
Hello, Haohui here. I'm one of the developers of AthenaX. I'm really glad that the project has attracted so much interest.

Before AthenaX, most of our real-time analytics pipelines ran on Samza, which to some extent can be seen as a predecessor of Kafka Streams.

A migration from Samza to Kafka Streams might seem more natural, and we did take a very close look at Kafka Streams. We decided to move to Flink instead, given that:

(1) Kafka Streams lacks important features like exactly-once delivery and distributed consistent snapshots. They are essential to support use cases that require high fidelity.

(2) SQL is a must-have feature in order to empower our users, many of whom are non-technical, to run large-scale streaming analytics in production. Kafka Streams provides no support for that. Arguably Kafka Streams provides simple APIs, but simple APIs by themselves are insufficient to bring analytics applications to production. You can't ignore continuous integration and deployment, monitoring, etc., especially when many of our users come from non-CS backgrounds.

(3) The Apache Flink community seems more open and committed than the Kafka community, particularly on the SQL side. We have collaborated with Data Artisans, Alibaba and Huawei. All of these parties, as well as us, are committed and equipped with adequate resources to bring SQL to their respective customers.

2 comments

How do you see this compared to Prestodb, which also provides SQL for Kafka topics? Being able to join Kafka data with other tables seems like a useful benefit of Prestodb?
It would reduce data latency, but we have not tried it in production.

Given that Kafka neither provides secondary indexes nor organizes data in a columnar format, Presto essentially has to scan through the Kafka topics to execute the queries, resulting in a lot of disk I/O.

Our Kafka infrastructure handles more than one trillion messages per day and guarantees second-level latency SLAs. Reading aggressively could easily saturate the I/O bandwidth of the nodes and lead to outages. We actually had several incidents in the past when we did backfills, so I'm more conservative on this.
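
For a rough sense of scale, here is a back-of-the-envelope sketch of what one trillion messages per day means as an average per-second rate. It assumes traffic is spread evenly over the day, which real traffic is not, so peaks would be higher; the function name is made up for illustration, and "more than one trillion" makes the result a lower bound.

```python
# Back-of-the-envelope: average message rate implied by a daily volume.
# Assumption: uniform traffic over the day (real peaks are higher).

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate_per_second(messages_per_day: int) -> float:
    """Return the average number of messages per second (hypothetical helper)."""
    return messages_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    rate = average_rate_per_second(10**12)  # one trillion messages per day
    print(f"{rate:,.0f} messages/sec on average")  # 11,574,074 messages/sec on average
```

That is on the order of 11.5 million messages per second before any extra read load from full topic scans.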

Have you used Presto for running SQL queries on Kafka topics? I would be interested to see some experience reports on using this in production.

I have used Presto through the Hive connector and the results were pretty nice.

I'm curious to know what kind of frontend UI you have on top of this platform.
Unfortunately, the UI did not make it into the first version of the open-source release. It's a low priority given that the UI itself is usually heavily tied to business needs.

Internally we have a React-based UI, and we are in the process of cleaning it up and open-sourcing it. Please stay tuned!