Hacker News
by scaleout1 3172 days ago
Interesting project, although I can't say I am happy to see SQL being used in streaming systems like this. In my last two jobs I had to write frameworks and tools to enable "data scientists" and "analysts" to write production jobs, and the problem I have run into with exposing SQL to this class of user is that every job ends up being its own special snowflake, with deeply nested SQL and custom UDFs mixed in for good measure. The "unique" nature of each job significantly increases the support and maintenance cost. I have come to the conclusion that a typesafe API with map/filter/flatmap is a much better API to expose than stringly typed SQL. I am curious to know whether Uber is running into similar support issues?
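
A minimal sketch of the contrast being drawn here, under some loud assumptions: the `Trip` record and its fields are hypothetical, Python's built-in `sqlite3` stands in for a streaming SQL engine, and the "type safety" comes only from an external static checker such as mypy, not from the Python runtime. The typed version uses a comprehension, which is Python's idiom for filter followed by map.

```python
import sqlite3
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Trip:
    """Hypothetical record type, for illustration only."""
    city: str
    fare: float


TRIPS = [Trip("SF", 12.5), Trip("NYC", 30.0), Trip("SF", 7.0)]


def high_fares_typed(trips: Iterable[Trip], threshold: float) -> List[float]:
    # Filter, then map, over typed records. A misspelled field such as
    # t.fair can be flagged by a static checker before the job ever runs.
    return [t.fare for t in trips if t.fare > threshold]


def make_conn() -> sqlite3.Connection:
    # In-memory table holding the same data, used as a stand-in SQL engine.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
    conn.executemany(
        "INSERT INTO trips VALUES (?, ?)", [(t.city, t.fare) for t in TRIPS]
    )
    return conn


def high_fares_sql(conn: sqlite3.Connection, query: str) -> List[float]:
    # The query is an opaque string: a misspelled column name surfaces
    # only when the statement is executed.
    return [row[0] for row in conn.execute(query)]
```

With this setup, `high_fares_sql(conn, "SELECT fair FROM trips")` raises `sqlite3.OperationalError` at execution time, whereas the equivalent typo in `high_fares_typed` is visible to tooling beforehand.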
3 comments

Our experience is that AthenaX actually lowers our support costs:

(1) There was a significant consultation load when users had to implement their own jobs in Java / Scala and run them in production. Sometimes it turned into co-development because the users lacked expertise in the streaming analytics frameworks.

(2) We consciously encourage our users to write good SQL by: (a) enforcing schemas on all analytical Kafka topics; (b) setting up a team dedicated to helping them use SQL in big data systems (e.g., Hive, Presto, AthenaX).

For UDFs we provide general guidance and ask our users to be on call for the jobs that use UDFs. The support costs are definitely not zero, but this is still much better than teaching users to write a Samza / Flink / Storm job from scratch.
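
As a rough illustration of what point (2a), enforcing schemas on topics, can look like in its simplest form, here is a sketch that rejects records not matching a declared schema. Everything in it is an assumption for the example: the `TRIP_SCHEMA` fields and the `validate` helper are hypothetical, and a real deployment would more likely rely on a schema registry and a serialization format than on hand-written checks.

```python
from typing import Any, Dict

# Hypothetical schema for one topic: field name -> required Python type.
TRIP_SCHEMA: Dict[str, type] = {"city": str, "fare": float}


def validate(record: Dict[str, Any], schema: Dict[str, type]) -> None:
    """Raise if the record does not match the schema; return None otherwise."""
    missing = schema.keys() - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    extra = record.keys() - schema.keys()
    if extra:
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    for name, expected in schema.items():
        if not isinstance(record[name], expected):
            raise TypeError(
                f"field {name!r} should be {expected.__name__}, "
                f"got {type(record[name]).__name__}"
            )
```

Rejecting malformed records at the producer side means every SQL query downstream can assume the same columns and types, which is part of what keeps ad hoc queries supportable.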

My experience from teaching some graduates in a BI shop: SQL is more common, and tools that support SQL tend to be used better.

I've "taught" them how to use Spark, but with a team of varying prior experience, the Scala API meant them learning Scala. The Python one was a bit better, but they did much better with the SQL DSL.

Regarding your concern about maintainability: UDFs tend to be the problem. I'm also curious to know about their support issues, and also: can anyone write their own UDF (the code requires registering a .jar), or is there a team that helps business users in that regard?

For UDFs we support both: users can write their own UDFs in their queries, and we also provide a number of well-tested UDFs.
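
For readers unfamiliar with the mechanics, the sketch below shows the general shape of registering a user-defined function and calling it from SQL. It is not AthenaX or Flink code: it uses Python's built-in `sqlite3` module as a stand-in engine, and the `fare_bucket` function and `trips` table are hypothetical names invented for the example.

```python
import sqlite3
from typing import Iterable, List, Tuple


def fare_bucket(fare: float) -> str:
    # Hypothetical UDF. It is plain Python, so it can be unit-tested
    # on its own before it is ever used inside a query.
    if fare < 10:
        return "low"
    if fare < 25:
        return "mid"
    return "high"


def count_by_bucket(fares: Iterable[float]) -> List[Tuple[str, int]]:
    conn = sqlite3.connect(":memory:")
    # Register the Python function under a SQL-visible name (one argument).
    conn.create_function("fare_bucket", 1, fare_bucket)
    conn.execute("CREATE TABLE trips (fare REAL)")
    conn.executemany("INSERT INTO trips VALUES (?)", [(f,) for f in fares])
    rows = conn.execute(
        "SELECT fare_bucket(fare) AS bucket, COUNT(*) "
        "FROM trips GROUP BY bucket ORDER BY bucket"
    )
    return rows.fetchall()
```

Keeping the UDF a small, separately testable function is one way to get the "well-tested UDFs" property mentioned above while still letting users call it from their own queries.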
Is this a problem with SQL or your users?