| HN Mirror


by twotwotwo 1954 days ago

I want to second that plug for Athena for ad-hoc analysis. (If you're hosting your own stuff and at the scale where it'd be useful, there's Presto/Hive, which Athena is based on, and/or Trino, the Presto fork maintained by some of its initial developers.)

It was useful for me when tweaking spam/bot detection rules a while ago; if I could roughly describe a rule in a query, I could back-test it on old traffic and follow up on questionable-looking results (e.g. what other requests did this IP make around the time of the suspicious ones?). We also used Athena on a project looking into performance, and on network flow logs. The lack of recurring charges for an always-on cluster makes it great for occasional use like that.
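To make that back-test-then-follow-up loop concrete, here is a minimal sketch. It uses Python's built-in sqlite3 as a local stand-in for Athena so it runs anywhere; the `requests` table, its columns, the sample rows, and the "too many /login hits" rule are all hypothetical, not taken from any real setup.

```python
import sqlite3

# Hypothetical schema and data; sqlite3 stands in for Athena so this runs locally.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (ts INTEGER, client_ip TEXT, path TEXT)")
rows = [
    (1000, "10.0.0.1", "/login"),
    (1001, "10.0.0.1", "/login"),
    (1002, "10.0.0.1", "/login"),
    (1003, "10.0.0.1", "/admin"),
    (1001, "10.0.0.2", "/login"),
    (1500, "10.0.0.2", "/home"),
    (9000, "10.0.0.1", "/home"),
]
conn.executemany("INSERT INTO requests VALUES (?, ?, ?)", rows)


def backtest_rule(conn, start_ts, end_ts, max_logins):
    """Candidate rule: flag IPs with more than max_logins /login hits in the range."""
    cur = conn.execute(
        """
        SELECT client_ip, COUNT(*) AS hits, MIN(ts) AS first_ts, MAX(ts) AS last_ts
        FROM requests
        WHERE ts BETWEEN ? AND ? AND path = '/login'
        GROUP BY client_ip
        HAVING COUNT(*) > ?
        ORDER BY client_ip
        """,
        (start_ts, end_ts, max_logins),
    )
    return cur.fetchall()


def nearby_requests(conn, client_ip, first_ts, last_ts, window=60):
    """Follow-up: what else did this IP request around the suspicious hits?"""
    cur = conn.execute(
        """
        SELECT ts, path FROM requests
        WHERE client_ip = ? AND ts BETWEEN ? AND ? AND path <> '/login'
        ORDER BY ts
        """,
        (client_ip, first_ts - window, last_ts + window),
    )
    return cur.fetchall()


flagged = backtest_rule(conn, 0, 2000, max_logins=2)
for ip, hits, first_ts, last_ts in flagged:
    print(ip, hits, nearby_requests(conn, ip, first_ts, last_ts))
```

Tightening or loosening `max_logins` and re-running against the same old traffic is the whole trick: each tweak to the rule is just another query.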

You can use what the docs call "partition projection" to efficiently limit the date range of logs to look at (https://docs.aws.amazon.com/athena/latest/ug/partition-proje...), so it was free-ish to experiment with a query on the last couple days of data before looking further back.
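For a rough idea of what that looks like, here is a sketch that assembles the table properties for a date-typed projected partition and a filter for the last couple of days. The property names follow the Athena partition projection docs as I understand them; the bucket, prefix, column name `day`, and start date are made-up placeholders, so check the linked docs before relying on any of it.

```python
from datetime import date, timedelta


def projection_properties(column, start, location_template, date_format="yyyy/MM/dd"):
    """Table properties for one date-typed projected partition column."""
    return {
        "projection.enabled": "true",
        f"projection.{column}.type": "date",
        f"projection.{column}.range": f"{start},NOW",
        f"projection.{column}.format": date_format,
        f"projection.{column}.interval": "1",
        f"projection.{column}.interval.unit": "DAYS",
        "storage.location.template": location_template,
    }


def tblproperties_clause(props):
    """Render the properties as the TBLPROPERTIES clause of a CREATE TABLE."""
    body = ",\n".join(f"  '{key}' = '{value}'" for key, value in props.items())
    return f"TBLPROPERTIES (\n{body}\n)"


def recent_days_filter(column, today, days):
    """WHERE-clause fragment limiting a scan to the last `days` days.

    Plain string comparison works because yyyy/MM/dd sorts chronologically.
    """
    cutoff = (today - timedelta(days=days)).strftime("%Y/%m/%d")
    return f"{column} >= '{cutoff}'"


# Hypothetical bucket and prefix.
props = projection_properties("day", "2020/01/01", "s3://example-bucket/alb-logs/${day}/")
print(tblproperties_clause(props))
print(recent_days_filter("day", date(2021, 3, 5), 2))
```

With a filter like that on the partition column, only the matching days' prefixes should get scanned, which is what keeps short experiments cheap.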

More generally, Athena/Presto/Hive support various data sources and formats (including applying regexps to text). Compressed plain-text formats like ALB logs can already be surprisingly cheap to store/scan. If you're producing/exporting data, it's worth looking into how these tools "like" to receive it--you may be able to use a more compact columnar format (Parquet or ORC) or take advantage of partitioning/bucketing (https://docs.aws.amazon.com/athena/latest/ug/partitions.html, https://trino.io/blog/2019/05/29/improved-hive-bucketing.htm...) for more efficient querying later.
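As one small example of laying data out the way these tools "like" it, here is a sketch that sorts records into Hive-style `key=value` partition prefixes before they get written out. The `logs` base prefix and the `ts` field are hypothetical, and actually writing Parquet or ORC files under each prefix is left out.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_prefix(ts, base="logs"):
    """Hive-style key=value prefix for a UTC timestamp given in epoch seconds."""
    d = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"{base}/year={d:%Y}/month={d:%m}/day={d:%d}/"


def group_by_partition(records):
    """Group records by the partition prefix their files should live under."""
    groups = defaultdict(list)
    for rec in records:
        groups[partition_prefix(rec["ts"])].append(rec)
    return dict(groups)


# Hypothetical records: two on 2021-03-05 (UTC), one on 2021-03-06.
records = [
    {"ts": 1614902400, "path": "/a"},
    {"ts": 1614902400 + 3600, "path": "/b"},
    {"ts": 1614902400 + 86400, "path": "/c"},
]
groups = group_by_partition(records)
for prefix in sorted(groups):
    print(prefix, len(groups[prefix]))
```

A query that filters on `year`, `month`, and `day` can then skip every prefix outside the requested range instead of scanning it.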

As the blog post notes, usability was...imperfect, especially during initial setup. Error messages sometimes point at one of the first few tokens of the SQL, nowhere near the mistake, and there are lots of knobs to tweak, some controlled by 'magical.dotted.names.in.strings'. CLIs were sometimes easier than the GUI. But you can get a lot out of it once you've got it working!


bitsondatadev 1953 days ago

One quick detail on the Trino description: it is maintained not only by some of the initial developers but by all of the creators and the majority of contributors (https://github.com/prestodb/presto/graphs/contributors?from=...), who have still contributed the majority of the code in both Presto (https://github.com/prestodb/presto/graphs/contributors) and Trin... (https://github.com/trinodb/trino/graphs/contributors).

To really jump into this, take a look at https://trino.io/blog/2020/12/27/announcing-trino.html.

A few more stats and info:

Trino commits: 22,383
Presto commits: 18,582

Trino Slack members: 3,603
Presto Slack members: 1,575

Trino supports Iceberg: https://trino.io/docs/current/connector/iceberg.html
HDP3 support: https://github.com/trinodb/trino/issues/1218

Trino has addressed a critical security vulnerability that still exists in Presto: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-1508...

Give our repo a star if you have a sec: https://github.com/trinodb/trino/blob/master/.github/star.pn...


twotwotwo 1953 days ago

Yeah, when I was looking into this stuff it was sad to see the blog post about why Trino had to switch names. Unsurprisingly given the history, the Trino blog was the only place I could find written-up details about things like how to name files so they work with the new version of bucketing. The new features look really neat, and I hope the project continues to have success!


bitsondatadev 1952 days ago

Thanks! If you haven't already, feel free to join our Slack if you have any further questions: https://trino.io/slack.html

We're trying to improve the docs and blog about confusing topics. We also started a twitchcast to dig into various technical topics around Trino: https://www.twitch.tv/trinodb. You can catch old episodes here: https://trino.io/broadcast/episodes.html.


findepi 1952 days ago

Thanks!

You mean Hive bucketing v2? It was a fun project.


twotwotwo 1952 days ago

Yep! I had a situation where I wanted more than one file per bucket (for the sake of other code that uses the same data), and needed the improved bucketing for that.
