Hacker News new | ask | show | jobs
by jillesvangurp 2757 days ago
If you have json line formatted stuff (or csv) and an aws account, you can do some nice things with Athena and SQL. We have a few simple backoffice tools that I've implemented around simple sql queries on data dumped from various systems that we have in json format. Awesome, if you want to do some quick selects, joins, etc.

If you are going to process this amount of data, don't load it all into memory and process line by line. Also do that concurrently if you have more than one CPU core available. I've done this with ruby, python, Java, and misc shell tools like jq. Use what you are comfortable with and what gets results quickly.

One neat trick with jq is to use it to convert json objects to csv and to then pipe that into csvkit for some quick and dirty sql querying. Generally gets tedious beyond a few hundred MB. I recommend switching to Athena or something similar if that becomes a regular thing for you.

1 comments

OP here— good point! We actually use Athena to query these exports in S3 to debug data drift of specific export objects over time. It's quite a useful tool, I was able to go knowing basically nothing about Athena to querying gzipped newline-delimited JSON files in S3 using SQL in about an hour.