Hacker News new | ask | show | jobs
by mjn 4494 days ago
If you need only a subset of full SQL, e.g. just joins, counting/aggregation, and date manipulation (as in several of these examples), I've found it fairly easy to work with a mixture of the standard Unix join(1) [found on nearly all systems], and some of the additions from Google's crush-tools (https://code.google.com/p/crush-tools/), mainly 'aggregate', 'grepfield', 'funiq', and 'convdate'. I find chaining them together a bit easier than writing monolithic SQL statements, but there's probably some crossover point at which that wouldn't be true.

It'd be interesting to compare runtimes as well. I would guess that there's some overhead in loading into the DB up front, but that you might gain some speedup by converting longer chains of Unix pipes into one SQL query. On the other hand you might lose some parallelism. Would take some testing on different kinds of queries and data sets to get an idea of the pros/cons I think.

1 comments

Another interesting alternative could be using or writing a PostgreSQL foreign data wrapper.

There is one that supports tabular data (file_fdw), and another one for JSON files (json_fdw). If you have files in other formats, you can also write your fdw for it. This way, you get complete SQL coverage.

Also, if you don't want to pay the overhead of parsing the file every time, you can use the new materialized feature for caching: http://www.postgresql.org/docs/9.3/static/rules-materialized...

(Disclaimer: Enthused Postgres user.)