by jakjak123 980 days ago
I have done some similar, simpler data wrangling with xsv (https://github.com/BurntSushi/xsv) and jq. It could process my 800M rows in a couple of minutes (plus the time to read it out from the database =)
2 comments

A long time ago, we were trying to compare a couple of tables with a few hundred million rows each to see whether the differences (due to a new way of processing) were allowable. Our local Oracle Boy whipped up a query, set it running, and we all sat around for hours whilst it churned - end result being we could do one comparison a day. After a while, I experimented with dumping the tables as CSV, running them through `sort`, and then using some Perl to compare each paired (or unpaired!) line, with some heuristics for quick rejection. That all took about 1-2 hours, meaning we could get through three, maybe four, tests a day instead.
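
For anyone curious what the sort-then-compare step looks like, here is a minimal sketch in Python rather than the original Perl (which isn't shown). It assumes both dumps are sorted on a unique key in the first column, and that the sort order matches Python's string comparison (e.g. `sort` run with `LC_ALL=C`). The function name `compare_sorted` and the "whole row equal means skip" quick rejection are my own stand-ins, not the heuristics actually used.

```python
import csv
import io


def compare_sorted(left, right, key=lambda row: row[0]):
    """Walk two key-sorted row iterators in lockstep and yield differences.

    Assumes keys are unique within each input and that both inputs are
    sorted in the same order Python uses to compare the keys.
    """
    left, right = iter(left), iter(right)
    a, b = next(left, None), next(right, None)
    while a is not None or b is not None:
        if b is None or (a is not None and key(a) < key(b)):
            # Row exists only in the left dump.
            yield ("only_left", a, None)
            a = next(left, None)
        elif a is None or key(b) < key(a):
            # Row exists only in the right dump.
            yield ("only_right", None, b)
            b = next(right, None)
        else:
            # Paired rows: identical rows are rejected quickly,
            # anything else is reported for closer inspection.
            if a != b:
                yield ("changed", a, b)
            a, b = next(left, None), next(right, None)


# Tiny in-memory stand-ins for the two sorted CSV dumps.
old = csv.reader(io.StringIO("1,a\n2,b\n4,d\n"))
new = csv.reader(io.StringIO("1,a\n2,B\n3,c\n"))
diffs = list(compare_sorted(old, new))
for kind, left_row, right_row in diffs:
    print(kind, left_row, right_row)
```

Because each file is read once, front to back, memory use stays flat no matter how many rows there are; the expensive part is the external sort, which `sort` handles on disk.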
qsv has more features nowadays:

https://github.com/jqnatividad/qsv