by jepcommenter 2234 days ago
Since you sort by the first field anyway, could you try omitting the field split (-t, -k1)? For me it gives a noticeable improvement:

$ stat --printf="%s\n" p.csv
1258291200

$ time sort -t, -k1 -S100% -o sorted.csv p.csv
real 0m50,186s
user 4m6,962s
sys 0m4,562s

$ time sort -o sorted.csv p.csv
real 0m43,483s
user 3m36,473s
sys 0m4,282s

3 comments

A full-line comparison would probably use memcmp, which bails out at the first non-matching character, whereas the field-splitting overhead might be significant.
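A rough sketch of why dropping the key options should be safe here, assuming GNU sort and a tiny made-up sample file (not the p.csv above): -k1 without an end field makes the key run from field 1 to the end of the line, so both invocations compare whole lines and should produce the same output. LC_ALL=C is set only to make the byte order of this example predictable.

```shell
# Hypothetical sample data, only for illustration.
tmp=$(mktemp -d)
printf '%s\n' 'b,2' 'a,10' 'a,9' 'c,1' > "$tmp/sample.csv"

# Key starts at field 1 and, with no end field given, extends to end of line.
LC_ALL=C sort -t, -k1 -o "$tmp/with_key.csv" "$tmp/sample.csv"

# No key at all: plain whole-line comparison.
LC_ALL=C sort -o "$tmp/whole_line.csv" "$tmp/sample.csv"

# cmp exits 0 when the two files are byte-for-byte identical.
cmp "$tmp/with_key.csv" "$tmp/whole_line.csv" && echo "identical output"
```

If the two outputs match on your real file as well, any difference in run time comes from how the lines are compared, not from what is being compared.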
The exact same file has already been taking more than 35 minutes to sort, so it's slower without splitting.

Edit: it finished!

real 35m28.370s
user 40m17.129s
sys 4m31.081s

Where did you find the dataset, or did you construct your own?
It's a dataset related to balance changes of Bitcoin addresses, downsampled to daily resolution.

You could extract it from BigQuery's public Bitcoin data.