by baud9600
80 days ago
Strange days we live in. Python and C++? What about a line of bash:

  tr -s '[:space:]' '\n' < file.txt | sort | uniq -c | sort -rn

I’d like to know the memory profile of this. The bottleneck is obviously sort, which buffers everything in memory. So if we replace this with awk using a hash map to keep count of unique words, then it’s a much smaller data set in memory:

  tr -s '[:space:]' '\n' < file.txt | awk '{c[$0]++} END{for(w in c) print c[w], w}' | sort -rn

I’m guessing this will beat Python and C++?
That's not obvious to me. I checked the manuals for sort(1) in GNU and FreeBSD, and neither of them buffers everything in memory by default. Instead they read chunks into an in-memory buffer, sort each chunk, and (if there are multiple chunks) use the filesystem as temporary storage for an external mergesort.
This sorting program was originally developed with memory-starved computers in mind, and the legacy shows.
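
To make the chunk-then-merge idea concrete, here is a rough sketch of an external mergesort built out of ordinary tools. It is only an illustration of the technique, not how GNU or FreeBSD sort is implemented internally. The function name external_sort and the default chunk size are made up for this example, and it assumes mktemp -d is available.

```shell
#!/bin/sh
# Hypothetical helper (not part of any real tool): sort a file by sorting
# bounded-size chunks on disk and then merging them.
# Usage: external_sort FILE [LINES_PER_CHUNK]
external_sort() {
    input=$1
    chunk_lines=${2:-100000}   # assumed default, pick whatever fits in memory
    tmpdir=$(mktemp -d) || return 1

    # 1. Split the input into chunk files of at most $chunk_lines lines each.
    #    -a 4 gives four-letter suffixes so many chunks can be created.
    split -a 4 -l "$chunk_lines" "$input" "$tmpdir/chunk."

    # 2. Sort each chunk on its own, in place. Only one chunk is being
    #    sorted at any time.
    for f in "$tmpdir"/chunk.*; do
        [ -e "$f" ] || continue   # glob matched nothing (empty input)
        LC_ALL=C sort -o "$f" "$f"
    done

    # 3. Merge the already-sorted chunks; -m merges without re-sorting.
    if ls "$tmpdir"/chunk.* >/dev/null 2>&1; then
        LC_ALL=C sort -m "$tmpdir"/chunk.*
    fi

    rm -rf "$tmpdir"
}
```

Something like external_sort words.txt 50000 should print the same lines as LC_ALL=C sort words.txt. In practice you wouldn't need this, since sort does the equivalent itself; GNU sort exposes -S to set the buffer size and -T to choose the temporary directory if you want to experiment with the memory profile.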