HN Mirror

by baud9600 126 days ago

Strange days we live in. Python and C++? What about a line of bash:

tr -s '[:space:]' '\n' < file.txt | sort | uniq -c | sort -rn

I’d like to know the memory profile of this. The bottleneck is obviously sort which buffers everything in memory. So if we replace this with awk using a hash map to keep count of unique words, then it’s a much smaller data set in memory:

tr -s '[:space:]' '\n' < file.txt | awk '{c[$0]++} END{for(w in c) print c[w], w}' | sort -rn

I’m guessing this will beat Python and C++?
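Before comparing speed or memory, it seems worth checking that the two one-liners agree. A minimal sketch, assuming a small made-up sample file (the `sample` variable and the `count_sort` / `count_awk` function names are hypothetical, not from the post):

```shell
# Hypothetical sample input; substitute any text file.
sample=$(mktemp)
printf 'the cat the dog\nthe  cat\n' > "$sample"

# Sort-based pipeline from the post: sort groups equal words, uniq -c counts each run.
count_sort() {
  tr -s '[:space:]' '\n' < "$1" | sort | uniq -c | sort -rn
}

# awk-based pipeline from the post: one array entry per distinct word.
count_awk() {
  tr -s '[:space:]' '\n' < "$1" |
    awk '{c[$0]++} END{for(w in c) print c[w], w}' | sort -rn
}

# uniq -c pads its counts with leading spaces, so normalize before comparing.
count_sort "$sample" | awk '{print $1, $2}'
count_awk "$sample"
```

Words with equal counts may come out in a different order between the two, since `for(w in c)` in awk has no defined iteration order. For the memory question, wrapping a single stage in GNU time's `-v` mode (where it is installed) should report that stage's maximum resident set size.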

3 comments

pjscott 126 days ago

> I’d like to know the memory profile of this. The bottleneck is obviously sort which buffers everything in memory.

That's not obvious to me. I checked the manuals for sort(1) in GNU and FreeBSD, and neither of them buffers everything in memory by default. Instead they read chunks into an in-memory buffer, sort each chunk, and (if there are multiple chunks) use the filesystem as temporary storage for an external mergesort.

This sorting program was originally developed with memory-starved computers in mind, and the legacy shows.
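To make the chunk-then-merge idea concrete, here is a rough sketch of an external mergesort assembled by hand from `split` and `sort -m`. This is an illustration of the technique only, not how any sort(1) implementation is actually written; the function name `external_sort` and its arguments are hypothetical, and it assumes `mktemp -d` is available.

```shell
# external_sort FILE LINES_PER_CHUNK  (hypothetical helper)
# Sorts FILE while never handing sort more than LINES_PER_CHUNK lines at once.
external_sort() {
  input=$1
  chunk_lines=$2
  # Nothing to do for an empty file (split would create no chunks).
  [ -s "$input" ] || return 0
  workdir=$(mktemp -d)
  # 1. Cut the input into chunks of at most chunk_lines lines each.
  split -l "$chunk_lines" "$input" "$workdir/chunk."
  # 2. Sort each chunk independently, in place.
  for chunk in "$workdir"/chunk.*; do
    sort -o "$chunk" "$chunk"
  done
  # 3. Merge the already-sorted chunks; -m merges without re-sorting.
  sort -m "$workdir"/chunk.*
  rm -rf "$workdir"
}
```

Something like `external_sort file.txt 100000` should print the same lines as a plain `sort file.txt`, with the chunks living on disk in between.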

knome 126 days ago

>which buffers everything in memory

GNU sort can spill to disk. It has a --buffer-size option if you want to manually control the RAM buffer size, and a --temporary-directory option to tell it where to spill data during the sort if need be.
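As a sketch of how those two options might slot into the original one-liner (the helper name `count_words_bounded` and the example values are made up for illustration):

```shell
# count_words_bounded FILE BUFFER_SIZE TMPDIR  (hypothetical helper)
# Same word count as the original pipeline, but the first sort is given an
# explicit memory buffer size and a directory for its temporary files.
count_words_bounded() {
  tr -s '[:space:]' '\n' < "$1" |
    sort --buffer-size="$2" --temporary-directory="$3" |
    uniq -c | sort -rn
}
```

For example, `count_words_bounded file.txt 64M /var/tmp`. Only the first sort gets the options here, since the final `sort -rn` sees just one line per distinct word.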

alok-g 126 days ago

Isn't using bash effectively saying, "I have a bunch of functions already written in, say, C, which I'll use but won't count toward the lines of code"? You could do the same in C or C++ too.

In other words, I am not sure if the comparison you are making is a fundamental one.