| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by pdonis 5116 days ago

You're correct, the Python version is O(N) instead of O(1) in RAM usage for two reasons, one easily fixed and one not. Sorry if I'm belaboring the point, but I think it's worth going into a little more detail.

The easily fixable issue is that the shell pipeline buffers reads, whereas my Python version just uses sys.stdin.read() for simplicity. I could have buffered the reads by using Python's generators/iterators instead. However, that alone isn't enough to get O(1) RAM usage.

The not so easily fixable issue is that the Unix sort command uses temporary files in order to not have to have all of its data in memory at once. See, for example, here:

http://vkundeti.blogspot.com/2008/03/tech-algorithmic-detail...

Python's sorted builtin doesn't work like this; it can take a generator as input, but it returns a list. This is one respect in which Python's built-in functionality lacks a feature that the Unix utilities have. I don't know if anyone has tried to re-implement the Python sorted function to use temporary files and return a generator to reduce memory usage.

[Edit: the Python version would also have to re-implement the uniq function to take a generator as input and return a generator, which would, I believe, also require temporary files.]

2 comments

websiteguy 5116 days ago

Unix sort is a merge sort, using fixed RAM uniq -c is line by line sed is line by line

A superior solution in every way

If you were going to do this in Python (or another similar language), you need to write a sort function that operates on a file, not a Python collection, as the collection is always bound by RAM, which would at best be re-implementing the sort command. Once you can sort a file, everything else is trivial, as counts can be done line by line, or more siply, via uniq -c.

Of course, if you only care about things that fit in memory, you can do it in Python, but it is still far easier to use command line for these type of problems.

link

pdonis 5116 days ago

> Of course, if you only care about things that fit in memory, you can do it in Python, but it is still far easier to use command line for these type of problems.

I agree; I was not trying to claim that my Python solution should be used in preference to the shell pipeline solution in any kind of "production" environment. As I noted in another comment, Knuth's Pascal solution appears to be open to the same criticism.

link

pdonis 5116 days ago

Looking at the original paper that was linked to elsewhere in the comments here, it appears that Knuth's original Pascal program is, like my Python version, O(N) in RAM. It keeps everything in memory and uses no temporary files, as far as I can tell.

link