|
|
|
|
|
by pdonis
5070 days ago
|
|
You're correct, the Python version is O(N) instead of O(1) in RAM usage for two reasons, one easily fixed and one not. Sorry if I'm belaboring the point, but I think it's worth going into a little more detail. The easily fixable issue is that the shell pipeline buffers reads, whereas my Python version just uses sys.stdin.read() for simplicity. I could have buffered the reads by using Python's generators/iterators instead. However, that alone isn't enough to get O(1) RAM usage. The not so easily fixable issue is that the Unix sort command uses temporary files in order to not have to have all of its data in memory at once. See, for example, here: http://vkundeti.blogspot.com/2008/03/tech-algorithmic-detail... Python's sorted builtin doesn't work like this; it can take a generator as input, but it returns a list. This is one respect in which Python's built-in functionality lacks a feature that the Unix utilities have. I don't know if anyone has tried to re-implement the Python sorted function to use temporary files and return a generator to reduce memory usage. [Edit: the Python version would also have to re-implement the uniq function to take a generator as input and return a generator, which would, I believe, also require temporary files.] |
|
A superior solution in every way
If you were going to do this in Python (or another similar language), you need to write a sort function that operates on a file, not a Python collection, as the collection is always bound by RAM, which would at best be re-implementing the sort command. Once you can sort a file, everything else is trivial, as counts can be done line by line, or more siply, via uniq -c.
Of course, if you only care about things that fit in memory, you can do it in Python, but it is still far easier to use command line for these type of problems.