|
|
|
|
|
by websiteguy
5069 days ago
|
|
Maybe I am missing something here, but The python version is:
O(N) sequential disk access
O(NlogN) cpu
O(N) RAM The shell script is the same for disk and cpu, however, it is O(1) of RAM, and therefor can operate on input of size limited to disk, not RAM |
|
The easily fixable issue is that the shell pipeline buffers reads, whereas my Python version just uses sys.stdin.read() for simplicity. I could have buffered the reads by using Python's generators/iterators instead. However, that alone isn't enough to get O(1) RAM usage.
The not so easily fixable issue is that the Unix sort command uses temporary files in order to not have to have all of its data in memory at once. See, for example, here:
http://vkundeti.blogspot.com/2008/03/tech-algorithmic-detail...
Python's sorted builtin doesn't work like this; it can take a generator as input, but it returns a list. This is one respect in which Python's built-in functionality lacks a feature that the Unix utilities have. I don't know if anyone has tried to re-implement the Python sorted function to use temporary files and return a generator to reduce memory usage.
[Edit: the Python version would also have to re-implement the uniq function to take a generator as input and return a generator, which would, I believe, also require temporary files.]