| HN Mirror


by aleksiy123 1508 days ago

Seems like you should just go with what you know best.

Taking 10x longer doesn't seem like a language problem. If you don't know bash well you're going to take even longer to do it in bash than in python.

In any case, the task you described is pretty much the same in python as in bash. At worst the python is going to be more verbose.

   python -c "print(len(set(w for l in list(open('test.txt')) for w in l.split())))"

   tr ' ' '\n' < test.txt | sort -u | wc -l


t43562 1507 days ago

The shell's advantage is that most of the pipeline components don't need to read the whole file in, so the pipeline can potentially operate on much larger files without running out of memory. I think only "sort" is problematic, and at least it's a merge sort.

In Python you could use a generator, but it would get a little more complicated, and you'd still have to add all the words to a set(). Hopefully the number of different words is not that great.

The trie approach is quite memory efficient and that can matter.


aleksiy123 1506 days ago

I'm fairly sure the file object returned by `open` is iterated lazily, line by line, and doesn't load the whole file into memory. So you wouldn't hit a memory error unless, like you said, the number of unique words is high enough.


t43562 1506 days ago

I think you're right, but I believe wrapping it in `list(...)` is what forces the whole file into memory.


aleksiy123 1506 days ago

Yeah, you're right, that's my mistake.

I think you can just omit it but yeah...
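
For reference, a rough sketch of what dropping the `list(...)` wrapper might look like when written out as a small script. The function names `iter_words` and `count_unique_words` are made up for illustration, and UTF-8 input is an assumption. The file is read one line at a time, but the set still has to hold every distinct word.

```python
import sys


def iter_words(path):
    """Yield whitespace-separated words one at a time (hypothetical helper)."""
    # Assumption: the input is UTF-8 text.
    with open(path, encoding="utf-8") as f:
        # Iterating the file object yields one line per step,
        # so the whole file is never held in memory at once.
        for line in f:
            yield from line.split()


def count_unique_words(path):
    """Count distinct words; memory grows with vocabulary size, not file size."""
    return len(set(iter_words(path)))


if __name__ == "__main__" and len(sys.argv) > 1:
    print(count_unique_words(sys.argv[1]))
```

Memory use here should scale with the number of distinct words rather than with the size of the file, which is the trade-off discussed above.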
