Hacker News
by tzot 77 days ago
Well, we can use memoryview slices as the dict keys, which avoids creating string objects until it's time for the output:

    import re, operator
    def count_words(filename):
        with open(filename, 'rb') as fp:
            data= memoryview(fp.read())
        word_counts= {}
        for match in re.finditer(br'\S+', data):
            word= data[match.start(): match.end()]
            try:
                word_counts[word]+= 1
            except KeyError:
                word_counts[word]= 1
        word_counts= sorted(word_counts.items(), key=operator.itemgetter(1), reverse=True)
        for word, count in word_counts:
            print(word.tobytes().decode(), count)
We could also use `mmap.mmap`.
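
For anyone wondering why memoryview slices work as dict keys at all: read-only byte views hash and compare by content, so two slices over equal bytes land on the same key without copying anything. A minimal sketch of just that property (the sample bytes and the names `first_to`, `second_to`, `counts` are made up for illustration):

```python
data = memoryview(b"to be or not to be")

first_to = data[0:2]     # view of the first "to", no bytes copied
second_to = data[13:15]  # view of the second "to"

# Read-only byte views compare and hash by content.
assert first_to == second_to
assert hash(first_to) == hash(second_to) == hash(b"to")

counts = {}
for view in (first_to, second_to):
    counts[view] = counts.get(view, 0) + 1

# Both slices were counted under a single key.
print(len(counts), counts[first_to])
```

This only holds for read-only buffers; a memoryview over a `bytearray` is not hashable.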

This doesn't do the same thing though, since it's not Unicode aware.

    >>> 'x\u2009   a'.split()
    ['x', 'a']
    # incorrect; in bytes mode, `\S` doesn't know about unicode whitespace
    >>> list(re.finditer(br'\S+', 'x\u2009   a'.encode()))
    [<re.Match object; span=(0, 4), match=b'x\xe2\x80\x89'>, <re.Match object; span=(7, 8), match=b'a'>]
    # correct, in unicode mode
    >>> list(re.finditer(r'\S+', 'x\u2009   a'))
    [<re.Match object; span=(0, 1), match='x'>, <re.Match object; span=(5, 6), match='a'>]
OP's .split_ascii() doesn't handle U+2009 either.

edit: OP's fully native C++ version using Pystd

Hmm? Which code are you looking at?
There's bound to be a way to turn a stream of bytes into a stream of unicode code points (at least I think that's what python is doing for strings). Though I'm explicitly not volunteering to write the code for it.

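    import codecs, mmap
    from collections import Counter

    def word_count(filepath):
        freq = Counter()
        decode = codecs.getincrementaldecoder('utf-8')().decode
        tail = ''  # partial word carried over from the previous chunk
        with open(filepath, 'rb') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for chunk in iter(lambda: mm.read(65536), b''):
                text = tail + decode(chunk)
                words = text.split()
                # a word touching the end of the chunk may continue in the next one
                tail = words.pop() if words and not text[-1].isspace() else ''
                freq.update(words)
            # flush the decoder and count whatever word is still pending
            freq.update((tail + decode(b'', final=True)).split())
        return freq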
Oh that's neat, though I might split this into two functions in most cases; there's no need to entangle opening the file with counting the words in a file-like object.
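
Roughly what I have in mind, as a sketch rather than a finished implementation. The function names, the `chunk_size` and `encoding` parameters, and the carrying of a partial word across chunk boundaries are my own assumptions, not something from the code above:

```python
import codecs
from collections import Counter

def count_words_in_stream(fp, chunk_size=65536, encoding='utf-8'):
    """Count whitespace-separated words read from a binary file-like object."""
    freq = Counter()
    decode = codecs.getincrementaldecoder(encoding)().decode
    tail = ''  # partial word carried over from the previous chunk
    for chunk in iter(lambda: fp.read(chunk_size), b''):
        text = tail + decode(chunk)
        words = text.split()
        # A word touching the end of the chunk may continue in the next one.
        tail = words.pop() if words and not text[-1].isspace() else ''
        freq.update(words)
    # Flush the decoder and count whatever word is still pending.
    freq.update((tail + decode(b'', final=True)).split())
    return freq

def count_words_in_file(path):
    """Open the file in binary mode and delegate the counting."""
    with open(path, 'rb') as fp:
        return count_words_in_stream(fp)
```

The stream function then works on anything with a `read(n)` method that returns bytes, e.g. an `io.BytesIO` in tests, a socket file, or an mmap.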

That's two neat tricks that I'm definitely adding to my bag of python trickery.

Sure, but making one string from the file contents is surely much better than having a separate string per word in the original data.

... Ah, but I suppose the existing code hasn't avoided that anyway. (It's also creating regex match objects, but those get disposed each time through the loop.) I don't know that there's really a way around that. Given the file is barely a KB, I rather doubt that the illustrated techniques are going to move the needle.

In fact, it looks as though the entire data structure (whether a dict, Counter etc.) should be a relatively small part of the total reported memory usage. The rest seems to be internal Python stuff.
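
One rough way to check this, assuming `tracemalloc` is an acceptable proxy: it only sees allocations made through Python's allocators after `start()`, so the number won't match what the OS reports for the process, but it does isolate what the counting structure itself costs. The helper name `counter_allocation` and the sample text are made up for illustration:

```python
import tracemalloc
from collections import Counter

def counter_allocation(words):
    """Build a Counter from words and return it with the bytes traced while building."""
    tracemalloc.start()
    try:
        freq = Counter(words)
        # current = bytes still allocated since start(); peak is ignored here
        current, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return freq, current

# The word list is built before tracing starts, so only the Counter is measured.
words = "the cat and the hat and the bat".split()
freq, allocated = counter_allocation(words)
print(f"Counter with {len(freq)} keys: about {allocated} bytes traced")
```

Comparing that figure against the process's total usage would show how much is interpreter overhead rather than the word counts.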

I dislike loading files into memory entirely; in fact, I consider avoiding that one of the few interesting problems here (the other being how to count words in a stream of bytes without converting the whole thing to a string).

If you don't care about efficiency you can just do len(set(text.split())), but that's barely worth making a function for.

For reasons I never quite understood python has a collections.Counter for the purpose of counting things. It's a bit cleaner.
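
To make the difference concrete, here's a small sketch with a made-up sample string: `len(set(...))` only gives the number of distinct words, the manual dict gives frequencies, and `Counter` gives the same frequencies plus helpers like `most_common`.

```python
from collections import Counter

text = "the cat and the hat and the bat"

# Number of distinct words only, no frequencies.
distinct = len(set(text.split()))

# Manual counting with a plain dict.
manual = {}
for word in text.split():
    manual[word] = manual.get(word, 0) + 1

# Same counts via Counter, with sorting by frequency built in.
counted = Counter(text.split())
top_two = counted.most_common(2)

print(distinct, top_two)
```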
> It's a bit cleaner.

That's pretty much the reason why. Raymond Hettinger explains the philosophy well while discussing the `random` standard library module: https://www.youtube.com/watch?v=Uwuv05aZ6ug

I feel like much of this has been forgotten of late, though. From what I've seen, it's really quite hard to get anything added to the standard library unless you're a core dev who's sufficiently well liked among other core devs, in which case you can pretty much just do it. Everyone else will (understandably) be put through a PhD thesis defense, then asked to try the idea out as a PyPI package first (and somehow also popularize the package), and then if it somehow catches on that way, get declined anyway because it's easy for everyone to just get it from PyPI (see e.g. Requests).

I personally was directed to PyPI once when I was proposing new methods for the builtin `str`, where the entire point was not to have to import or instantiate anything.