
Scikit-learn's text vectorizers are good. In spaCy you can do:

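    import spacy
    from spacy.attrs import LOWER, IS_STOP

    # 'en' is the old shortcut name; newer spaCy needs a package name like 'en_core_web_sm'
    nlp = spacy.load('en')
    doc = nlp(u'The quick brown fox...')
    # to_array returns shape (n_tokens, n_attrs): column 0 is LOWER, column 1 is IS_STOP
    array = doc.to_array([LOWER, IS_STOP])
    # LOWER IDs of the tokens flagged IS_STOP; use == 0 instead to keep the non-stop words
    content = array[array[:, 1] != 0, 0]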
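The masking step can be tried without spaCy installed. `doc.to_array` gives one row per token and one column per requested attribute, so picking tokens by a flag is a boolean mask over rows. A minimal sketch, with a hand-built array standing in for the spaCy output (the ID values and the names `stop_ids` and `content_ids` are made up for illustration):

```python
import numpy

# Stand-in for doc.to_array([LOWER, IS_STOP]). The IDs are invented;
# real ones come from spaCy's string store.
# Column 0 = LOWER id, column 1 = IS_STOP flag.
array = numpy.array(
    [[501, 1],   # "the"   (stop word)
     [502, 0],   # "quick"
     [503, 0],   # "brown"
     [504, 0]],  # "fox"
    dtype=numpy.uint64,
)

is_stop = array[:, 1] != 0          # boolean mask, one entry per token
stop_ids = array[is_stop, 0]        # LOWER ids of the stop words
content_ids = array[~is_stop, 0]    # LOWER ids of everything else

print(stop_ids.tolist())     # [501]
print(content_ids.tolist())  # [502, 503, 504]
```

With a real `Doc`, the IDs can be turned back into text through the vocab's string store.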

Personally, I normally work in Cython when something needs to be fast. I find that more productive and more readable than trying to guess which numpy operations will be fast. So I would write:

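    # Needs cimports: uint64_t (libc.stdint), Doc (spacy.tokens.doc), TokenC (spacy.structs), Lexeme (spacy.lexeme), IS_STOP (spacy.attrs)
    cdef void get_tokens(uint64_t* content, Doc doc) nogil:
        cdef int i                  # C-typed loop index, required under nogil
        cdef const TokenC* token
        for i in range(doc.length):
            token = &doc.c[i]
            # Assumes c_check_flag(const LexemeC*, attr_id_t), as declared in spacy/lexeme.pxd
            if Lexeme.c_check_flag(token.lex, IS_STOP):
                content[i] = token.lex.lower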