by zeratul 3375 days ago
Ask HN: Could you suggest a fast library, in any programming language, for converting documents into a sparse matrix representation (e.g., COO or CSR)? I'm guessing a C implementation beats most others? There is also the issue of efficient n-gram hashing/indexing.
2 comments

Scikit-Learn's text vectorizer stuff is good. In spaCy you can do:

    import spacy
    import numpy
    from spacy.attrs import LOWER, IS_STOP

    nlp = spacy.load('en')
    doc = nlp(u'The quick brown fox...')
    array = doc.to_array([LOWER, IS_STOP])
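    # to_array gives one row per token and one column per attribute, so
    # mask rows on the IS_STOP column (1) and keep the LOWER column (0).
    # Use `== 0` instead of `!= 0` to drop the stop words rather than keep them.
    content = array[array[:, 1] != 0, 0]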
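To get from integer token IDs like these to a sparse matrix, one option is to count IDs per document and emit COO triplets. This is only a rough sketch in plain Python: `docs_to_coo` is a made-up helper name, and the token IDs below are invented stand-ins for the LOWER column of each document.

```python
from collections import Counter

def docs_to_coo(docs):
    """Turn lists of integer token IDs into COO triplets (rows, cols, counts)."""
    rows, cols, counts = [], [], []
    for row, token_ids in enumerate(docs):
        # Count each ID once per document, then emit one triplet per distinct ID.
        for token_id, count in sorted(Counter(token_ids).items()):
            rows.append(row)
            cols.append(token_id)
            counts.append(count)
    return rows, cols, counts

# Hypothetical token IDs, one list per document (the last document is empty).
docs = [[7, 3, 7], [3], []]
rows, cols, counts = docs_to_coo(docs)
```

If SciPy is available, the three lists should be usable as `scipy.sparse.coo_matrix((counts, (rows, cols)))`. If the IDs are very large or sparse, you would probably want to remap them to a dense column range first.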
Personally I normally work in Cython when it needs to be fast. I find this more productive and more readable than trying to guess what numpy operations will be fast. So I would be doing:

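    # Assumes uint64_t, Doc, TokenC, Lexeme and IS_STOP have been cimported.
    cdef void get_tokens(uint64_t* content, Doc doc) nogil:
        cdef int i
        cdef const TokenC* token
        for i in range(doc.length):
            token = &doc.c[i]
            if Lexeme.c_check_flag(token.lex.flags, IS_STOP):
                content[i] = token.lex.lower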
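On the n-gram hashing part of the question, the usual trick is to hash each n-gram straight to a column index, so no vocabulary has to be stored. Below is a minimal pure-Python sketch of that idea that fills CSR arrays directly. The function names, the choice of `zlib.crc32` as the hash, and the `n_features` default are all assumptions for illustration, not any library's implementation; the same loop structure is what you would port to Cython or C for speed.

```python
import zlib

def hashed_ngrams(tokens, n, n_features):
    """Yield a column index for every n-gram of `tokens` (hashing trick)."""
    for i in range(len(tokens) - n + 1):
        ngram = " ".join(tokens[i:i + n]).encode("utf-8")
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        yield zlib.crc32(ngram) % n_features

def docs_to_csr(docs, n=2, n_features=2 ** 18):
    """Build CSR arrays (data, indices, indptr) of hashed n-gram counts."""
    data, indices, indptr = [], [], [0]
    for tokens in docs:
        counts = {}
        for col in hashed_ngrams(tokens, n, n_features):
            counts[col] = counts.get(col, 0) + 1
        # Emit this row's columns in sorted order, then close the row.
        for col in sorted(counts):
            indices.append(col)
            data.append(counts[col])
        indptr.append(len(indices))
    return data, indices, indptr

# Hypothetical pre-tokenized documents (the last one is empty).
docs = [["the", "quick", "brown", "fox"], ["the", "quick", "the", "quick"], []]
data, indices, indptr = docs_to_csr(docs)
```

With SciPy installed, the arrays should be accepted as `scipy.sparse.csr_matrix((data, indices, indptr), shape=(len(docs), 2 ** 18))`. Distinct n-grams that collide share a column, so a larger `n_features` trades memory for fewer collisions.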
spaCy seems quite fast while being reasonably accurate: https://spacy.io/docs/api/#benchmarks