by syllogism
3377 days ago
Scikit-Learn's text vectorizers are good. In spaCy you can do:

  import spacy
  import numpy
  from spacy.attrs import LOWER, IS_STOP

  nlp = spacy.load('en')
  doc = nlp(u'The quick brown fox...')
  # to_array returns one row per token and one column per attribute.
  array = doc.to_array([LOWER, IS_STOP])
  # LOWER IDs of tokens whose IS_STOP flag is set (use == 0 for the rest).
  content = array[array[:, 1] != 0, 0]
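To make the indexing concrete without loading a model, here is a minimal sketch that uses a hand-built array in the shape `doc.to_array` returns: one row per token, one column per attribute. The ID values and the variable names `stop_ids` and `content_ids` are made up for illustration; real IDs come from spaCy's string store.

```python
import numpy

# Hypothetical stand-in for doc.to_array([LOWER, IS_STOP]):
# column 0 holds made-up LOWER IDs, column 1 holds the IS_STOP flag.
array = numpy.array(
    [[501, 1],   # "the"   (stop word)
     [502, 0],   # "quick"
     [503, 0],   # "brown"
     [504, 0]],  # "fox"
    dtype=numpy.uint64,
)

is_stop = array[:, 1] != 0         # boolean mask, one entry per token
stop_ids = array[is_stop, 0]       # LOWER IDs of the stop tokens
content_ids = array[~is_stop, 0]   # LOWER IDs of everything else
```

Selecting rows with a boolean mask and then picking column 0 keeps the whole operation inside numpy, with no Python-level loop over tokens.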
Personally, I normally work in Cython when it needs to be fast. I find this more productive and more readable than trying to guess which numpy operations will be fast. So I would do:

  # content must have room for doc.length entries; slots for tokens
  # that fail the check are left untouched.
  cdef void get_tokens(uint64_t* content, Doc doc) nogil:
      cdef int i
      cdef const TokenC* token
      for i in range(doc.length):
          token = &doc.c[i]
          if Lexeme.c_check_flag(token.lex.flags, IS_STOP):
              content[i] = token.lex.lower
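For readers without a Cython toolchain, the same control flow can be sketched in plain Python over a preallocated numpy buffer. This only illustrates the loop and is not spaCy's API: the `(lower_id, is_stop)` pairs and this `get_tokens` signature are hypothetical stand-ins for the C-level token structs.

```python
import numpy

def get_tokens(content, tokens):
    """Write the lower ID of each flagged token into content, in place."""
    for i, (lower_id, is_stop) in enumerate(tokens):
        if is_stop:
            content[i] = lower_id

# Made-up IDs: two occurrences of a stop word around two other tokens.
tokens = [(501, True), (502, False), (503, False), (501, True)]
content = numpy.zeros(len(tokens), dtype=numpy.uint64)
get_tokens(content, tokens)
```

Because the buffer is allocated by the caller, positions for tokens that fail the check keep whatever value they started with (zero here).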