Hacker News new | ask | show | jobs
by misterman0 2863 days ago
Thank you!

Yep, I have probably messed up the relevancy a bit because of constantly experimenting with how to load the model/index. Right now I'm using phrases (sentences) as well as words, both extracted during the tokenization process. Initially I used only phrases because using the current 65K vector-space model that would match any word to any phrase containing that word. There are perhaps sideeffects of reinforcing each word like that.

"long way to go"

I don't think so. The real bitch was to figure out how to maintain a good representation of the language model on disk. How to update it. Remove data from it. Now I anticipate a couple of months fine-tuning the balancing of the tree and testing relevance. From what I have heard so far, relevance is a little sub-par.

Scaling is the next thing. I have a great plan for that of course, mentioned somewhere in this thread.