by pebers
2863 days ago
Definitely some ambitious goals. There's nothing wrong with that, but this has an awfully long way to go: searching for "hacker news" works fine, but searching for almost anything else didn't find anything relevant. So while it's nice to say it can run on 1 CPU / 1 GB, I'm not sure it's very useful at that size (though I don't know how big it would have to get to "break even" there). Anyway, noted that it's a very early version, so good luck with it!

Yep, I have probably messed up the relevance a bit by constantly experimenting with how to load the model/index. Right now I'm using phrases (sentences) as well as words, both extracted during tokenization. Initially I used only phrases, because with the current 65K vector-space model that would match any word to any phrase containing that word. There are perhaps side effects of reinforcing each word like that.
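To make the word-versus-phrase point concrete, here is a minimal Python sketch. It is not the engine's actual code: the `tokenize`, `vectorize`, `cosine`, and `search` helpers are hypothetical names, and a plain bag-of-words count vector with cosine similarity stands in for the real 65K vector-space model.

```python
import math
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z0-9']+")

def tokenize(text):
    """Return (phrases, words): sentences split on . ! ? plus lowercase words."""
    phrases = [p.strip() for p in re.split(r"[.!?]+", text) if p.strip()]
    words = WORD_RE.findall(text.lower())
    return phrases, words

def vectorize(text):
    """Bag-of-words count vector (an assumed stand-in for the real model)."""
    return Counter(WORD_RE.findall(text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query, docs):
    """Score the query against every phrase and every distinct word."""
    q = vectorize(query)
    hits = []
    for doc in docs:
        phrases, words = tokenize(doc)
        for unit in phrases + sorted(set(words)):
            score = cosine(q, vectorize(unit))
            if score > 0:
                hits.append((score, unit))
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(hits, key=lambda h: (-h[0], h[1]))

docs = ["Hacker News is a forum. It covers startups."]
results = search("news", docs)
for score, unit in results:
    print(f"{score:.3f}  {unit}")
```

In this toy setup a one-word query matches every phrase containing the word, but the standalone word entry always scores highest, which is one way indexing both words and phrases could skew the ranking.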
"long way to go"
I don't think so. The real bitch was figuring out how to maintain a good representation of the language model on disk: how to update it and how to remove data from it. Now I anticipate a couple of months of fine-tuning the tree balancing and testing relevance. From what I have heard so far, relevance is a little sub-par.
Scaling is the next thing. I have a great plan for that of course, mentioned somewhere in this thread.