by pebers 2863 days ago
Definitely some ambitious goals. There's nothing bad about that, but this has an awfully long way to go - e.g. searching for "hacker news" worked fine, but searching for almost anything else didn't find anything relevant. So while it's nice to say it can run in 1 CPU / 1 GB, I'm not sure it's very useful at that size (but I don't know how big it'd have to get to "break even" there).

Anyway, noted that it's a very early version, so good luck with it!

1 comment

Thank you!

Yep, I have probably messed up the relevancy a bit because of constantly experimenting with how to load the model/index. Right now I'm using phrases (sentences) as well as words, both extracted during the tokenization process. Initially I used only phrases, because with the current 65K vector-space model that would match any word to any phrase containing that word. Indexing the words as well perhaps has side effects, since each word gets reinforced that way.
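To make the phrase-plus-word idea concrete, here is a minimal sketch in Python. It is not the actual implementation: the function names (`tokenize`, `to_vector`, `cosine`, `search`) are hypothetical, the vectors are plain sparse bag-of-words counts rather than the 65K-dimension model mentioned above, and sentences are split naively on `.`, `!` and `?`. It only illustrates how a single query word can match both its own word entry and every phrase that contains it.

```python
import math
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Return (phrases, words): naive sentence split plus lowercase word tokens."""
    phrases = [p.strip().lower() for p in re.split(r"[.!?]+", text) if p.strip()]
    words = WORD_RE.findall(text.lower())
    return phrases, words


def to_vector(term):
    """Sparse bag-of-words vector for a word or a phrase (assumed representation)."""
    return Counter(WORD_RE.findall(term.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, text):
    """Score every indexed term (phrase or word) against the query, best first."""
    phrases, words = tokenize(text)
    q = to_vector(query)
    terms = set(phrases) | set(words)
    scored = [(cosine(q, to_vector(t)), t) for t in terms]
    hits = [(s, t) for s, t in scored if s > 0]
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(hits, key=lambda st: (-st[0], st[1]))


if __name__ == "__main__":
    doc = "Hacker News is a site. The news is good."
    for score, term in search("news", doc):
        print(f"{score:.3f}  {term}")
```

In this toy setup the query "news" hits the standalone word entry with the top score and then each phrase containing it, with shorter phrases ranking higher. That double hit is one way the word "reinforcement" described above could skew relevance, though whether the real index behaves like this is an assumption.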

"long way to go"

I don't think so. The real bitch was to figure out how to maintain a good representation of the language model on disk: how to update it, how to remove data from it. Now I anticipate a couple of months of fine-tuning the balancing of the tree and testing relevance. From what I have heard so far, relevance is a little sub-par.

Scaling is the next thing. I have a great plan for that of course, mentioned somewhere in this thread.