by nowarninglabel
5006 days ago
Yes, but it does take quite some time to get that set up. I remember it took around two days on my server to get the data imported into MySQL. That said, once the import is done, searching is a relatively solved problem, so I'd question the value of a C library. I suppose it would be useful if you didn't or couldn't put the data into a database, or if you needed to parse new dumps all the time and didn't want to wait.
|
The goal of my library is to enable quick data mining on Wikipedia. Search is just one use case. As an example, you might want to build a content classifier that automatically sorts web pages into Wikipedia categories (politics, sports, etc.). To do that, you would need to parse wiki pages and extract features (like n-grams) for each category. The C library transforms plain wiki text into a parsed object that you can use to extract whatever information you want. The only advantage is that it does this incredibly fast.
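
To make the classifier example concrete, here is a rough sketch of the feature-extraction step in Python. It is not the C library's API: `word_ngrams`, `category_features`, and the sample `pages` list are hypothetical names made up for illustration, and the sketch assumes the wiki markup has already been stripped down to plain text by whatever parser you use.

```python
import re
from collections import Counter


def word_ngrams(text, n=2):
    """Count lowercase word n-grams in a piece of plain text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of n words across the token list.
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def category_features(pages, n=2):
    """Aggregate n-gram counts per category.

    `pages` is an iterable of (category, plain_text) pairs. This input
    shape is an assumption, not something the library defines.
    """
    features = {}
    for category, text in pages:
        features.setdefault(category, Counter()).update(word_ngrams(text, n))
    return features


# Made-up sample data standing in for already-parsed wiki pages.
pages = [
    ("politics", "The senate passed the bill. The senate adjourned."),
    ("sports", "The team won the match."),
]

features = category_features(pages)
print(features["politics"].most_common(1))
```

The per-category counts could then be fed into whatever classifier you like. The slow part in practice is turning the dump into clean text in the first place, which is where a fast parser helps.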