What is the story for multi-language corpora? Do I have to do my own stop-word pruning, tokenizing, lemmatizing, etc.? That is usually the case with full-text search solutions, and it is a pain.
Re: stemming and lemmatization, I just want to plug the most impressive NLP stack I've ever used, ChatScript. It's really for building dialog trees: it walks down a branch of conversation using what are effectively switch statements, but with really rich conceptual pattern matching and capturing. So somewhere in the middle of the stack it does an excellent job of abstracting from word input to general concepts (in WordNet), performing all the spell correction (according to your dictionary), stemming, lemmatization, and disambiguation.
I've had it in mind for a while to build a fuzzy search tool based on parsing each phrase into concepts, parsing the search query into concepts, and finding the nearest match based on that. ChatScript is a C library and very fast.
Looks like it hasn't been committed to in some time, I'll have to check out their blog and see what's up. I guess with the advent of LLMs, dialog trees are passé.
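To make the concept-matching search idea above a bit more concrete, here is a rough Python sketch. It does not use ChatScript or WordNet at all: the `CONCEPTS` table, `to_concepts`, `jaccard`, and `nearest` are all made-up names for illustration, and a tiny hand-written word-to-concept map stands in for the real abstraction step (no spell correction, stemming, or disambiguation). It only shows the shape of "map both sides to concepts, then pick the closest phrase by set overlap".

```python
# Toy stand-in for a real word -> concept mapping (hypothetical data,
# not taken from ChatScript or WordNet).
CONCEPTS = {
    "dog": "canine", "puppy": "canine", "hound": "canine",
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "buy": "acquire", "bought": "acquire", "purchase": "acquire",
}


def to_concepts(phrase):
    """Lowercase, split on whitespace, and map each word to its concept.

    Words without an entry in CONCEPTS are kept as-is.
    """
    return {CONCEPTS.get(word, word) for word in phrase.lower().split()}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest(query, phrases):
    """Return the phrase whose concept set overlaps most with the query's."""
    query_concepts = to_concepts(query)
    return max(phrases, key=lambda p: jaccard(query_concepts, to_concepts(p)))


if __name__ == "__main__":
    corpus = ["I bought a puppy", "the truck is red", "rain in spain"]
    print(nearest("purchase a dog", corpus))
```

With this toy table, "purchase a dog" and "I bought a puppy" share the concepts `acquire` and `canine` even though they have only the word "a" in common, which is the whole point of matching on concepts instead of surface forms.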
We started with making the core search technology faster. Then we added a Unicode character folding/normalization tokenizer (diacritics, accents, umlauts, bold, italic, full-width chars...). Last week we added a tokenizer that supports Chinese word segmentation. Currently, we are working on a multi-language tokenizer that segments Chinese, Japanese, and Korean without switching the tokenizer.
I hope the folding and normalization are configurable by language. I really hate it when some search decides that a and ä are the same letter. In Finnish they really aren't; "saari" is an island, "sääri" is the lower leg or shin.
Currently, you can choose between tokenizers with or without folding. But configurability per language or full customizability of the folding logic by the user is a good idea.
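For what it's worth, per-language folding could look roughly like the Python sketch below. This is not how the tokenizer discussed here is implemented; the `PRESERVE` table and the `fold` function are hypothetical names, and the set of protected Finnish letters is just an assumption for the example. It uses the standard `unicodedata` module: NFKD decomposition splits a letter from its combining marks (and maps full-width forms to plain ones), and characters on the language's keep-list are passed through untouched.

```python
import unicodedata

# Hypothetical per-language exceptions: letters that must NOT be folded
# because they are distinct letters in that language.
PRESERVE = {
    "fi": set("äöåÄÖÅ"),
}


def fold(text, lang=None):
    """Strip diacritics and compatibility variants, except protected letters.

    With lang=None (or an unknown language) everything is folded.
    """
    keep = PRESERVE.get(lang, set())
    out = []
    # Compose first so that "a" + combining diaeresis is seen as one "ä".
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFKD", ch)
        # Drop combining marks (accents, umlauts) left by the decomposition.
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


if __name__ == "__main__":
    print(fold("sääri"))        # folded without a language
    print(fold("sääri", "fi"))  # Finnish keeps ä
    print(fold("café", "fi"))   # é is not protected in this toy table
```

Folding without a language collapses "sääri" into "saari", which is exactly the collision complained about above; passing `"fi"` keeps the two words distinct while still folding accents that are not on the keep-list.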
https://github.com/ChatScript/ChatScript