| HN Mirror


	by remram 562 days ago
	What is the story for multi-language corpora? Do I have to do my own stop-word pruning, tokenizing, lemmatizing, etc.? This is usually the case with full-text search solutions, and it is a pain.

jazzyjackson 562 days ago

Re: stemming and lemmatizing, I just want to plug the most impressive NLP stack I ever used, ChatScript. It's really for building dialog trees: it walks down a branch of conversation using what are effectively switch statements, but with really rich conceptual pattern matching and capturing. So somewhere in the middle of the stack it does an excellent job of abstracting from word input to general concept (in WordNet), performing all the spell correction (according to your dictionary), stemming, lemmatization, and disambiguation.

I've had it in mind for a while to build a fuzzy search tool based on parsing each phrase into concepts, parsing the search query into concepts, and finding nearest match based on that. It's a C library and very fast.

https://github.com/ChatScript/ChatScript

Looks like it hasn't been committed to in some time; I'll have to check out their blog and see what's up. I guess with the advent of LLMs, dialog trees are passé.
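
To make the concept-matching idea a bit more concrete, here is a rough sketch in Python. It does not use ChatScript or WordNet at all: the `CONCEPTS` table, the function names, and the choice of Jaccard overlap as the "nearest match" measure are all assumptions made up for illustration.

```python
# Toy word -> concept table. Hypothetical stand-in for a real
# concept lookup (e.g. what ChatScript derives from WordNet).
CONCEPTS = {
    "dog": "animal", "puppy": "animal", "cat": "animal",
    "car": "vehicle", "truck": "vehicle",
    "buy": "acquire", "purchase": "acquire",
}

def to_concepts(phrase: str) -> set[str]:
    """Map each word to its concept; unknown words stand for themselves."""
    return {CONCEPTS.get(word, word) for word in phrase.lower().split()}

def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two concept sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def nearest(query: str, phrases: list[str]) -> str:
    """Return the indexed phrase whose concept set best overlaps the query's."""
    q = to_concepts(query)
    return max(phrases, key=lambda p: jaccard(q, to_concepts(p)))
```

With this table, `nearest("purchase a puppy", ["buy a dog", "sell a truck"])` picks "buy a dog" even though the two phrases share only the word "a" literally.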

kreyenborgi 562 days ago

Their company home page, http://brilligunderstanding.com/ wow..

wolfgarbe 562 days ago

We started with making the core search technology faster. Then we added a Unicode character folding/normalization tokenizer (diacritics, accents, umlauts, bold, italic, full-width chars...). Last week we added a tokenizer that supports Chinese word segmentation. Currently, we are working on a multi-language tokenizer that segments Chinese, Japanese and Korean without switching the tokenizer.
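
For readers unfamiliar with the term, character folding can be sketched in a few lines of Python using only the standard library. This is not the project's actual tokenizer; `fold` and `tokenize` are hypothetical names, and NFKD plus combining-mark stripping is just one common way to get this behavior.

```python
import re
import unicodedata

def fold(text: str) -> str:
    """Fold text to a plain, case-insensitive form.

    NFKD maps compatibility characters (full-width letters, mathematical
    bold/italic) to their plain equivalents and splits accented letters
    into a base letter followed by combining marks.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (accents, umlauts, ...), keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def tokenize(text: str) -> list[str]:
    """Fold first, then split into runs of word characters."""
    return re.findall(r"\w+", fold(text))
```

Under this sketch, "Müller", "MULLER" and full-width "Ｍｕｌｌｅｒ" all index as the same token, which is convenient for recall but is also exactly the kind of collapsing that can merge distinct words in some languages.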

ronjakoi 562 days ago

I hope the folding and normalization is configurable by language. I really hate it when some search decides that a and ä are the same letter. In Finnish they really aren't; "saari" is an island, "sääri" is the lower leg or shin.

wolfgarbe 562 days ago

Currently, you can choose between tokenizers with or without folding. But configurability per language or full customizability of the folding logic by the user is a good idea.
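
One way per-language configurability could look, sketched in Python: fold everything except a set of letters that the given language treats as distinct. The `PRESERVE` table, the language codes as keys, and the `fold_for` function are all hypothetical; nothing here reflects an existing option in the project.

```python
import unicodedata

# Hypothetical per-language table of letters that must survive folding
# because the language treats them as letters in their own right.
PRESERVE = {
    "fi": "äöå",
    "sv": "äöå",
}

def fold_for(text: str, lang: str) -> str:
    """Fold diacritics and compatibility forms, except letters protected for lang."""
    keep = set(PRESERVE.get(lang, ""))
    out = []
    # Casefold, then compose (NFC) so that "a" + combining diaeresis is
    # seen as the single character "ä" when checking the protected set.
    for ch in unicodedata.normalize("NFC", text.casefold()):
        if ch in keep:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFKD", ch)
            out.extend(c for c in decomposed if not unicodedata.combining(c))
    return "".join(out)
```

With the Finnish setting, "saari" and "sääri" stay distinct, while an accent that Finnish does not treat as a separate letter (as in "café") is still folded away. With an unlisted language code, everything is folded as before.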
