| Anti-spam bot plugin for messengers: - MVP version for Telegram (since spamming is a part of their business model, it feels natural to start with it). - More precisely, a data pipeline for weights and measurements based on word frequencies. Think of it as a small language model. - More precisely, it is about morphological analysis of words across different languages. Unlike Meta with its regex-based dictionaries [0], I am porting a rule-based morphological analysis Python library [1] to the target programming language. - More precisely, right now it is about understanding the DAWG (directed acyclic word graph) data structure by porting it from C++ [2] to Haskell [3]. - Instead of introducing FFI, I wanted to become more comfortable with LLMs, so I am trying to approach their internals (or my possibly wrong vision of their internals) by building a small language model based on a corpus of thousands of spam messages. Links: [0] duckling: https://hackage.haskell.org/package/duckling [1] pymorphy2: https://github.com/pymorphy2/pymorphy2 [2] dawgdic/C++: https://code.google.com/archive/p/dawgdic/ [3] dawgdic/Haskell (work-in-progress): https://github.com/swamp-agr/dawgdic |
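
To make the DAWG idea concrete, here is a minimal sketch in Python (chosen only for brevity). It is not the dawgdic memory layout or API, and not the Haskell port: every name below (`Node`, `build_trie`, `minimize`, `build_dawg`, `contains`, `count_states`) is hypothetical. It builds a plain trie and then merges identical subtrees bottom-up, which is one simple way to arrive at a minimal word graph.

```python
class Node:
    """One state: outgoing edges keyed by character, plus an end-of-word flag."""

    def __init__(self):
        self.edges = {}
        self.final = False


def build_trie(words):
    """Insert every word into an ordinary (unshared) trie."""
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root


def minimize(node, registry):
    """Replace each subtree with its canonical copy, children first."""
    for ch, child in list(node.edges.items()):
        node.edges[ch] = minimize(child, registry)
    # Two states are equivalent when they agree on the end-of-word flag
    # and on every (label, canonical child) pair.
    key = (
        node.final,
        tuple(sorted((ch, id(child)) for ch, child in node.edges.items())),
    )
    return registry.setdefault(key, node)


def build_dawg(words):
    """Trie construction followed by suffix merging."""
    return minimize(build_trie(words), {})


def contains(root, word):
    """Follow the word character by character; accept only on a final state."""
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


def count_states(root):
    """Count distinct reachable states (shared states are counted once)."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node.edges.values())
    return len(seen)
```

For the words `tap`, `taps`, `top`, `tops`, the trie has 8 states while the merged graph has 5, because the shared suffixes `p` and `ps` are stored once. That sharing is what makes the structure attractive for morphological dictionaries, where many word forms end the same way.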