I have been working on a problem most language detection libraries quietly fail at: short, messy, conversational text. The kind you see in chat apps, support tickets, SMS, and mixed-language messages.

FastLangML is my attempt to fix that. It is a multi-backend ensemble (FastText, Lingua, langdetect, pyCLD3, and others) with a voting layer built for real-world text.

It handles:

- Short messages with almost no statistical signal
- Code switching like Hinglish or Spanglish
- Slang, abbreviations, and emojis
- Multi-turn conversations where context matters
- Confusable languages like ES vs PT or NO vs DA vs SV

A few design choices:

- Context-aware detection, so you can pass conversation history and get more stable predictions
- A hinting system for slang, abbreviations, and custom rules
- Extensible backends, so you can plug in your own detectors or voting logic
- Optional persistence using Redis or disk for multi-turn conversations
- Support for more than 170 languages across the ensemble

Why I built it: most detectors are tuned for long, clean text. They break on "ok", "jaja", "mdr", "brooo", or anything with mixed languages. I needed something that works on real chat data, not idealized text.

I would love feedback from HN on:

- How you evaluate language detection quality in production
- Whether context-aware detection helps in your workflows
- Ideas for improving code switching accuracy
- Additional backends worth integrating

Repo: https://github.com/pnrajan/FastLangML

Happy to share benchmarks, architecture notes, or design tradeoffs if people are interested.
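
To make the voting and context ideas concrete, here is a minimal Python sketch of weighted voting across backends with a small bonus for languages seen earlier in the conversation. It is an illustration only, not FastLangML's actual API or code: the toy backends, `weighted_vote`, the weights, and `context_weight` are all hypothetical names and values made up for this example, and the real backends (FastText, Lingua, and so on) are replaced by stand-in functions so the snippet runs without dependencies.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional

# A backend maps text to {language_code: confidence}.
# These are toy stand-ins, NOT wrappers around real detection libraries.
Backend = Callable[[str], Dict[str, float]]


def slang_hint_backend(text: str) -> Dict[str, float]:
    """Hypothetical hint table for slang that carries a strong language signal."""
    hints = {"jaja": "es", "mdr": "fr"}
    scores: Dict[str, float] = {}
    for token in text.lower().split():
        if token in hints:
            scores[hints[token]] = 0.9
    return scores


def ambiguous_backend(text: str) -> Dict[str, float]:
    """Pretends to be a statistical model that cannot separate ES from PT."""
    return {"es": 0.5, "pt": 0.5}


def weighted_vote(
    text: str,
    backends: Dict[str, Backend],
    weights: Optional[Dict[str, float]] = None,
    history: Optional[List[str]] = None,
    context_weight: float = 0.3,
) -> Optional[str]:
    """Sum weighted backend scores, add a context bonus, return the top language."""
    weights = weights or {}
    totals: Dict[str, float] = defaultdict(float)

    # Each backend contributes its confidence, scaled by its weight (default 1.0).
    for name, backend in backends.items():
        for lang, score in backend(text).items():
            totals[lang] += weights.get(name, 1.0) * score

    # Languages detected in previous turns get a bonus of context_weight per turn.
    for lang in history or []:
        totals[lang] += context_weight

    if not totals:
        return None
    return max(totals, key=totals.get)


backends = {"hints": slang_hint_backend, "stat": ambiguous_backend}

# Slang hint breaks the ES/PT tie.
print(weighted_vote("jaja", backends))

# No signal in the text itself, so conversation history decides.
print(weighted_vote("ok", backends, history=["pt", "pt"]))
```

In this sketch the hint backend plays the role of the slang rules and the `history` argument plays the role of conversation context; a real system would likely decay older turns and calibrate backend weights on labeled chat data rather than hard-coding them.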