by sachuin23 137 days ago

I have been working on a problem most language detection libraries quietly fail at: short, messy, conversational text. The kind you see in chat apps, support tickets, SMS, and mixed-language messages.

FastLangML is my attempt to fix that.

It is a multi-backend ensemble (FastText, Lingua, langdetect, pyCLD3, and others) with a voting layer built for real-world text. It handles:

Short messages with almost no statistical signal

Code-switching like Hinglish or Spanglish

Slang, abbreviations, and emojis

Multi-turn conversations where context matters

Confusable languages like ES vs PT or NO vs DA vs SV
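To make the voting idea concrete, here is a minimal sketch of confidence-weighted voting across backends. This is not FastLangML's actual API: `Backend`, `weighted_vote`, and the stub detectors are hypothetical names for illustration, and a real backend would wrap FastText, Lingua, and so on.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical backend type: takes text, returns (language_code, confidence in [0, 1]).
Backend = Callable[[str], Tuple[str, float]]


def weighted_vote(text: str, backends: List[Tuple[Backend, float]]) -> Tuple[str, float]:
    """Combine backend predictions by summing weight * confidence per language."""
    scores: Dict[str, float] = defaultdict(float)
    total_weight = 0.0
    for detect, weight in backends:
        lang, conf = detect(text)
        scores[lang] += weight * conf
        total_weight += weight
    if not scores or total_weight == 0:
        return ("und", 0.0)  # "und" = undetermined
    best = max(scores, key=scores.get)
    # Normalize so the result stays in [0, 1] regardless of the weights.
    return best, scores[best] / total_weight


# Stub backends standing in for real detectors (illustrative values only).
def stub_a(text: str) -> Tuple[str, float]:
    return ("es", 0.6)


def stub_b(text: str) -> Tuple[str, float]:
    return ("pt", 0.5)


def stub_c(text: str) -> Tuple[str, float]:
    return ("es", 0.4)


lang, conf = weighted_vote("jaja", [(stub_a, 1.0), (stub_b, 1.0), (stub_c, 1.0)])
```

With these stubs, two weak "es" votes outweigh one "pt" vote, which is the kind of behavior you want on confusable pairs where no single backend is confident.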

A few design choices:

Context-aware detection so you can pass conversation history and get more stable predictions

A hinting system for slang, abbreviations, and custom rules

Extensible backends so you can plug in your own detectors or voting logic

Optional persistence using Redis or disk for multi-turn conversations

Support for more than 170 languages across the ensemble
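As a rough illustration of how context and hints can stabilize short messages, here is a sketch under stated assumptions. The names `HINTS` and `smooth_with_context`, the decay scheme, and the boost value are all hypothetical and not taken from the FastLangML codebase.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical hint table mapping slang tokens to a language code.
HINTS: Dict[str, str] = {"jaja": "es", "mdr": "fr"}


def smooth_with_context(
    text: str,
    current_scores: Dict[str, float],
    history: List[str],
    decay: float = 0.5,
    hint_boost: float = 1.0,
) -> Optional[str]:
    """Pick a language from per-message scores, hints, and prior turns.

    current_scores: per-language scores for this message (e.g. from an ensemble).
    history: languages detected for earlier turns, most recent last.
    """
    combined: Dict[str, float] = defaultdict(float, current_scores)

    # A matching hint adds a fixed boost for its language.
    token = text.strip().lower()
    if token in HINTS:
        combined[HINTS[token]] += hint_boost

    # Earlier turns contribute with exponentially decaying weight.
    weight = decay
    for lang in reversed(history):
        combined[lang] += weight
        weight *= decay

    if not combined:
        return None
    return max(combined, key=combined.get)


# "ok" alone leans English, but two prior Spanish turns tip it to Spanish.
no_context = smooth_with_context("ok", {"en": 0.4, "es": 0.35}, [])
with_context = smooth_with_context("ok", {"en": 0.4, "es": 0.35}, ["es", "es"])
```

The decay keeps an old turn from dominating forever, so a user who genuinely switches languages mid-conversation still gets picked up after a message or two.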

Why I built it: most detectors are tuned for long, clean text. They break on "ok", "jaja", "mdr", "brooo", or anything with mixed languages. I needed something that works on real chat data, not idealized text.

I would love feedback from HN on:

How you evaluate language detection quality in production

Whether context-aware detection helps in your workflows

Ideas for improving code-switching accuracy

Additional backends worth integrating

Repo: https://github.com/pnrajan/FastLangML

Happy to share benchmarks, architecture notes, or design tradeoffs if people are interested.