by muzani 751 days ago
"So does the CSV have multiple languages within it?"

For context, it's for Indonesia. So you do have urban areas and expats who can mostly speak English. But English isn't taught in school. Arabic, Chinese, Dutch are popular too.

Bahasa Indonesia is the formal version, but most people will be more familiar with their own regional languages. Most states have their own dialects, with at least 5 major ones. The difference is not like American vs. British English, but closer to Scots vs. English.

This is a representation of the word fruit in different dialects: https://www.facebook.com/photo/?fbid=2078468635856668

The challenge is that many people are comfortable with writing in their own dialect, even if they can read others. So we can't possibly use all the languages that could be input, but it's fine if it outputs in a somewhat different dialect. For the most part, sentence structures are similar and "bua" and "buah" are still recognizable to people with different dialects, likely to AI as well. But programmatic algorithms wouldn't handle something like "buwa" properly.
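To make the "buwa" problem concrete, here is a rough Python sketch, not what we actually run. The two-word vocabulary and the function names are made up for illustration; the only library call is `difflib.get_close_matches` from the standard library.

```python
from difflib import get_close_matches

# Hypothetical canonical vocabulary; in practice this would be built from the CSV.
CANONICAL = ["buah", "sayur"]  # "fruit", "vegetable"


def exact_lookup(word):
    """Return the word only if it is spelled exactly like a canonical entry."""
    return word if word in CANONICAL else None


def fuzzy_lookup(word, cutoff=0.6):
    """Return the closest canonical entry by difflib similarity, or None."""
    matches = get_close_matches(word.lower(), CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Compare both strategies on a few spellings.
for variant in ["buah", "bua", "buwa", "buaya"]:
    print(variant, exact_lookup(variant), fuzzy_lookup(variant))
```

Exact lookup only accepts "buah". The fuzzy version also maps "bua" and "buwa" to "buah", but at this cutoff it maps "buaya" (crocodile) to "buah" as well, which is the kind of brittleness I mean. Tightening the cutoff trades false positives for missed variants.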

To answer the question, the CSV will mostly just have informal Bahasa Indonesia. Questions asked in English sometimes fail to match, so we've added English entries for the ones that miss. More data is more effective.

"Why a CSV versus various documents in a directory (like Google Drive?)"

Could be anything really, but we tend to have a loose format for data & notes, and then have a script that strips out or combines content from other sheets into something that's cleaner to read and search. PDF, XLS, etc. are hard to keep consistent. But something like JSON, XML, or YAML is still acceptable as long as we can script it.
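As a rough idea of what that kind of script looks like, here is a minimal Python sketch using only the standard `csv` module. The `question` and `answer` column names, the function names, and the sample rows are assumptions for illustration, not our real schema.

```python
import csv
import io


def clean_rows(sheets):
    """Merge rows from several CSV sheets, trimming whitespace and
    skipping incomplete rows and repeated questions."""
    seen = set()
    merged = []
    for sheet in sheets:
        for row in csv.DictReader(sheet):
            # Missing columns come back as None, so fall back to "".
            question = (row.get("question") or "").strip()
            answer = (row.get("answer") or "").strip()
            if not question or not answer:
                continue  # drop rows that are blank or half-filled
            key = question.lower()
            if key in seen:
                continue  # keep only the first copy of a question
            seen.add(key)
            merged.append({"question": question, "answer": answer})
    return merged


def write_csv(rows, out):
    """Write the cleaned rows with a fixed two-column header."""
    writer = csv.DictWriter(out, fieldnames=["question", "answer"])
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample sheets; extra columns such as "notes" are simply ignored.
sheet_a = io.StringIO("question,answer,notes\n Jam buka? ,09:00-17:00,draft\n,,\n")
sheet_b = io.StringIO("question,answer\njam buka?,duplicate\nOpening hours?,09:00-17:00\n")

out = io.StringIO()
write_csv(clean_rows([sheet_a, sheet_b]), out)
print(out.getvalue())
```

The same cleaning step would work on JSON or YAML input as long as it gets turned into rows of dicts first, which is why those formats are fine for us and PDF is not.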

AskJack seems to be a bit low code though. We can build our own and have reasons to have more control over certain bits. So a B2B solution over the RAG part would be nice.