

	by ahstilde 252 days ago
	Why is Spanish so distributed?


ilyausorov 252 days ago

Good question! It's likely because there are lots of different Spanish accents that are distinct from each other. Our labels only capture the native language of the speaker right now, so they're all grouped together, but it's definitely on our to-do list to go deeper into the sub-accents of each language family!


bikeshaving 252 days ago

Spanish is one of those languages I would love to see as a breakdown by country. I’m sure Chilean Spanish looks very different from Catalonian Spanish.


rkomorn 252 days ago

Did you mean Catalan (which is not Spanish) or Castilian Spanish?


bikeshaving 252 days ago

Yes, the Spanish spoken in Spain, especially the variety where "gracias" and "Barcelona" come out like /ˈɡɾaθjas/ and /baɾθeˈlona/.


djmips 252 days ago

But Spanish sounds very different in Spain depending on what region of the country you are talking about.


david-gpu 251 days ago

Yeah, and not all Spaniards have a distinct pronunciation for "c" and "s". For those curious: https://en.wikipedia.org/wiki/Phonological_history_of_Spanis...


oscarfree 252 days ago

Not sure, could be the large number of Spanish dialects represented in the dataset, label noise, or something else. There may just be too much diversity in the class to fit neatly in a cluster.

Also, the training dataset is highly imbalanced and Spanish is the most common class, so the model predicts it as a sort of default when it isn't confident -- this could lead to artifacts in the reduced 3D space.
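
To make the "default class" effect concrete, here is a toy sketch. It is not the actual model discussed in this thread: the label counts, the class names other than Spanish, and the `predict` helper are all made up for illustration. It assumes a simple Bayes-style classifier that scores each class as prior times likelihood, which is only one way an imbalanced dataset can bias predictions.

```python
from collections import Counter

def predict(likelihoods, priors):
    """Return the class with the highest prior * likelihood score (hypothetical helper)."""
    return max(priors, key=lambda c: priors[c] * likelihoods[c])

# Hypothetical, imbalanced label counts: Spanish dominates the training set.
train_labels = ["spanish"] * 70 + ["french"] * 20 + ["german"] * 10
counts = Counter(train_labels)
priors = {c: n / len(train_labels) for c, n in counts.items()}

# Ambiguous clip: the evidence favors no class, so the prior decides.
ambiguous = {"spanish": 1.0, "french": 1.0, "german": 1.0}

# Clear clip: strong evidence for German outweighs the Spanish prior.
clear_german = {"spanish": 0.05, "french": 0.05, "german": 0.9}

print(predict(ambiguous, priors))
print(predict(clear_german, priors))
```

With these made-up numbers, the ambiguous clip falls back to the majority class while the clear clip does not. If many low-confidence samples get labeled Spanish this way, they could end up scattered across the reduced space rather than forming one tight cluster.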
