| HN Mirror


by rahidz 2456 days ago

"But when they included their categorically-coded country (1 = US, 2 = Canada, and so on) in their models, it was entered not as fixed effects, with dummy variables for all of the countries except one, but as a continuous measure. This treats the variable as a measure of ‘country-ness’ (for example, Canada is twice as much a country as the US) instead of providing the fixed effects they explicitly intended"

How did this not get caught immediately? If I did a study and found that kids in Zambia are 47 times as generous as American kids, that'd make me instantly suspicious.

Or maybe the reviewers were all Canadian /s

2 comments

TheCoelacanth 2456 days ago

I don't think it's quite as obvious an error as you are suggesting.

They were trying to correct for scenarios like this: Hypothetically, Canadians are twice as generous as Americans and twice as religious, but religious Canadians are equally generous as non-religious Canadians and religious Americans are equally generous as non-religious Americans. On the surface, it appears that religious people are more generous, but really it's just that Canadians are more generous.

Instead of treating the countries as discrete groupings, they treated them as points on a spectrum with each country being assigned an arbitrary place on the spectrum.

If #3 happened to be China, they would be assuming that people in China should be very similar to people in the US and Canada, because 1 vs. 3 on a scale that goes to 200 is hardly any difference at all, but really the numbers are just arbitrary identifiers.
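To make the difference concrete, here is a rough simulation sketch of the hypothetical above. Everything in it is made up for illustration (five countries, the `country_effect` values, the noise levels) and it is not the study's data or code. Generosity depends only on country, never on religiosity, but more generous countries are also more religious. The same regression is then fit twice: once with the country code as a single number, once with a dummy column per country except one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
n_countries = 5

# Hypothetical per-country generosity levels, deliberately NOT ordered by code.
country_effect = np.array([2.0, 4.0, 1.0, 5.0, 3.0])
# More generous countries are also more religious (the confound).
mean_religiosity = 0.1 * country_effect

country = rng.integers(0, n_countries, size=n)  # index 0..4
code = country + 1                              # arbitrary ID 1..5
religiosity = mean_religiosity[country] + rng.normal(0, 0.1, size=n)
# Generosity depends on country only; religiosity has no true effect.
generosity = country_effect[country] + rng.normal(0, 1.0, size=n)


def ols(X, y):
    """Ordinary least squares coefficients via numpy's lstsq."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


intercept = np.ones(n)

# Mistaken coding: the country ID enters as one continuous column.
X_continuous = np.column_stack([intercept, religiosity, code])

# Fixed effects: one 0/1 column per country, dropping country 0 as reference.
dummies = (country[:, None] == np.arange(1, n_countries)).astype(float)
X_fixed = np.column_stack([intercept, religiosity, dummies])

beta_continuous = ols(X_continuous, generosity)[1]
beta_fixed_effects = ols(X_fixed, generosity)[1]

print("religiosity coefficient, country as a number:", beta_continuous)
print("religiosity coefficient, country as dummies: ", beta_fixed_effects)
```

Under these assumptions the dummy-coded fit should put the religiosity coefficient near zero, while the continuous-coded fit should leave a clearly positive one, because a single slope on an arbitrary ID cannot absorb the country differences.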


tastygreenapple 2456 days ago

I'm a data scientist and this is an incredibly embarrassing 'n00b' error to make. If these researchers were using anything other than deep learning, it's almost certain that each parameter of the model was manually selected. That the author made the mistake is bad; that no one caught the error is a disaster.


im3w1l 2456 days ago

What likely happens is that since the country variable is basically random, you will get a model where the country has negligible effect on the prediction.
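A tiny sketch of how that plays out, using made-up per-country means (hypothetical numbers, not from the study) and equal-sized countries: the slope on the numeric code depends entirely on how the arbitrary numbering happens to line up with the real country differences.

```python
import numpy as np

# Hypothetical average outcome for five equally sized countries.
country_mean = np.array([2.0, 4.0, 1.0, 5.0, 3.0])


def slope_on_code(codes):
    """Slope of a straight-line fit of the country means on their numeric IDs."""
    slope, _intercept = np.polyfit(codes, country_mean, deg=1)
    return slope


# Numbering that happens to track the means: the code looks very predictive.
slope_sorted = slope_on_code(np.array([2, 4, 1, 5, 3]))
# Numbering that is uncorrelated with the means: the code looks irrelevant.
slope_unrelated = slope_on_code(np.array([3, 1, 4, 5, 2]))

print("slope when IDs track the means:", slope_sorted)
print("slope when IDs are unrelated:  ", slope_unrelated)
```

In the second numbering the country term contributes almost nothing, which matches the scenario described here, but it also means the country differences are not being controlled for at all.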
