Hacker News
by lmkg 2456 days ago
The variable for Country should have been treated as a categorical variable, but was instead processed as a numeric variable.

This mistake would be downright trivial to make in R. Just declare that Country is a Factor (which is the built-in type for categorical variables), and then throw the data into a library whose attitude towards errors is to coerce everything to numbers until the warnings go away.

Background: Factors in R are the idiomatic way to work with categorical data, and they work somewhat like C-style enums except the variants come from the data rather than a declaration. So if you take a column of strings in a data frame and convert it to a Factor, it will generate a mapping from each distinct value to an integer: by default the distinct values are sorted, so the first one in sort order is coded as 1, the second as 2, etc. Then it replaces the strings with their integer equivalents, and saves the mapping off to the side.
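
To make the encoding concrete, here is a minimal sketch in plain Python rather than R. The function name `encode_factor` and the sample country list are made up for illustration, and it assumes R's default behavior of sorting the distinct values before numbering them. It is not how R implements Factors internally, just the same idea in miniature.

```python
def encode_factor(values):
    """Return (codes, levels): 1-based integer codes plus the level table."""
    # Assumption: mimic R's default, where distinct values are sorted
    # before being numbered 1, 2, 3, ...
    levels = sorted(set(values))
    index = {level: i + 1 for i, level in enumerate(levels)}
    codes = [index[v] for v in values]
    return codes, levels


# Hypothetical sample data, not taken from the paper under discussion.
countries = ["Sweden", "Brazil", "Japan", "Brazil", "Sweden"]
codes, levels = encode_factor(countries)

# The mapping "saved off to the side" lets you recover the strings.
decoded = [levels[c - 1] for c in codes]

# The failure mode: code that ignores the level table sees plain integers
# and will happily do arithmetic on them.
mean_code = sum(codes) / len(codes)

print(levels)     # ['Brazil', 'Japan', 'Sweden']
print(codes)      # [3, 1, 2, 1, 3]
print(mean_code)  # 2.0
```

The "mean country" of 2.0 is a perfectly valid number and completely meaningless, which is roughly the shape of the mistake described above.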

I forget the exact rules (if there are rules, R is a bit lawless), but it's not very hard to peek under the hood at the underlying numeric representation. Many built-in operations "know" that Factors are different (e.g. regressing against a Factor will create dummy variables for each variant), but it's up to each library author how 'clever' they want to be.
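
A rough sketch of the difference between the two treatments, again in plain Python with made-up names (`factor_codes`, `dummy_columns`) and made-up data. One simplification to flag: R's default regression contrasts drop one level as a baseline, while this sketch keeps an indicator column for every level.

```python
def factor_codes(values):
    """1-based integer codes over the sorted distinct values, plus the levels."""
    levels = sorted(set(values))
    return [levels.index(v) + 1 for v in values], levels


def dummy_columns(values):
    """One 0/1 indicator column per level (what a Factor-aware routine builds)."""
    _, levels = factor_codes(values)
    return {level: [int(v == level) for v in values] for level in levels}


# Hypothetical sample column.
country = ["Sweden", "Brazil", "Japan", "Brazil"]

# What a Factor-unaware library ends up with: a single numeric column
# that implies Brazil < Japan < Sweden on some scale.
as_numbers, levels = factor_codes(country)

# What a Factor-aware routine uses instead: no ordering implied.
as_dummies = dummy_columns(country)

print(as_numbers)  # [3, 1, 2, 1]
print(as_dummies)  # {'Brazil': [0, 1, 0, 1], 'Japan': [0, 0, 1, 0], 'Sweden': [1, 0, 0, 0]}
```

Fitting a slope to `as_numbers` asks "what happens per unit of country", which only makes sense if the alphabet is a measurement scale.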


This makes the most sense to me. I don't work in the dataframes world but without this explanation it seemed like someone would have to go out of their way to make that error.
Right then...

...strong typing: for or against?

(To be fair, even strong typing won't save you if you don't use it. But fuuuuuk, what an error. I noted that paper mentally and would have quoted from it.)

Yup, I'm all for extremely strong typing. In 40 years of writing code I can't say I've ever had any real trouble with strong typing other than when dealing with libraries that reinvent wheels. Weak typing, though--nuke it from orbit.