Hacker News
by jrd79 1549 days ago
A one-dimensional affine fit (usually called a linear fit) contains two parameters: a slope and an offset. Both have error bounds, and the offset error bounds on this data would be huge. Data presentation that is not intended to deceive would have shown the vertical spread of the estimate too. But that spread would have been so wide that it would reveal that the fit is terrible and that reasonable conclusions cannot be drawn from these model fits. This is not scientific work. It is ideological policy advocacy dressed up as data science.

I think you are conflating two different things. The R^2 is incredibly poor for all the plots. That's essentially what you are complaining about. However, you can still have a very low R^2 but a statistically significant slope. While I agree that the article needs more support due to only doing univariate analysis and potentially missing huge confounders, your complaint is not valid. Here is code for basically random points that have a tight slope coefficient:

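    import numpy as np
    import statsmodels.api as sm

    n = 1000
    desired_R2 = 0.05
    mu = 0
    sigma_noise = 0.1
    # Choose Var(X) so that Var(X) / (Var(X) + Var(noise)) equals desired_R2.
    sigma = np.sqrt(sigma_noise**2 * (desired_R2 / (1 - desired_R2)))

    X = np.random.normal(mu, sigma, n)
    noise = np.random.normal(0, sigma_noise, n)
    y = X + noise  # true slope is 1, true intercept is 0

    X = sm.add_constant(X)  # add the intercept column
    model = sm.OLS(y, X).fit()
    print(model.summary())  # low R^2, yet a tight interval on the slope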
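
For anyone who wants the arithmetic behind that claim, here is a rough derivation. It assumes y = X + noise with X and the noise independent, and it uses the standard identity linking the slope t-statistic to R^2 in a one-variable regression. The numbers plug in n = 1000 and a sample R^2 near the 0.05 target, which any particular random draw will only approximate.

```latex
% Population R^2 when y = X + \varepsilon, with X and \varepsilon independent:
R^2 = \frac{\sigma^2}{\sigma^2 + \sigma_{\text{noise}}^2}
\quad\Longrightarrow\quad
\sigma^2 = \sigma_{\text{noise}}^2 \, \frac{R^2}{1 - R^2}

% Slope t-statistic in simple linear regression:
t^2 = (n - 2)\,\frac{R^2}{1 - R^2}
\quad\Longrightarrow\quad
t \approx \sqrt{998 \cdot \frac{0.05}{0.95}} \approx 7.2
```

So with a thousand points, a fit that explains about 5% of the variance can still put the slope roughly seven standard errors away from zero.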

No, I'm complaining that they plotted 95% confidence bounds for the slope only, which is visually deceptive. If they had plotted the vertical spread as well, it would have been clear to the reader that the model they chose to explain the data (a linear regression model) does not fit the data at all. That is also clear from the R^2 values, but an R^2 is a number that is difficult to interpret at a glance. So the model is not well suited to modelling the data, and all conclusions drawn from it are unfounded. The slope value and its confidence interval are essentially meaningless: the data is not described by an affine model, so it is nonsensical to talk about a slope estimate and its uncertainty.

Models must fit the data well enough to be plausible in order to be useful aids to understanding the data. These models don't come even close to that standard and should not have been used. Any data scientist worth their salt knows this. Either the authors know this and went ahead anyway, in which case they are dishonest, or they don't know it, in which case they should not be using such methods, as their incompetence is made plain for the world to see.
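
To make the distinction concrete, here is a minimal sketch of the two bands being argued about: the confidence band for the fitted line and the much wider prediction band for individual points (the "vertical spread"). Everything in it is an assumption for illustration: the data is synthetic, not the article's, the name `half_widths` is made up, and a normal quantile stands in for the t quantile, which is a reasonable approximation at n = 1000. It uses only the standard library.

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic data (an assumption, not the article's data): true slope 1, heavy noise.
n = 1000
xs = [random.gauss(0.0, 0.023) for _ in range(n)]
ys = [x + random.gauss(0.0, 0.1) for x in xs]

# Ordinary least squares for y = a + b*x.
x_bar, y_bar = fmean(xs), fmean(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b = sxy / sxx
a = y_bar - b * x_bar

# Residual standard deviation with n - 2 degrees of freedom.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))

# Normal approximation to the t quantile; reasonable for n = 1000.
z = NormalDist().inv_cdf(0.975)


def half_widths(x0):
    """Return (confidence, prediction) half-widths of the 95% bands at x0."""
    leverage = 1.0 / n + (x0 - x_bar) ** 2 / sxx
    return z * s * math.sqrt(leverage), z * s * math.sqrt(1.0 + leverage)


ci_half, pi_half = half_widths(x_bar)
print(f"band for the fitted line: +/- {ci_half:.4f}")
print(f"band for new points:      +/- {pi_half:.4f}")
```

At the mean of x the two half-widths differ by a factor of sqrt(n + 1), so with a thousand points the band for individual observations is about thirty times wider than the band drawn around the line.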
But you can literally see the spread of the data points. I think you are complaining about nothing and maybe not understanding the article. They are showing plots with low R^2 and claiming these variables in a vacuum do not predict homelessness (which is in agreement with what you are saying). Then they present two plots with much higher (but still low) R^2 values and make the innocuous claim that median rent is the single greatest predictor that they found. Of course, this analysis is very simple, and any model trying to explain homelessness should contain many variables, including things like the climate of the city. But to complain that the confidence interval of the slope is uninterpretable is silly. Any data scientist worth their salt understands this is simply a visual representation of the confidence interval produced by the regression.
You are avoiding the question of whether it is appropriate to present the results of a linear regression on data that is so poorly explained by a linear relationship.

Random-looking balls of data points don't have slopes. It is invalid to perform a linear fit on data that does not derive in large part from a linear generative process. And presenting a fit from a model that is facially absurd to apply is bad data science. Whether or not an informed reader would discount the absurd model fit is not material to whether it is appropriate to present such a fit.

They could have binned the data and plotted percentile bands. They could have used a non-parametric density estimator. There are lots of things they could have done to summarize the data and make some sense of the ball of points. But linear regression with slope error bars is not an appropriate choice. That linear fits are easy to compute, and that this one helped them make their point, is not justification.
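
A rough sketch of the binning idea, for concreteness. The data is a synthetic stand-in (an assumption, not the article's data), and the function name `percentile_bands`, the ten equal-count bins, and the 5%/50%/95% levels are all illustrative choices. It uses only the standard library.

```python
import random
from statistics import quantiles

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-in data (assumption): weak linear signal buried in noise.
n = 1000
xs = [random.gauss(0.0, 0.023) for _ in range(n)]
points = sorted((x, x + random.gauss(0.0, 0.1)) for x in xs)  # sorted by x


def percentile_bands(points, n_bins=10):
    """Split x-sorted points into equal-count bins and summarize y in each bin."""
    size = len(points) // n_bins
    bands = []
    for i in range(n_bins):
        # The last bin absorbs any leftover points.
        end = (i + 1) * size if i < n_bins - 1 else len(points)
        chunk = points[i * size:end]
        bin_ys = [y for _, y in chunk]
        cuts = quantiles(bin_ys, n=20)  # 19 cut points: 5%, 10%, ..., 95%
        x_mid = (chunk[0][0] + chunk[-1][0]) / 2  # midpoint of the bin's x range
        bands.append((x_mid, cuts[0], cuts[9], cuts[18]))
    return bands


bands = percentile_bands(points)
for x_mid, p5, p50, p95 in bands:
    print(f"x ~ {x_mid:+.3f}: 5% {p5:+.3f}  median {p50:+.3f}  95% {p95:+.3f}")
```

Plotting the three columns against `x_mid` shows how the center and the spread of y move with x, without committing to a straight line.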

> linear regression on data that is so poorly explained by a linear relationship.

That is exactly what they are saying. This is from TFA:

> The graphics above demonstrate that variation in rates of homelessness cannot be explained by variation in rates of individual factors such as poverty and mental illness.

They are, in my words, saying "Look at this plot, the x-axis has no bearing on the y-axis. To give you a sense of how bad it is, we fit a line to it and it is exactly 0 useful." I don't know why you are focusing so hard on the plots without reading their words. You are in agreement with TFA. Now, for the plot with R^2 of 0.55, that clearly has some positive relationship to it.

As for your last paragraph, I disagree 100%. They are trying to find an explanatory variable, not "summarize the data". By showing all the points, it is evident there is no relationship. As you have repeatedly pointed out, the plot achieved its goal. In my opinion, the line is a nice touch that lets statisticians confirm the scaling of the axes is not playing tricks.