by mattcaywood 3758 days ago
Data scientists blog about Caltrain data, come up with convoluted hypothesis about bias in sensors at two stations.

Commenter on blog notices that Caltrain is occasionally single-tracking between those stations due to a bridge replacement. [1]

"Data science" ends up with a bloody neck from Occam's razor.

[1] http://www.caltrain.com/projectsplans/Projects/Caltrain_Capi...

4 comments

There are a couple of reasons that this isn't the explanation.

To pick one, the random data selection in the blog post showed data from October 2015 -- Feb 2016, and this Caltrain link appears to show the bridge work starting Feb 26th, 2016.

So, no, the just-so story doesn't appear to be just-so.

The data is very much consistent with the bridge work hypothesis. The website indicates a series of bridges is being replaced, starting 9/28/15 with Tilton Ave. and proceeding northward to Monte Diablo Ave. and then Santa Inez Ave.
The data was sound. They formed a hypothesis. They got a better explanation. Now they have a better hypothesis for the data anomaly. I don't think it's fair to fault them or to discredit the effort.
I'm criticizing the approach of starting the analysis in a vacuum and coming up with a hypothesis that fits the abstract pattern, when simple domain knowledge (from looking at a website, or a Caltrain station notice board) would have put them on the "right track" from the start.

What the authors did falls into the trap of a very stereotypical criticism of data science and doesn't do data science any favors.

Single-tracking should have a symmetric effect, i.e. the delay distributions for northbound and southbound trains through the track segment containing the bridge repair should both be positive. Instead, the southbound (and to a lesser extent the northbound) trains traveling through the segment actually have "negative delay"--completely inconsistent with the alternative hypothesis. You don't need data science to show that single-tracking doesn't explain the observed data--just common sense.
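
To make the symmetry argument concrete, here is a minimal sketch of the check. The delay samples are made up for illustration (they are not real Caltrain data), and the function names are hypothetical. It only tests whether the mean delay is positive in both directions, which is what single-tracking would predict.

```python
from statistics import mean

# Hypothetical delay samples in minutes at the affected segment.
# These numbers are invented for illustration, NOT real Caltrain data.
samples = [
    ("northbound", 1.5), ("northbound", -0.5), ("northbound", -1.5),
    ("southbound", -2.0), ("southbound", -3.0), ("southbound", -1.0),
]

def mean_delay_by_direction(rows):
    """Group delays by direction and return the mean delay for each."""
    by_dir = {}
    for direction, delay in rows:
        by_dir.setdefault(direction, []).append(delay)
    return {d: mean(v) for d, v in by_dir.items()}

def consistent_with_single_tracking(rows):
    """Single-tracking should slow both directions, so every mean should be > 0."""
    return all(m > 0 for m in mean_delay_by_direction(rows).values())

print(mean_delay_by_direction(samples))
print(consistent_with_single_tracking(samples))
```

With negative means like the ones in these made-up samples, the check fails. A real test would look at the full distributions rather than just the means.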
'"Data science" ends up with a bloody neck from Occam's razor.'

Heh, I like the imagery. I wouldn't really call it Data Science with capital letters either. It's just plain old exploratory analysis.

I wouldn't say the hypothesis is convoluted, especially since we see bizarre data coming out of the Caltrain API at times, and experience has taught us to distrust our instruments when observing strong systematic bias. But single-tracking due to construction IS obviously a better explanation... because it's probably true.

What is still confusing is that the train is delayed several minutes, but then recovers at the very next station. It is odd for a train to 'make up' that much time so quickly. Typical behavior is just to keep getting later.

Or maybe Occam will strike again in the comments with an obviously correct explanation!

Train companies often put "buffer time" in schedules to make up for predictable delays like this one. That could show up as a train being late at one station then suddenly on time at the next. Could that be it?
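
A rough sketch of how buffer time could produce that pattern, under stated assumptions: the scheduled run time between two stations includes some padding over the fastest possible run, a late train runs at the fastest time, and a train never gets ahead of schedule. The function name and all the numbers are hypothetical, not taken from any Caltrain timetable.

```python
def delay_at_next_station(delay_now, scheduled_run, fastest_run):
    """Delay (minutes) on arrival at the next station.

    Assumes a late train runs the segment in `fastest_run` minutes and so
    absorbs up to (scheduled_run - fastest_run) minutes of its delay, and
    that it never arrives ahead of schedule.
    """
    buffer = scheduled_run - fastest_run
    return max(delay_now - buffer, 0.0)

# Illustrative numbers only: 4 minutes late, 7 minutes scheduled, 3 minutes fastest.
print(delay_at_next_station(4.0, 7.0, 3.0))  # buffer of 4 absorbs the whole delay
print(delay_at_next_station(4.0, 5.0, 3.0))  # buffer of 2 leaves part of the delay
```

If the padding sits right after the construction segment, a train could look several minutes late at one station and on time at the next, without the sensors being wrong.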