by highd 3468 days ago
Timeseries can 100% work like that. If you expect your timeseries data to come from a similar distribution, that is how you train your LSTM. It's not just a magic box - you have to train it to encode useful features in its gates.

Sure, you can prefer to use subsets of a single time series instead of multiple time series. The issue remains that it doesn't matter what your performance on training data is. You still need to partition your dataset into training and test data - otherwise you could just be storing a lookup table for all you know. It looks like the author has trained on the entirety of the dataset, and then is just considering that performance...

Let me put this another way. Try this with a random walk. Train on your entire timeseries - every length 50 window. Say that there are only 8 unique values at each timestep. That means that there are 8^50 possible input sequences into the neural network. A sufficiently complex neural network can fit an arbitrary function, and a couple thousand windows only pin down the output on a couple thousand of those 8^50 inputs, so there is an astronomical number of functions that predict the training outputs exactly - and this is on noise! And in all likelihood the neural net will learn that noise: https://arxiv.org/abs/1611.03530 Without comparing training and test results there's no way to know that the neural network learned anything of value - it can get perfect accuracy on training data that's pure noise!
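
To make the lookup-table point concrete, here is a minimal sketch. It swaps a literal dictionary in for the neural network (an assumption for illustration, not what the article's author did), and the walk length, the 8 levels, the window size of 50, and the 80/20 chronological split are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random walk confined to 8 levels (0..7): each step moves -1, 0, or +1.
n_steps, window = 3000, 50
steps = rng.integers(-1, 2, size=n_steps)
walk = np.zeros(n_steps, dtype=int)
for t in range(1, n_steps):
    walk[t] = np.clip(walk[t - 1] + steps[t], 0, 7)

# Every length-50 window, paired with the value that follows it.
X = np.array([walk[i:i + window] for i in range(n_steps - window)])
y = walk[window:]

# Chronological split: first 80% of windows for training, the rest held out.
split = int(0.8 * len(X))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# "Model": a lookup table keyed on the exact window (illustrative stand-in).
table = {tuple(x): int(target) for x, target in zip(X_train, y_train)}
fallback = int(np.bincount(y_train).argmax())  # most common training target

def predict(x):
    # Return the memorized answer, or the fallback for an unseen window.
    return table.get(tuple(x), fallback)

train_acc = float(np.mean([predict(x) == t for x, t in zip(X_train, y_train)]))
test_acc = float(np.mean([predict(x) == t for x, t in zip(X_test, y_test)]))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The table should score perfectly on the windows it has stored and fall back to guessing on held-out windows. You only see that gap if you score a test set.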

This stuff is really critical to get right if you're doing machine learning.


I was interpreting the parent comment to mean two different subjects. For example, I can't train on weather data from Paris, France and then expect it to be able to predict tomorrow's weather in Portland, Oregon. Am I wrong on that?
You may be able to do that. Which one you do is sort of a matter of preference. If the two datasets share more structure, then it's more advantageous to share the network. There's also a bunch of hybrid approaches, e.g. pretraining on every city and then fine-tuning each independently.
When doing train/test splits in a time series context like this, would forward chain backtesting (train on steps 1:n, predict n+x) be enough validation, or would you advocate for further sampling of the training set?
If you're comparing deltas (i.e. x_{n+x} - x_{n+x-1}) that might be sufficient - otherwise it's hard to tell if you're just capturing that x_{n+1} is close to x_{n}. The primary risk would be that you're putting strong structure on the datasets you're testing with, so you could be misled. I.e., what if you have:

  y = sin(t)   if 0 < t < 100Pi
    = sin(2t)  if 100Pi < t < 200Pi
    = sin(3t)  if 200Pi < t < 300Pi
Then you could imagine that by only backtesting the model on the data right after where you're training you could run into issues - each train iteration might fix a constant frequency in the network, and then it looks like it works great over each iteration, but you've never learned how to determine the frequency on-the-fly. If that happens, backtesting on random samples from the dataset would show that the model only fits about 1/3 of the test set.
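
Here's a rough sketch of that failure mode on the piecewise sine. The "model" is a hypothetical stand-in that can only pick one constant frequency out of {1, 2, 3}. The grid of 30000 points, the block size of 1000, and the 80/20 random split are also assumptions chosen for the example.

```python
import numpy as np

def signal(t):
    # Piecewise example from above: frequency 1, 2, 3 on successive thirds.
    freq = np.where(t < 100 * np.pi, 1, np.where(t < 200 * np.pi, 2, 3))
    return np.sin(freq * t)

def fit_constant_frequency(t, y, candidates=(1, 2, 3)):
    # Hypothetical stand-in for a network that locks onto one frequency:
    # pick the single candidate with the lowest mean squared error.
    errors = [np.mean((np.sin(f * t) - y) ** 2) for f in candidates]
    return candidates[int(np.argmin(errors))]

t = np.linspace(0, 300 * np.pi, 30000, endpoint=False)
y = signal(t)

# Forward chaining: refit on one block, score on the block right after it.
block = 1000
forward_errors = []
for start in range(0, len(t) - 2 * block + 1, block):
    train = slice(start, start + block)
    test = slice(start + block, start + 2 * block)
    f = fit_constant_frequency(t[train], y[train])
    forward_errors.append(np.mean((np.sin(f * t[test]) - y[test]) ** 2))
good_folds = sum(e < 1e-6 for e in forward_errors)

# Random holdout: fit one model on 80% of the points, score the other 20%.
rng = np.random.default_rng(0)
idx = rng.permutation(len(t))
train_idx, test_idx = idx[:24000], idx[24000:]
f_all = fit_constant_frequency(t[train_idx], y[train_idx])
abs_err = np.abs(np.sin(f_all * t[test_idx]) - y[test_idx])
fraction_fit = float(np.mean(abs_err < 1e-6))

print(f"forward chaining: {good_folds}/{len(forward_errors)} folds near zero error")
print(f"random holdout: {fraction_fit:.0%} of test points fit")
```

Nearly every forward-chained fold should look clean, because each refit gets to lock onto the local frequency. Only the folds straddling a regime change fail. The random holdout forces a single model to face all three regimes at once, and that's where the problem shows up.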

The gold standard is always a well-partitioned dataset. And if you're going to hold a meeting describing your results, or deploy a product, it's really important that the results stand up to these sorts of questions.
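
On the deltas point above, a quick sketch of why scoring on levels can flatter a model. It uses a persistence "model" (predict x_{n+1} = x_n) on a Gaussian random walk. Both are assumptions picked for illustration, and R^2 here is just the usual 1 - SS_res/SS_tot.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=5000))  # Gaussian random walk: pure noise steps

# Persistence "model": predict that the next value equals the current one.
pred = x[:-1]
actual = x[1:]

def r2(pred, actual):
    # Coefficient of determination relative to predicting the mean.
    ss_res = np.sum((actual - pred) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Scored on levels, the do-nothing model looks excellent.
level_r2 = r2(pred, actual)

# The same predictions expressed as deltas from the current value.
pred_delta = pred - x[:-1]      # all zeros: the model predicts no change
actual_delta = actual - x[:-1]
delta_r2 = r2(pred_delta, actual_delta)

print(f"R^2 on levels: {level_r2:.3f}")
print(f"R^2 on deltas: {delta_r2:.3f}")
```

The model knows nothing about the next step, but it still scores well on levels because x_{n+1} is close to x_{n}. Scoring the deltas takes that free credit away.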