Hacker News new | ask | show | jobs
by daemonk 4085 days ago
So the heart of the problem, from what I can gather, is the sliding window approach to generating subsequences. This approach necessarily generates "redundantly similar" windows, ie. windows will be most similar to their closest neighboring windows since there is an overlapping length of window - 1

The sliding window approach is not adding random noise to the clustering, it is adding redundantly similar data to the clustering resulting in almost perfect sine-wave-like cluster centers.

1 comments

update to my original post

I am coming from a genomics background, so when I read about sliding window on time series data, I think about sliding window approach on regions of genome sequences.

The ultimate point of doing this clustering is to find repetitive sub-sequences in the series. This is also something we do commonly in the genomics field (repeatmasker/repeatmodeler software for example).

Something we can look at in genomes is a "k-mer coverage". A k-mer is just a k-lengthed sub-sequence (A,T,G,C), analogous to a k-length sub-sequence of a time series.

By scanning the genome, we can tally up how many times a k-mer appears in the genome. With this tally, we can determine a k-mer coverage for a region on the genome. This gives us an idea how many times we see this same region on the rest of the genome.

Maybe this approach can be adapted somehow? The only problem is that in genomics, we have four discrete classes (A,G,T,C), making finding exact matching relatively easy. In time-series, we have continuous data, making finding matches a lot tougher to do and define.

Disclaimer: I am not defending or denying the claims in the paper. I have a philosophical interest in time.[1]

The difference I think is that the series in a genome is lexicographic rather than temporal: i.e. which end of a particular strand you read from is irrelevant. On the other hand, a time series has an independent ordering. By analogy, a time series is a directed graph a genome is undirected.

That is the algorithm for finding a subsequence in a time series can be used for finding a subsequence in a genome. But the meaning of a subsequence is different. It's:

  A - G - A - C
versus

  C-> A -> G -> A.
Time series data contains implication and causality. Any claim about time series data is implicitly a claim about causality: e.g. we might claim a time series appears to be random. We don't really talk about a genome's amino acid sequence being random because the causality (other than the trivial case of analogues) lies outside the sequence - the reason for

  T - A - C
is assumed to lie outside of T - A - C.

[1]: http://plato.stanford.edu/entries/kant-hume-causality/#TimDe...

This is pretty much the main idea behind another paper that Eamonn wrote [0]. The idea is that you can normalize and discretize timeseries data into a lecicographical representation, then perform all sorts of interests analyses on it. For example, searching for common subsequences can be done via regular expressions.

[0] http://www.cs.ucr.edu/~eamonn/iSAX.pdf