by daemonk
4085 days ago
So the heart of the problem, from what I can gather, is the sliding window approach to generating subsequences. This approach necessarily generates "redundantly similar" windows, i.e. each window will be most similar to its nearest neighboring windows, since consecutive windows overlap by (window length - 1) points. The sliding window approach is not adding random noise to the clustering; it is adding redundantly similar data, which results in almost perfect sine-wave-like cluster centers.
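To make the overlap concrete, here is a minimal sketch in Python. The series values, the window length `w`, and the helper name `sliding_windows` are made up for illustration. It only shows that each window shares all but one point with the next one; it does not run any clustering.

```python
def sliding_windows(series, w):
    """Return every length-w window of series, advancing one point at a time."""
    return [series[i:i + w] for i in range(len(series) - w + 1)]

# Hypothetical toy series and window length, chosen only for illustration.
series = [0, 3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
w = 4
windows = sliding_windows(series, w)

# Consecutive windows share w - 1 points: drop the first point of one window
# and the last point of the next, and the two remainders are identical.
overlaps = [
    windows[i][1:] == windows[i + 1][:-1]
    for i in range(len(windows) - 1)
]

print(len(windows))   # number of windows: len(series) - w + 1
print(all(overlaps))  # every adjacent pair overlaps in w - 1 points
```

Every point in the interior of the series ends up in `w` different windows, which is the redundancy being fed into the clustering.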
I am coming from a genomics background, so when I read about sliding windows on time series data, I think of the sliding window approach applied to regions of genome sequences.
The ultimate point of doing this clustering is to find repetitive sub-sequences in the series. This is also something we commonly do in genomics (the RepeatMasker/RepeatModeler software, for example).
Something we can look at in genomes is "k-mer coverage". A k-mer is just a sub-sequence of length k over the alphabet (A,T,G,C), analogous to a length-k sub-sequence of a time series.
By scanning the genome, we can tally how many times each k-mer appears in it. With this tally, we can compute a k-mer coverage for any region of the genome, which gives us an idea of how many times that same region shows up elsewhere in the genome.
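A rough sketch of that tally in Python follows. The toy sequence and the function names are made up, and "coverage" is taken here to mean the average genome-wide count of the k-mers inside a region, which is only one of several possible definitions.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Tally every length-k sub-sequence of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def region_kmer_coverage(seq, start, end, k):
    """Mean genome-wide count of the k-mers lying fully inside seq[start:end].

    Assumption: coverage is defined as this mean; other definitions exist.
    """
    counts = kmer_counts(seq, k)
    kmers = [seq[i:i + k] for i in range(start, end - k + 1)]
    if not kmers:
        raise ValueError("region is shorter than k")
    return sum(counts[kmer] for kmer in kmers) / len(kmers)

# Hypothetical toy "genome" in which ACGT is repeated.
genome = "ACGTACGTTTGACGTA"
counts = kmer_counts(genome, 4)

print(counts["ACGT"])                          # 3
print(region_kmer_coverage(genome, 0, 5, 4))   # 2.5 (repeated region)
print(region_kmer_coverage(genome, 5, 10, 4))  # 1.0 (unique region)
```

Regions with coverage well above 1 are the candidates for repeats. This relies on exact string equality between k-mers, which is what makes it cheap for a four-letter alphabet.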
Maybe this approach can be adapted somehow? The only problem is that in genomics we have four discrete classes (A,G,T,C), which makes finding exact matches relatively easy. In time series we have continuous data, which makes matches much harder both to find and to define.