HN Mirror


by humanarity 4085 days ago

"The cluster centers found by STS clustering on any particular run of k-means on stock market dataset are not significantly more similar to each other than they are to cluster centers taken from random walk data! In other words, if we were asked to perform clustering on a particular stock market dataset, we could reuse an old clustering obtained from random walk data, and no one could tell the difference."

"As the sliding window passes by, the datapoint first appears as the rightmost value in the window, then it goes on to appear exactly once in every possible location within the sliding window. So the t_i datapoint contribution to the overall shape is the same everywhere..."

"Another way to look at it is that every value v_i in the mean vector, 1 ≤ i ≤ w, is computed by averaging essentially every value in the original time series; more precisely, from t_i to t_(m-w+i). So for a time series of m = 1024 and w = 32, the first value in the mean vector is the average of t[1..993]; the second value is the average of t[2..994], and so forth. Again, the only datapoints not being included in every computation are the ones at the very beginning and at the very end, and their effects are negligible asymptotically."
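
To make the arithmetic in that last quote concrete, here is a minimal sketch in plain Python. It assumes an unnormalized random walk as the input series (the seed, the step distribution, and the helper names `sliding_windows` and `mean_vector` are my own choices for illustration, not from the paper), and it uses 0-based indices, so the quote's t[1..993] is `t[0:993]` here.

```python
import random

def sliding_windows(t, w):
    """All contiguous subsequences of length w, stepping one sample at a time."""
    return [t[j:j + w] for j in range(len(t) - w + 1)]

def mean_vector(windows):
    """Position-wise average over all windows (the mean of the subsequences)."""
    n = len(windows)
    w = len(windows[0])
    return [sum(win[i] for win in windows) / n for i in range(w)]

# Assumed input: a random walk of length m = 1024 (illustrative only).
random.seed(0)
m, w = 1024, 32
t = [0.0]
for _ in range(m - 1):
    t.append(t[-1] + random.gauss(0, 1))

windows = sliding_windows(t, w)
v = mean_vector(windows)

# There are m - w + 1 = 993 windows, so position i averages t[i .. i + 992].
n = m - w + 1
direct = [sum(t[i:i + n]) / n for i in range(w)]
max_diff = max(abs(a - b) for a, b in zip(v, direct))

print("number of windows:", len(windows))
print("max |v_i - mean(t[i:i+n])|:", max_diff)
print("spread of mean vector:", max(v) - min(v))
print("range of the series:  ", max(t) - min(t))
```

Neighbouring entries of the mean vector differ only by `(t[i + n] - t[i]) / n`, i.e. one sample swapped in and one swapped out of a 993-sample average, which is why the mean vector comes out nearly flat compared with the series itself.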

1 comments

mathgenius 4085 days ago

I wonder if the same effect applies to an exponential sliding window, i.e., one that weights more "recent" samples higher than older samples instead of using a hard cutoff, so that it is effectively an infinite window.


humanarity 4085 days ago

Hmm, good question. I'd say this would simply distribute the contribution of each window member along an exponential (rather than constant) trend. The terms in the mean vector would then be polynomials of the window members. Even if you lined up the t_(x^n) weight with t_(x^m), it seems to me the other t_i filtered at exponential distances wouldn't line up. You could probably arrange it to split on these nonidentical term weights, and maybe this could work as a (somewhat involved) method of feature selection that doesn't degenerate to identical contributions. Interesting idea! And maybe there are a few papers about it somewhere :)
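
A rough sketch of one way to read the "exponential rather than constant" contribution, under assumptions of my own: the window is still truncated at width w (not truly infinite), position i gets the hypothetical weight alpha ** (w - 1 - i) so the rightmost sample counts most, and the input is noise around a fixed level rather than real market data. Under that reading, each entry of the mean vector is just its weight times a near-constant average of the series.

```python
import random

def weighted_mean_vector(t, w, alpha):
    """Mean of exponentially weighted sliding windows.

    Hypothetical weighting (an assumption, not from the thread): position i
    gets alpha ** (w - 1 - i), so the most recent sample has weight 1.
    """
    weights = [alpha ** (w - 1 - i) for i in range(w)]
    n = len(t) - w + 1
    sums = [0.0] * w
    for j in range(n):              # j indexes the window start
        for i in range(w):          # i indexes the position inside the window
            sums[i] += weights[i] * t[j + i]
    return [s / n for s in sums], weights

# Assumed input: Gaussian noise around a level of 5.0 (illustrative only).
random.seed(1)
m, w, alpha = 1024, 32, 0.9
t = [random.gauss(5.0, 1.0) for _ in range(m)]

v, weights = weighted_mean_vector(t, w, alpha)

# Dividing out the weights recovers the plain sliding-window averages,
# which are nearly identical across positions.
ratios = [v[i] / weights[i] for i in range(w)]
print("min ratio:", min(ratios))
print("max ratio:", max(ratios))
```

So in this truncated version the mean vector takes the shape of the weighting profile itself; whether a clustering on such windows avoids the degeneracy described in the quotes is a separate question this sketch does not answer.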
