This particular problem pops up in quite a few domains. (We often refer to it as "phase unwrapping" in the geosciences.) The approach here is a good one so long as your noise doesn't result in lots of mistaken "wrap-arounds".
However, it will fail badly in the presence of noise in many cases. It's particularly problematic when the slope changes (e.g. a polynomial) or where the slope is high and the noise is high. (Note that polynomials are still linear in the sense mentioned here: linear regression).
At any rate, this is definitely a nice write-up, but a bit more discussion of where the approach breaks down would be useful. It's actually a classic example of an elegant solution that breaks down frequently in practice (i.e. it's commonly used as a teaching example in various courses). A better solution is usually more complex, domain specific, and therefore out-of-scope, but failure modes for this method make for a nice set of examples.
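To make the "mistaken wrap-around" failure concrete, here is a minimal sketch. It is not the method from the post: it uses NumPy's `np.unwrap` as a stand-in for simple difference-based unwrapping, and the slope (3.0 rad per sample), the perturbed index `k`, and the 0.2 rad bump are all made-up values chosen so that one step crosses pi.

```python
import numpy as np

# Noise-free "truth": a line whose phase advances 3.0 rad per sample,
# i.e. just under the pi-per-sample limit that unwrapping can follow.
n = np.arange(20)
true_phase = 3.0 * n

def wrap(phase):
    """Wrap phase into (-pi, pi], as a sensor that only sees angles would."""
    return np.angle(np.exp(1j * phase))

# With a steep but clean slope, unwrapping recovers the line.
clean = np.unwrap(wrap(true_phase))

# Perturb a single sample by +0.2 rad (assumed value): the step into it
# becomes 3.2 > pi, so the unwrapper reads it as a step of 3.2 - 2*pi.
k = 10
noisy_phase = true_phase.copy()
noisy_phase[k] += 0.2
noisy = np.unwrap(wrap(noisy_phase))

# Zero before k, then stuck at -2*pi for every later sample.
error = noisy - noisy_phase
```

The point is that the error is not local: one bad sample shifts everything after it by a full cycle, which is why a high slope combined with noise is so damaging to any fit done on the unwrapped values.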
You’re completely right. I investigated this as a means for solving the phase-unwrapping problem I was working on, and while it worked relatively well, a more domain specific solution was eventually used.
I purposely stayed away from mentioning phase unwrapping as I was trying to make this as accessible as possible without overloading the reader with jargon. My goal was more to show how problem transformation (like the frequency domain) can sometimes make hard problems far simpler (I was also just playing with data visualization). Looking at it now though, I probably should have added some external resources to the conclusion for people who have the background. It definitely wouldn’t have made the piece less readable, and could have added a bit more value.
Also, and this is a major oversight on my part - I was specifically looking at fitting data generated by an affine function, not “linearly fitting data”. How I titled this is definitely confusing.
Part of what interested me in writing about this though is how the discontinuities changed a trivial problem into something a bit trickier. If the data could be generated by more complex functions, then I would have forgone looking for an easy solution (as an aside, the problem I was working on had sharp timing and hardware constraints which kept me from using a more general solution).
Nice work, by the way! I didn't mean to be overly critical or "nit-picky".
The visualization parts of the post are _extremely_ well done, and it's a very clear example of how to apply mathematical tricks to practical problem solving. I feel like things like this should be in most AI practitioners' toolboxes, and they're too often forgotten.
I agree completely on the avoiding jargon point. Jargon is a real hindrance to learning something for the first time, and it's definitely best to minimize it whenever possible. (It is very useful when you're trying to find relevant papers though... I always find it fascinating/frustrating what different fields call the same thing...)
(tl;dr: fit a sine and a cosine by regression, much as you would fit a line. Think of it as solving for a free frequency and a free phase rather than for a free bias and weight.)
1. Convert the hours to an angle in degrees or in radians (a simple linear transformation).
2. Take the cos and sin of the angle to get the x and y position in a plane, respectively.
3. Introduce a time axis so that the points trace a helix (like DNA) rather than a circle.
4. So we now have a ton of 3D data points: (time, x, y). Create an ML model that fits a sine and a cosine to those data points as closely as possible. Your model has only 2 free parameters to optimize: a shared phase offset and a shared frequency. The sine uses (time, y) and the cosine (time, x).
5. Initialize the model with a random phase offset and a frequency ideally already close to the one you think you have. Don't start from too high a frequency, or you risk fitting only noise near the Nyquist frequency.
6. Optimize! (With least squares.) I guess that you might converge only to a local minimum, in which case you'd need to try different random starting frequencies.
7. The answer to your problem is the now-optimized free parameter of the frequency. It won't sit between two bins of your fft anymore.
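Here is one possible reading of the steps above as code, on synthetic data. Everything named here is hypothetical: `fit_helix`, the 24-hour clock, the true frequency 2.3 rad per time unit, the noise level, and the grid of starting frequencies are assumptions for illustration. I picked plain Gauss-Newton as the least-squares optimizer; any other least-squares solver should do.

```python
import numpy as np

def fit_helix(t, x, y, start_freqs, n_iter=50, seed=0):
    """Fit x ~ cos(w*t + p), y ~ sin(w*t + p) by Gauss-Newton, multi-start."""
    rng = np.random.default_rng(seed)
    # For this model J^T J is constant, because sin^2 + cos^2 = 1.
    jtj = np.array([[np.sum(t * t), np.sum(t)], [np.sum(t), t.size]])
    best = None
    for w in start_freqs:                      # step 6: several starting frequencies
        p = rng.uniform(0.0, 2.0 * np.pi)      # step 5: random phase offset
        for _ in range(n_iter):
            theta = w * t + p
            g = x * np.sin(theta) - y * np.cos(theta)   # per-point J^T r term
            dw, dp = np.linalg.solve(jtj, -np.array([np.sum(t * g), np.sum(g)]))
            w, p = w + dw, p + dp
        theta = w * t + p
        cost = np.sum((np.cos(theta) - x) ** 2 + (np.sin(theta) - y) ** 2)
        if best is None or cost < best[0]:     # keep the lowest-cost run
            best = (cost, w, p % (2.0 * np.pi))
    return best

# Synthetic wrapped data (all values assumed): a clock reading in hours
# that advances at a constant rate and wraps around every 24 hours.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 200)
true_w, true_p = 2.3, 0.7
hours = ((true_w * t + true_p) % (2.0 * np.pi)) * 24.0 / (2.0 * np.pi)

angle = 2.0 * np.pi * hours / 24.0                   # step 1: hours -> radians
x = np.cos(angle) + 0.05 * rng.normal(size=t.size)   # step 2: position on the circle,
y = np.sin(angle) + 0.05 * rng.normal(size=t.size)   # plus a little noise

# Steps 3-7: (t, x, y) is the helix; the fitted frequency is the answer.
cost, w_hat, p_hat = fit_helix(t, x, y, np.arange(0.5, 5.0, 0.1))
```

The multi-start loop matters here: a start far from the true frequency settles in a poor local minimum, so the sketch runs every start and keeps the one with the lowest cost.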
Note: This link contains images picturing the transformations I try to explain.
Disclaimer: I didn't do that yet, this is just off the top of my head. If I said something wrong, please comment. Mostly about a wrong convergence to Nyquist freq or something like that (?).
In the end, this way, you won't have discrete fft bins. You'll approach the problem orthogonally to that: you solve for finding the one best fft bin (frequency) directly.
In other words: solve for the content in the exponential of "e" as free parameters, and for one such frequency and phase offset instead of many bins.
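Written out (the notation is mine, not from the post), with $z_k = x_k + i\,y_k$ the point on the circle at time $t_k$, that would be something like:

$$
\min_{\omega,\,\varphi} \; \sum_k \left| e^{\,i(\omega t_k + \varphi)} - z_k \right|^2
= \min_{\omega,\,\varphi} \; \sum_k \Big[ \big(\cos(\omega t_k + \varphi) - x_k\big)^2 + \big(\sin(\omega t_k + \varphi) - y_k\big)^2 \Big]
$$

So the only unknowns are the frequency $\omega$ and the phase offset $\varphi$ inside the exponential, instead of one amplitude per FFT bin.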
I'd apply the window to randomly-sampled mini-batches of consecutive points instead of optimizing the neural network on just randomly-sampled batch points or on the whole dataset at once. I guess that using a Hann-Poisson window will make the "gradient" valley easier to "ski down" with gradient descent, which is a greedy algorithm. I guess that the spectral leakage caused by the Hann-Poisson window function will make the gradient landscape decrease more monotonically at every point towards the global minimum.
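A rough sketch of what I mean by windowed mini-batches. The window follows the usual textbook definition of a Hann-Poisson window as I understand it (a Hann window times a two-sided exponential); `alpha=2.0`, the batch length of 65, and the helper names are assumptions, and I haven't tested whether this actually smooths the loss landscape.

```python
import numpy as np

def hann_poisson(length, alpha=2.0):
    """Hann window times a Poisson (two-sided exponential) window."""
    n = np.arange(length)
    N = length - 1
    hann = 0.5 * (1.0 - np.cos(2.0 * np.pi * n / N))
    poisson = np.exp(-alpha * np.abs(N - 2.0 * n) / N)
    return hann * poisson

def windowed_batch(t, x, y, batch_len, rng):
    """Pick a random run of consecutive points plus per-point window weights."""
    start = rng.integers(0, t.size - batch_len + 1)   # upper bound is exclusive
    sl = slice(start, start + batch_len)
    return t[sl], x[sl], y[sl], hann_poisson(batch_len)

# Made-up helix data, just to show the shapes involved.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 200)
x, y = np.cos(2.3 * t), np.sin(2.3 * t)

tb, xb, yb, wts = windowed_batch(t, x, y, 65, rng)
# wts would then multiply the squared residuals of that batch in the loss.
```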