by saeranv 15 days ago

> Why is the 'true process' changing here? I understand our best guess or model is changing with new observations, but the true process should not be changing. If it actually is, then the formulation should be changed to isolate the parameters that is feeding back to it.

He's not saying the true process is changing, just the functions that are being sampled from the GP. The true process refers to the true, underlying function so it's deterministic if you have correctly identified all its inputs.

> So is the shape of each function changing?

Yes, the function changes shape as you get more data because the parameters governing that function (that we define in the kernel) are updated with new observational data, so that over time it converges to the 'true' process/function we are trying to discover.

> What is the 'distribution' over the functions doing? Is that also changing? Is the said 'distribution' just flat mean of these functions?

I think you're confused because the example given with cheese is really confusing when we're trying to understand the functions as arising from a multivariate distribution. So, I'll try to clarify that part. GPs are typically used to represent some function where the input is time or distance. This is why it's called a 'process': the variables in a random process are indexed by space or time.

So in this 1D example, in the X domain, [x1, x2, x3] represents something like fixed increments of increasing cheese. f(X) represents the gold amount. Now imagine gold can take any value from 0-100. Now plot all possible values of f(x1) on the x-axis of a grid, f(x2) on the y-axis of the grid, and f(x3) on the z-axis of the grid. We have roughly 100^3 points in this 3D grid (treating gold as taking whole-number values). If we select one point, its x, y, z coordinates correspond to the f(x1), f(x2) and f(x3) gold amounts. The dimension index corresponds (typically) to something like time or distance. In this example it's cheese.

In a GP, we're modeling the sampled f(X) point as if it's from a 3D multivariate normal distribution. So sampling one point gives us the gold amount for cheese amount 1, 2, and 3. This is the 'function', and as we sample more points, we get more 'functions' that give us varying gold amounts for cheese amount 1, 2, and 3. And because it's a multivariate distribution, we can capture correlations between dimensions, so the amount of gold you get for cheese-1 should influence how much gold you get at cheese-2 because it's close by. This relationship is defined by the covariance function of the Gaussian.
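To make the "one sampled point is one function" idea concrete, here is a minimal sketch in Python with NumPy. The squared-exponential kernel, the flat mean of 50 gold, and the variance and lengthscale values are all assumptions picked for illustration; none of them come from the article.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between every pair of inputs (assumed kernel)."""
    diff = xa[:, None] - xb[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

rng = np.random.default_rng(0)

cheese = np.array([1.0, 2.0, 3.0])   # x1, x2, x3
mean = np.full(3, 50.0)              # assumed flat mean gold amount
cov = rbf_kernel(cheese, cheese, lengthscale=1.0, variance=100.0)

# Each row is one draw from the 3D multivariate normal, i.e. one "function":
# the gold amounts at cheese 1, 2 and 3.
samples = rng.multivariate_normal(mean, cov, size=5)

for row in samples:
    print(np.round(row, 1))
```

Nothing here clips the draws to 0-100, so a sample can fall outside that range. Within each row, neighbouring cheese amounts tend to get similar gold amounts because `cov[0, 1]` is larger than `cov[0, 2]`.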

> GP(m(x), k(x, x')) What is 'x' here? (Sigh! We need to learn to define the variables before using.) I can infer that x' is not derivative of x.

x refers to some amount of cheese, and k(x, x') just means that the kernel consumes any two values in our X vector (e.g. [x1, x3] or [x1, x2]).

> "In the context of GPs, a kernel or covariance function k(x, x') = Cov(f(x), f(x')), encodes which function values should vary together." It does not seem the 'f' here is intended to be the specific 'f' introduced at the beginning of the article.

I believe it is the same f actually. He's saying the kernel function takes in two values of x (cheese), and outputs the covariance between their output gold amounts. This illustrates his previous point that the "closeness" between x values should be reflected in the gold amounts.
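A quick numerical check of k(x, x') = Cov(f(x), f(x')), as a sketch only: the kernel `k` below is an assumed squared-exponential with unit variance, and the zero mean is also an assumption. The idea is to draw many functions and compare their empirical covariance with the kernel matrix.

```python
import numpy as np

def k(x, x_prime, lengthscale=1.0):
    """Assumed squared-exponential kernel: closer inputs give higher covariance."""
    return np.exp(-0.5 * ((x - x_prime) / lengthscale) ** 2)

cheese = np.array([1.0, 2.0, 3.0])
K = k(cheese[:, None], cheese[None, :])   # 3x3 matrix of k(xi, xj)

rng = np.random.default_rng(1)
draws = rng.multivariate_normal(np.zeros(3), K, size=20000)

# Empirical covariance between the gold amounts at each pair of cheese values.
empirical = np.cov(draws, rowvar=False)

print(np.round(K, 2))
print(np.round(empirical, 2))
```

The two printed matrices should agree up to sampling noise, which is the sense in which the kernel "encodes which function values should vary together".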

> The plots now have y and x, and x1 and x2. How are these related?

y is gold. x is cheese. x1, x2 correspond to the first two x-values in the linear plot.

> And with k(x, x') = Cov(f(x), f(x')), what is 'f' for the various kernel functions being plotted.

f(X) is the approximation of the "true" process we're trying to learn from observational data. The observations are tuples of cheese and gold amounts, so f(x) and f(x') are just the corresponding gold amounts; we don't actually model that function explicitly. The Gaussian distribution we are sampling functions from just models correlations between our variables, so it represents the function implicitly.

alok-g 14 days ago

Thanks. I read several times, and along with another response, I think I have a better understanding now, though still not having a complete grasp.

>> So sampling one point gives us the gold amount for cheese amount 1, 2, and 3. This is the 'function', and ...

I get this part, so each point in this N-dimensional space yields a function f of the index, and this is the function.

>> Yes, the function changes shape as you get more data because the parameters governing that function

Getting more data should now give more such points (in N-dimensional space), but with each such point being the 'function', how is it changing shape?

Nevertheless, I think I have much better glimpses after reading your and the other responses here than from the original article, which I still find confusing even on reading again.

saeranv 14 days ago

I said before that the function shape changes as you're updating the parameters that govern the function, but that's actually very misleading (sorry), since the kernel parameters only indirectly govern the function. What the parameters directly govern is the joint probability distribution P(f(x1), f(x2), ..., f(xn)). So the function f is implicitly defined by how likely the entire sequence of f values is.

So how does it change shape? Well, this part is actually something I don't fully grasp myself yet. But I can sketch a crude Bayesian interpretation, which is how I think of it. It's not completely correct, but it works as a placeholder until I fully work out the math of updating the parameters.

Basically, from a Bayesian perspective we can treat the joint distribution of function outputs as a likelihood conditioned on the kernel parameters theta: p(f(x1), f(x2), ... | theta).

Then we can derive the posterior distribution over theta p(theta | f(x1), f(x2), ...) like so:

p(theta | f(x1), f(x2), ...) ∝ p(f(x1), f(x2), ... | theta) p(theta).

So we fit the theta parameters based on how well they explain the observed data we feed our Bayesian model.
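Here is a rough sketch of that posterior-over-theta idea, evaluated on a grid. Everything specific is an assumption for illustration: theta is taken to be a single lengthscale, the kernel is squared-exponential, the GP has zero mean with a small fixed noise term, the prior over theta is flat, and the data are made up.

```python
import numpy as np

def rbf(x, lengthscale):
    """Assumed squared-exponential kernel matrix for inputs x."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def log_likelihood(y, x, lengthscale, noise=0.1):
    """log p(y | theta): log-density of a zero-mean multivariate normal."""
    K = rbf(x, lengthscale) + noise**2 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()                    # -0.5 * log det K
            - 0.5 * len(x) * np.log(2 * np.pi))

# Made-up observations: cheese amounts and (centred) gold amounts.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.sin(x)

thetas = np.linspace(0.2, 5.0, 50)     # candidate lengthscales
log_prior = np.zeros_like(thetas)      # flat prior p(theta), an assumption
log_post = np.array([log_likelihood(y, x, t) for t in thetas]) + log_prior

# Normalise on the grid so the posterior weights sum to 1.
post = np.exp(log_post - log_post.max())
post /= post.sum()

print("most probable lengthscale on the grid:", thetas[post.argmax()])
```

Each new observation changes the likelihood term, which shifts the weight over theta, which in turn changes the joint distribution the functions are drawn from.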

FWIW, I recommend chapter 14 of Richard McElreath's Statistical Rethinking for a better introduction of GPs. This article kind of glosses over a lot of the intuition and introductory concepts that you need to really grok it.