Hacker News new | ask | show | jobs
by roadside_picnic 309 days ago
Basically they're incredibly useful for any situation where you have "medium" data where you don't have enough data to properly train a NN (which are very data hungry in practice) but enough data that you're not really exploiting all the information using a more traditional approach.

GPs essentially allow you to get a lot of the power of a NN while also being able to encode a bunch of domain knowledge you have (which is necessary when you don't have enough data for the model to effectively learn that domain knowledge). On top of that, you get variance estimates which are very important for things like forecasting.

The only real draw back to GPs is that they absolutely do not fit into the "fit/predict" paradigm. Properly building a scalable GP takes a more deeper understanding of the model than most cases. The mathematical foundations required to really understand what's happening when you train a sparse GP greatly exceed what is required to understand a NN, and on top of that there is a fair amount of practical insight into kernel development that is required as well. But the payoff is fantastic.

It's worth recognizing that, once you realize that "attention" is really just kernel smoothing, transformers are essentially learning sophisticated stacked kernels, so ultimately share a lot in common with GPs.