by standevbob
2003 days ago
That's right: Stan doesn't have any online learning facilities. It's very hard to approximate posteriors and chain them, so we don't try. If by "big data" we mean data too big to fit in memory, that's also right. Stan is fully in-memory. Compute can be distributed, and matrix ops can be GPU-powered, but all of the data, the parameters, and the core autodiff expression graph need to fit in memory.

For "medium data", Stan's adaptive Hamiltonian Monte Carlo sampling is much more efficient than Gibbs or Metropolis, and it scales much better to complex models and higher dimensions.

I'm fitting a Covid prevalence model using a custom trend-following and mean-reverting second-order autoregression over 400 distinct regions with weekly data. It has 5M data points and 10K parameters, and it adjusts for the sensitivity and specificity of the various tests taken. It fits in a single thread using MCMC in 24 hours or so, but we can fit the model with variational inference in a couple of minutes. Although variational inference often produces reasonable point estimates in bigger data settings, it doesn't reasonably quantify uncertainty.

I'm also working on a genomics model for differential expression of splice variants that involves 120K measurements and just as many parameters to deal with overdispersion of biological replicates in a control group and a treatment group. We're using variational inference, and it fits in a couple of minutes for the comparative event probabilities we need to estimate.
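To make the "adjusts for sensitivity and specificity" and "second-order autoregression" parts concrete, here is a rough Python sketch of the two ingredients. It is not the model described above, which isn't shown in this thread. It assumes a plain AR(2) on the logit scale that reverts to a long-run mean, and the standard formula for the probability of a positive test given imperfect sensitivity and specificity. All function and parameter names (`positive_test_prob`, `simulate_ar2_prevalence`, `mu`, `phi1`, `phi2`, `sigma`) are made up for illustration.

```python
import math
import random


def positive_test_prob(prevalence, sensitivity, specificity):
    """Probability that a randomly chosen test is positive.

    True positives come from infected people (prevalence * sensitivity);
    false positives come from uninfected people ((1 - prevalence) * (1 - specificity)).
    """
    return prevalence * sensitivity + (1.0 - prevalence) * (1.0 - specificity)


def inv_logit(x):
    """Map a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_ar2_prevalence(n_weeks, mu, phi1, phi2, sigma, seed=0):
    """Simulate weekly prevalence whose logit follows an AR(2) around mu.

    Assumed (hypothetical) form:
        z[t] ~ Normal(mu + phi1 * (z[t-1] - mu) + phi2 * (z[t-2] - mu), sigma)
    The first two weeks are simply started at the long-run mean mu.
    """
    rng = random.Random(seed)
    z = [mu, mu]
    for _ in range(2, n_weeks):
        mean = mu + phi1 * (z[-1] - mu) + phi2 * (z[-2] - mu)
        z.append(rng.gauss(mean, sigma))
    return [inv_logit(v) for v in z[:n_weeks]]
```

For example, one region's expected weekly positive-test rate under these assumptions:

```python
prev = simulate_ar2_prevalence(n_weeks=10, mu=-3.0, phi1=1.2, phi2=-0.3, sigma=0.1)
rates = [positive_test_prob(p, sensitivity=0.9, specificity=0.98) for p in prev]
```

In a real Stan program the autoregression coefficients, noise scale, and test characteristics would be parameters with priors rather than fixed numbers, and the positive-test probability would feed a binomial likelihood for the observed counts.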