|
|
|
|
|
by bencw
2242 days ago
|
|
A common type of example involves relatively small or uninformative datasets. Say you flip a coin a few times and only get heads. Your maximum likelihood (frequentist) estimate is that the coin will always land heads. In a Bayesian setting, if you have a (say uniform) prior on the probability that the coin lands heads, your maximum a posteriori estimate of this probability will be non-zero, but will get continue to get smaller if you continue only seeing heads. The above example is contrived, but makes more sense in the case of language modelling. Since a bag-of-words vector, containing say counts of words seen in a document, is typically sparse (most documents only contain a limited portion of the full vocabulary), a frequentist estimate of word probability will say that certain words can never occur, just because it's never seen them. The Bayesian estimate will still assign some non-zero chance of seeing that word. Practically speaking, this leads to the idea of "smoothing" in tf-idf (text-frequency-inverse-document-frequency) vectors, by adding 1 to document frequencies. You don't need Bayesian statistics to do this, but maybe you never would have thought of it otherwise! |
|
Not quite. If you have a uniform prior, there will be no difference between MAP and MLE.
>From the vantage point of Bayesian inference, MLE is a special case of maximum a posteriori estimation (MAP) that assumes a uniform prior distribution of the parameters.
https://en.wikipedia.org/wiki/Maximum_likelihood_estimation
More discussion here:
https://stats.stackexchange.com/questions/64259/how-does-a-u...