Typically ''P'' represents data, observations, or a precisely calculated probability distribution. The measure ''Q'' typically represents a theory, a model, a description, or an approximation of ''P''.
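For probability distributions ''P'' and ''Q'' of a discrete random variable, the K–L divergence of ''Q'' from ''P'' is defined to be
<math>D_\mathrm{KL}(P\|Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}.</math>
More generally, if ''P'' and ''Q'' are probability measures over a set ''X'' and ''P'' is absolutely continuous with respect to ''Q'', then
<math>D_\mathrm{KL}(P\|Q) = \int_X \log \frac{dP}{dQ} \, dP,</math>
where ''dP''/''dQ'' is the Radon–Nikodym derivative of ''P'' with respect to ''Q''.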
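As a minimal sketch of the discrete case, the defining sum can be evaluated directly. The function name <code>kl_divergence</code>, the default base-2 logarithm, and the example distributions are illustrative assumptions. Terms with ''P''(''i'')&nbsp;=&nbsp;0 are taken to contribute zero, and the result is treated as infinite when ''Q''(''i'')&nbsp;=&nbsp;0 but ''P''(''i'')&nbsp;>&nbsp;0.

```python
import math

def kl_divergence(p, q, base=2.0):
    """D_KL(P||Q) for discrete distributions given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if qi == 0.0:
            return math.inf  # P puts mass where Q has none
        total += pi * math.log(pi / qi, base)
    return total

# Hypothetical example distributions over two outcomes.
p = [0.5, 0.5]
q = [0.25, 0.75]
print(kl_divergence(p, q))  # divergence of Q from P, in bits
print(kl_divergence(q, p))  # generally a different value
```

Swapping the arguments generally changes the result, which is why the order of ''P'' and ''Q'' matters in the definition.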
The K–L divergence is always non-negative, a result known as Gibbs' inequality, with ''D''<sub>KL</sub>(''P''‖''Q'') zero if and only if ''P''&nbsp;=&nbsp;''Q''. The entropy ''H''(''P'') thus sets a minimum value for the cross-entropy ''H''(''P'',''Q''), the expected number of bits required when using a code based on ''Q'' rather than ''P''. The KL divergence therefore represents the expected number of extra bits that must be transmitted to identify a value ''x'' drawn from ''X'' if a code corresponding to the probability distribution ''Q'' is used, rather than one corresponding to the "true" distribution ''P''.
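The "extra bits" reading can be illustrated with a small numerical check that the cross-entropy exceeds the entropy by exactly the KL divergence. This is a sketch only: the helper names and the two distributions are assumptions chosen so that every logarithm is a whole number of bits.

```python
import math

def entropy(p):
    """H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P,Q) in bits: expected code length when coding P with a code built for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P||Q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]  # assumed "true" distribution
q = [0.25, 0.25, 0.5]  # assumed distribution the code was designed for

extra_bits = cross_entropy(p, q) - entropy(p)
print(entropy(p), cross_entropy(p, q), extra_bits, kl_divergence(p, q))
```

For these values the entropy is 1.5 bits, the cross-entropy is 1.75 bits, and the difference of 0.25 bits equals the KL divergence.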
The K–L divergence is not a true metric: it is not symmetric and does not satisfy the triangle inequality.
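The failure of symmetry and of the triangle inequality can be seen in a small numerical counterexample. The three two-outcome distributions below are arbitrary assumptions chosen for illustration, and the helper <code>kl</code> uses natural logarithms.

```python
import math

def kl(p, q):
    """D_KL(P||Q) in nats for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over two outcomes.
P = [0.5, 0.5]
Q = [0.25, 0.75]
R = [0.1, 0.9]

direct = kl(P, R)              # going from P to R in one step
via_q = kl(P, Q) + kl(Q, R)    # going through Q

print("D(P||R) =", direct)
print("D(P||Q) + D(Q||R) =", via_q)
print("triangle inequality holds:", direct <= via_q)
print("symmetric:", math.isclose(kl(P, R), kl(R, P)))
```

Here the direct divergence is larger than the sum of the two intermediate divergences, so the triangle inequality fails for this triple.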
In Bayesian statistics, the KL divergence can be used as a measure of the information gain in moving from a prior distribution to a posterior distribution. If some new fact ''Y''&nbsp;=&nbsp;''y'' is discovered, it can be used to update the probability distribution for ''X'' from ''p''(''x''|''I'') to a new posterior probability distribution ''p''(''x''|''y'') using Bayes' theorem:
<math>p(x|y) = \frac{p(y|x)\, p(x|I)}{p(y|I)}</math>
This distribution has a new entropy ''H''(''p''(''x''|''y'')) = &minus;&sum;<sub>''x''</sub> ''p''(''x''|''y'') log ''p''(''x''|''y''), which may be less than or greater than the original entropy ''H''(''p''(''x''|''I'')).
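A short sketch of the Bayesian update and the resulting information gain follows. The prior, the likelihood values for the observed ''y'', and the helper names are all hypothetical. The numbers are chosen so that the posterior entropy is larger than the prior entropy while the information gain, measured as the KL divergence of the prior from the posterior, is still positive.

```python
import math

def bayes_update(prior, likelihood):
    """Posterior p(x|y) from the prior p(x|I) and the likelihood p(y|x) of one observed y."""
    joint = [pr * lk for pr, lk in zip(prior, likelihood)]
    evidence = sum(joint)  # p(y|I)
    return [j / evidence for j in joint]

def entropy(p):
    """Entropy in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(P||Q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

prior = [0.5, 0.25, 0.25]      # assumed p(x|I)
likelihood = [0.2, 0.4, 0.4]   # assumed p(y|x) for the observed y
posterior = bayes_update(prior, likelihood)

print("posterior:", posterior)
print("prior entropy:", entropy(prior))
print("posterior entropy:", entropy(posterior))
print("information gain:", kl_divergence(posterior, prior))
```

In this example the observation flattens the distribution, so the entropy rises, yet the divergence from prior to posterior remains strictly positive.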
If a further piece of data, ''Y''<sub>2</sub>&nbsp;=&nbsp;''y''<sub>2</sub>, subsequently comes in, the probability distribution for ''x'' can be updated further, to give a new best guess ''p''(''x''|''y''<sub>1</sub>,''y''<sub>2</sub>). If one reinvestigates the information gain for using ''p''(''x''|''y''<sub>1</sub>) rather than ''p''(''x''|''I''), it turns out that it may be either greater or less than previously estimated:
<math>\sum_x p(x|y_1,y_2) \log \frac{p(x|y_1)}{p(x|I)}</math> may be &le; or &gt; <math>\sum_x p(x|y_1) \log \frac{p(x|y_1)}{p(x|I)}</math>
Another name for the divergence ''D''<sub>KL</sub>(''p''(''x''|''H''<sub>1</sub>)‖''p''(''x''|''H''<sub>0</sub>)), given to it by I. J. Good, is the expected weight of evidence for ''H''<sub>1</sub> over ''H''<sub>0</sub> from each sample.
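The expected weight of evidence is not the same as the information gain expected per sample about the probability distribution ''p''(''H'') of the hypotheses, ''D''<sub>KL</sub>(''p''(''H''|''x'')‖''p''(''H''|''I'')).
For a prior joint distribution ''p''(''x'',''a'') and a new joint distribution ''q''(''x''|''a'')''u''(''a''), the divergence decomposes as
<math>D_\mathrm{KL}(q(x|a)u(a)\|p(x,a)) = \mathbb{E}_{u(a)}\{D_\mathrm{KL}(q(x|a)\|p(x|a))\} + D_\mathrm{KL}(u(a)\|p(a)),</math>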
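The decomposition into an expected conditional divergence plus a divergence between the marginals of ''a'' can be checked numerically. The sketch below assumes binary ''x'' and ''a'' and uses made-up probability tables. The variable names mirror the symbols ''p'', ''q'' and ''u'' but are otherwise hypothetical.

```python
import math

def kl(p, q):
    """D_KL(P||Q) in nats for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Assumed prior: marginal p(a) and conditionals p(x|a), one row per value of a.
p_a = [0.5, 0.5]
p_x_given_a = [[0.9, 0.1], [0.2, 0.8]]

# Assumed new distribution: marginal u(a) and conditionals q(x|a).
u_a = [0.8, 0.2]
q_x_given_a = [[0.7, 0.3], [0.4, 0.6]]

# Flatten both joints in the same (a, x) order.
new_joint = [u * qx for u, row in zip(u_a, q_x_given_a) for qx in row]
old_joint = [p * px for p, row in zip(p_a, p_x_given_a) for px in row]

lhs = kl(new_joint, old_joint)
expected_conditional = sum(
    u * kl(q_row, p_row)
    for u, q_row, p_row in zip(u_a, q_x_given_a, p_x_given_a)
)
rhs = expected_conditional + kl(u_a, p_a)

print(lhs, rhs)
```

Setting <code>q_x_given_a</code> equal to <code>p_x_given_a</code> makes the expected conditional term vanish, leaving only the divergence between ''u''(''a'') and ''p''(''a'').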
and the divergence is minimised if ''q''(''x''|''a'')&nbsp;=&nbsp;''p''(''x''|''a'') over the whole support of ''u''(''a''). This result incorporates Bayes' theorem if the new distribution ''u''(''a'') is in fact a &delta; function representing certainty that ''a'' has one particular value.
The squared Hellinger distance between ''P'' and ''Q'' is
<math>H^2(P,Q) = \frac{1}{2}\int \left(\sqrt{dP} - \sqrt{dQ}\right)^2.</math>
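For discrete distributions the integral of the squared difference of square roots reduces to a sum over outcomes. The following sketch assumes that reading. The function name <code>hellinger_squared</code> and the test distributions are illustrative only.

```python
import math

def hellinger_squared(p, q):
    """Half the sum of squared differences of square roots of the probabilities."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return 0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

print(hellinger_squared([0.5, 0.5], [0.5, 0.5]))    # identical distributions
print(hellinger_squared([1.0, 0.0], [0.0, 1.0]))    # disjoint supports
print(hellinger_squared([0.5, 0.5], [0.25, 0.75]))  # something in between
```

Unlike the KL divergence, this quantity is symmetric in its arguments and stays finite even when the supports differ.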
Taking expectations with respect to Q, we get