Kullback–Leibler divergence




Typically ''P'' represents data, observations, or a precisely calculated probability distribution. The measure ''Q'' typically represents a theory, a model, a description or an approximation of ''P''.

For probability distributions ''P'' and ''Q'' of a discrete random variable, the K–L divergence (or informally K–L distance) of ''Q'' from ''P'' is defined to be

:<math>D_{\mathrm{KL}}(P\|Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}.</math>







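More generally, for probability measures ''P'' and ''Q'' on a set ''X'', with ''P'' absolutely continuous with respect to ''Q'' so that the Radon–Nikodym derivative d''P''/d''Q'' exists,

:<math>D_{\mathrm{KL}}(P\|Q) = \int_X \log \frac{dP}{dQ} \, dP.</math>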
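
As an illustration of the integral form: when ''P'' and ''Q'' both have densities ''p'' and ''q'', the ratio d''P''/d''Q'' reduces to ''p''/''q''. The following Python sketch approximates the integral on a grid for two normal distributions and compares it with the standard closed-form expression. The Gaussian parameters, grid limits and helper names are assumptions chosen for illustration only.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Natural log of the N(mu, sigma^2) density at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def kl_numeric(mu_p, sigma_p, mu_q, sigma_q, lo=-12.0, hi=12.0, steps=24000):
    """Midpoint-rule approximation of the integral of p(x) log(p(x)/q(x)) dx, in nats."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        log_p = log_normal_pdf(x, mu_p, sigma_p)
        log_q = log_normal_pdf(x, mu_q, sigma_q)
        total += math.exp(log_p) * (log_p - log_q) * dx
    return total

def kl_closed_form(mu_p, sigma_p, mu_q, sigma_q):
    """Closed form of D_KL(N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2)), in nats."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
            - 0.5)

# Assumed example: P = N(0, 1), Q = N(1, 4).
numeric = kl_numeric(0.0, 1.0, 1.0, 2.0)
exact = kl_closed_form(0.0, 1.0, 1.0, 2.0)
print(numeric, exact)
```

The grid is truncated at ±12, where the density of ''P'' is negligible for these parameters; wider distributions would need wider limits.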






The Kullback–Leibler divergence is always non-negative, a result known as Gibbs' inequality, with ''D''<sub>KL</sub>(''P''‖''Q'') zero if and only if ''P''&nbsp;=&nbsp;''Q''. The entropy ''H(P)'' thus sets a minimum value for the cross-entropy ''H(P,Q)'', the expected number of bits required when using a code based on ''Q'' rather than ''P''. The KL divergence therefore represents the expected number of extra bits that must be transmitted to identify a value ''x'' drawn from ''X'' if a code is used corresponding to the probability distribution ''Q'', rather than the "true" distribution ''P''.
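
The relationship between entropy, cross-entropy and the extra-bits reading of the KL divergence can be checked numerically. The sketch below uses base-2 logarithms so that all quantities are in bits. The two distributions are assumed example values, and the helpers assume ''Q'' is non-zero wherever ''P'' is non-zero.

```python
import math

def entropy_bits(p):
    """H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """H(P,Q) in bits: expected code length when coding P-distributed data with a code built for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_bits(p, q):
    """D_KL(P || Q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # "true" distribution P (assumed values)
q = [0.25, 0.25, 0.5]   # distribution Q used to build the code (assumed values)

h_p = entropy_bits(p)
h_pq = cross_entropy_bits(p, q)
d_pq = kl_bits(p, q)

# The extra bits are exactly the gap between cross-entropy and entropy.
print(h_p, h_pq, d_pq, h_pq - h_p)
```

For these values the code built for ''Q'' costs 1.75 bits per symbol on average against an entropy of 1.5 bits, so the divergence is 0.25 bits.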




Moreover, ''D''<sub>KL</sub>(''P''‖''Q'') does not satisfy the triangle inequality.
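
A small numerical counterexample makes the failure of the triangle inequality concrete. The sketch below uses three Bernoulli distributions whose parameters are assumed values picked for illustration; many other choices behave the same way.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bernoulli(theta):
    """Distribution over {1, 0} with success probability theta."""
    return [theta, 1.0 - theta]

P = bernoulli(0.5)
Q = bernoulli(0.25)
R = bernoulli(0.1)

d_pr = kl(P, R)   # going "directly" from P to R
d_pq = kl(P, Q)   # first leg via Q
d_qr = kl(Q, R)   # second leg via Q

# A metric would require d_pr <= d_pq + d_qr; here the direct value is larger.
print(d_pr, d_pq + d_qr)
```

Here the direct divergence is roughly 0.51 nats while the two legs through ''Q'' sum to roughly 0.24 nats.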
















In Bayesian statistics the KL divergence can be used as a measure of the information gain in moving from a prior distribution to a posterior distribution. If some new fact ''Y''&nbsp;=&nbsp;''y'' is discovered, it can be used to update the probability distribution for ''X'' from ''p''(''x''|I) to a new posterior probability distribution ''p''(''x''|''y'') using Bayes' theorem:
:<math>p(x|y) = \frac{p(y|x) \, p(x|I)}{p(y|I)}</math>
This distribution has a new entropy, ''H''(''p''(''x''|''y'')) = −∑<sub>''x''</sub> ''p''(''x''|''y'') log ''p''(''x''|''y''), which may be less than or greater than the original entropy ''H''(''p''(''x''|I)).
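
The update and its effect on entropy can be sketched in a few lines of Python. The prior and likelihood values are assumptions chosen so that the posterior entropy is larger than the prior entropy, and the information gain is taken as the divergence of the prior from the posterior, ''D''<sub>KL</sub>(''p''(''x''|''y'')‖''p''(''x''|I)), which is the usual convention.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    """D_KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

prior = [0.9, 0.05, 0.05]      # p(x|I), assumed values
likelihood = [0.1, 0.9, 0.9]   # p(y|x) for the observed y, assumed values

# Bayes' theorem: posterior is likelihood times prior, divided by the evidence p(y|I).
evidence = sum(l * pr for l, pr in zip(likelihood, prior))
posterior = [l * pr / evidence for l, pr in zip(likelihood, prior)]

h_prior = entropy(prior)
h_posterior = entropy(posterior)
information_gain = kl(posterior, prior)

print(posterior, h_prior, h_posterior, information_gain)
```

In this example the observation contradicts a confident prior, so the entropy rises from about 0.39 to about 1.04 nats, yet the information gain is still positive at about 0.51 nats.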


If a further piece of data, ''Y''<sub>2</sub>&nbsp;=&nbsp;''y''<sub>2</sub>, subsequently comes in, the probability distribution for ''x'' can be updated further, to give a new best guess ''p''(''x''|''y''<sub>1</sub>,''y''<sub>2</sub>). If one reinvestigates the information gain for using ''p''(''x''|''y''<sub>1</sub>) rather than ''p''(''x''|I), it turns out that it may be either greater or less than previously estimated:
:<math>\sum_x p(x|y_1,y_2) \log \frac{p(x|y_1)}{p(x|I)}</math> may be less than or greater than <math>\sum_x p(x|y_1) \log \frac{p(x|y_1)}{p(x|I)}</math>
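
The following sketch shows both outcomes. It re-evaluates the gain for using ''p''(''x''|''y''<sub>1</sub>) rather than ''p''(''x''|I), weighting the log ratio by the updated distribution ''p''(''x''|''y''<sub>1</sub>,''y''<sub>2</sub>) instead of ''p''(''x''|''y''<sub>1</sub>). All distributions and likelihoods are assumed values, and the helper names are hypothetical.

```python
import math

def update(dist, likelihood):
    """One Bayesian update: multiply by the likelihood and renormalise."""
    unnormalised = [l * d for l, d in zip(likelihood, dist)]
    z = sum(unnormalised)
    return [u / z for u in unnormalised]

def expected_log_ratio(weights, numerator, denominator):
    """Sum over x of weights(x) * log(numerator(x) / denominator(x)), in nats."""
    return sum(w * math.log(n / d)
               for w, n, d in zip(weights, numerator, denominator) if w > 0)

prior = [1 / 3, 1 / 3, 1 / 3]   # p(x|I), assumed
post_y1 = [0.6, 0.3, 0.1]       # p(x|y1), assumed

# Gain as estimated after seeing only y1.
original_gain = expected_log_ratio(post_y1, post_y1, prior)

# Two hypothetical second observations: one reinforces y1, the other contradicts it.
post_a = update(post_y1, [0.9, 0.3, 0.1])   # p(x|y1,y2) in case A
post_b = update(post_y1, [0.1, 0.3, 0.9])   # p(x|y1,y2) in case B

gain_a = expected_log_ratio(post_a, post_y1, prior)
gain_b = expected_log_ratio(post_b, post_y1, prior)

print(original_gain, gain_a, gain_b)
```

When the second observation reinforces the first, the re-evaluated gain is larger than originally estimated; when it contradicts the first, the re-evaluated gain is smaller and here even negative.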


The Kullback–Leibler divergence ''D''<sub>KL</sub>( ''p''(''x''|''H''<sub>1</sub>) ‖ ''p''(''x''|''H''<sub>0</sub>) ) can also be interpreted as the expected '''discrimination information''' for ''H''<sub>1</sub> over ''H''<sub>0</sub>: the mean information per sample for discriminating in favour of a hypothesis ''H''<sub>1</sub> against a hypothesis ''H''<sub>0</sub>, when hypothesis ''H''<sub>1</sub> is true. Another name for this quantity, given to it by I. J. Good, is the expected weight of evidence for ''H''<sub>1</sub> over ''H''<sub>0</sub> per sample.
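
The "mean information per sample" reading can be illustrated by simulation: drawing samples under ''H''<sub>1</sub> and averaging the log-likelihood ratio should approach the divergence. The sketch below assumes two hypothetical coin models and a fixed random seed; it is a Monte Carlo check, not a formal argument.

```python
import math
import random

def kl(p, q):
    """D_KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Distributions over x in {0, 1}; index is the outcome (assumed values).
p_h1 = [0.3, 0.7]   # p(x|H1): biased coin
p_h0 = [0.5, 0.5]   # p(x|H0): fair coin

expected_weight = kl(p_h1, p_h0)

rng = random.Random(0)      # fixed seed so the run is repeatable
n_samples = 100_000
total = 0.0
for _ in range(n_samples):
    x = 1 if rng.random() < p_h1[1] else 0       # sample x when H1 is true
    total += math.log(p_h1[x] / p_h0[x])         # weight of evidence from this sample
sample_mean = total / n_samples

print(expected_weight, sample_mean)
```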
The information gain (''IG'') about the hypotheses themselves is a different quantity:
:''IG'' = ''D''<sub>KL</sub>( ''p''(''H''|''x'') ‖ ''p''(''H''|I) )
:<math>D_\mathrm{KL}(q(x|a)u(a)\|p(x,a)) = \mathbb{E}_{u(a)}\{D_\mathrm{KL}(q(x|a)\|p(x|a))\} + D_\mathrm{KL}(u(a)\|p(a)),</math>
i.e. the sum of the KL divergence of ''p''(''a''), the prior distribution for ''a'', from the updated distribution ''u''(''a''), plus the expected value (using the probability distribution ''u''(''a'')) of the KL divergence of the prior conditional distribution ''p''(''x''|''a'') from the new conditional distribution ''q''(''x''|''a''). This is minimised if ''q''(''x''|''a'')&nbsp;=&nbsp;''p''(''x''|''a'') over the whole support of ''u''(''a''). This result incorporates Bayes' theorem if the new distribution ''u''(''a'') is in fact a δ function representing certainty that ''a'' has one particular value.
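
The decomposition can be verified numerically on a small discrete example. In the sketch below the prior joint ''p''(''x'',''a''), the updated marginal ''u''(''a'') and the new conditional ''q''(''x''|''a'') are all assumed values over two-valued ''x'' and ''a''; the helper names are hypothetical.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Prior joint p(x, a): rows indexed by a, columns by x (assumed values).
p_joint = [[0.10, 0.30],
           [0.40, 0.20]]
p_a = [sum(row) for row in p_joint]                                   # prior marginal p(a)
p_x_given_a = [[v / p_a[a] for v in row] for a, row in enumerate(p_joint)]

u_a = [0.7, 0.3]                  # updated marginal u(a), assumed
q_x_given_a = [[0.5, 0.5],        # new conditional q(x|a), assumed
               [0.9, 0.1]]

def joint_divergence(q_cond):
    """D_KL(q(x|a) u(a) || p(x, a)), computed directly on the flattened joints."""
    new_joint = [q_cond[a][x] * u_a[a] for a in range(2) for x in range(2)]
    old_joint = [p_joint[a][x] for a in range(2) for x in range(2)]
    return kl(new_joint, old_joint)

lhs = joint_divergence(q_x_given_a)
expected_conditional = sum(u_a[a] * kl(q_x_given_a[a], p_x_given_a[a]) for a in range(2))
marginal = kl(u_a, p_a)
rhs = expected_conditional + marginal

# Choosing q(x|a) = p(x|a) removes the conditional term, leaving only the marginal term.
minimum = joint_divergence(p_x_given_a)

print(lhs, rhs, minimum, marginal)
```

The two sides agree up to floating-point rounding, and setting the new conditional equal to the prior conditional leaves only ''D''<sub>KL</sub>(''u''(''a'')‖''p''(''a'')), which is the smallest value attainable for a fixed ''u''(''a'').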










:<math>H^2(P,Q) = \frac{1}{2}\int \left(\sqrt{dP} - \sqrt{dQ}\right)^2 </math>
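
This quantity is commonly known as the squared Hellinger distance. For discrete distributions the integral becomes a sum over outcomes, which the sketch below evaluates for assumed example values. Unlike the KL divergence, it is symmetric in its arguments and, with the factor of one half, lies between 0 and 1.

```python
import math

def hellinger_sq(p, q):
    """Squared Hellinger distance: 0.5 * sum of (sqrt(p_i) - sqrt(q_i))^2."""
    return 0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

p = [0.5, 0.25, 0.25]   # assumed example distribution
q = [0.25, 0.25, 0.5]   # assumed example distribution

h2 = hellinger_sq(p, q)
h2_reversed = hellinger_sq(q, p)                    # same value: the measure is symmetric
h2_disjoint = hellinger_sq([1.0, 0.0], [0.0, 1.0])  # disjoint supports give the maximum

print(h2, h2_reversed, h2_disjoint)
```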



Taking expectations with respect to Q, we get