Correlation




In probability theory and statistics, correlation, also called '''correlation coefficient''', indicates the strength and direction of a linear relationship between two random variables. In general statistical usage, ''correlation'' or co-relation refers to the departure of two variables from independence. In this broad sense there are several coefficients, measuring the degree of correlation, adapted to the nature of the data.

A number of different coefficients are used for different situations. The best known is the Pearson product-moment correlation coefficient, which is obtained by dividing the covariance of the two variables by the product of their standard deviations. Despite its name, it was first introduced by Francis Galton.


PEARSON'S PRODUCT-MOMENT COEFFICIENT

See Also: Pearson product-moment correlation coefficient



Mathematical properties

The correlation coefficient ρ<sub>''X'',''Y''</sub> between two random variables ''X'' and ''Y'' with expected values μ<sub>''X''</sub> and μ<sub>''Y''</sub> and standard deviations σ<sub>''X''</sub> and σ<sub>''Y''</sub> is defined as:

: \rho_{X,Y}={\mathrm{cov}(X,Y) \over \sigma_X \sigma_Y} ={E((X-\mu_X)(Y-\mu_Y)) \over \sigma_X\sigma_Y},
where ''E'' is the expected value operator and cov means covariance.
Since μ<sub>''X''</sub> = E(''X''),
σ<sub>''X''</sub><sup>2</sup> = E(''X''<sup>2</sup>) − E<sup>2</sup>(''X'') and
likewise for ''Y'', we may also write

: \rho_{X,Y}=\frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^2)-E^2(X)}~\sqrt{E(Y^2)-E^2(Y)}}.

The correlation is defined only if both of the standard deviations are finite and both of them are nonzero. It is a corollary of the Cauchy-Schwarz inequality that the correlation cannot exceed 1 in absolute value.

The correlation is 1 in the case of an increasing linear relationship, −1 in the case of a decreasing linear relationship, and some value in between in all other cases, indicating the degree of linear dependence between the variables. The closer the coefficient is to either −1 or 1, the stronger the correlation between the variables.
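
As a rough illustration of the definition and of the limiting values ±1, the following Python sketch evaluates the correlation of a small discrete joint distribution in two ways: from the covariance and from the raw moments. The function name `population_correlation` and the three example distributions are assumptions made up for this illustration.

```python
from math import sqrt

def population_correlation(pmf):
    """Return rho computed two ways for a joint pmf {(x, y): probability}."""
    e_x = sum(p * x for (x, y), p in pmf.items())
    e_y = sum(p * y for (x, y), p in pmf.items())
    e_xy = sum(p * x * y for (x, y), p in pmf.items())
    e_x2 = sum(p * x * x for (x, y), p in pmf.items())
    e_y2 = sum(p * y * y for (x, y), p in pmf.items())
    sigma_x = sqrt(e_x2 - e_x ** 2)
    sigma_y = sqrt(e_y2 - e_y ** 2)
    # Form 1: cov(X, Y) / (sigma_X * sigma_Y)
    cov = sum(p * (x - e_x) * (y - e_y) for (x, y), p in pmf.items())
    rho_from_cov = cov / (sigma_x * sigma_y)
    # Form 2: (E(XY) - E(X)E(Y)) / (sigma_X * sigma_Y)
    rho_from_moments = (e_xy - e_x * e_y) / (sigma_x * sigma_y)
    return rho_from_cov, rho_from_moments

# Hypothetical distributions: exact increasing line, exact decreasing line,
# and a noisy positive association.
increasing = {(x, 2 * x + 1): 0.25 for x in range(4)}
decreasing = {(x, -3 * x): 0.25 for x in range(4)}
mixed = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

for name, pmf in [("increasing", increasing), ("decreasing", decreasing), ("mixed", mixed)]:
    print(name, population_correlation(pmf))
```

The two forms should agree up to rounding, and only the exactly linear cases reach the bounds.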

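If the variables are independent, then the correlation is 0, but the converse is not true in general, because the correlation coefficient detects only linear dependence between two variables. However, in the special case when ''X'' and ''Y'' are jointly normal, independence is equivalent to uncorrelatedness.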
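
Outside the jointly normal case, zero correlation does not by itself imply independence. A minimal sketch, assuming a hypothetical variable ''X'' that takes the values −1, 0 and 1 with equal probability and ''Y'' = ''X''<sup>2</sup>, uses exact fractions to show that the covariance vanishes although ''Y'' is completely determined by ''X'':

```python
from fractions import Fraction

# Assumed example: X is uniform on {-1, 0, 1} and Y = X**2.
xs = [-1, 0, 1]
p = Fraction(1, 3)

e_x = sum(p * x for x in xs)            # E(X)
e_y = sum(p * x ** 2 for x in xs)       # E(Y)
e_xy = sum(p * x * x ** 2 for x in xs)  # E(XY) = E(X**3)
covariance = e_xy - e_x * e_y

# Dependence check: P(Y = 1) differs from P(Y = 1 | X = 1).
p_y1 = sum(p for x in xs if x ** 2 == 1)
p_y1_given_x1 = Fraction(1)  # Y = 1 is certain once X = 1 is known

print("covariance:", covariance)
print("P(Y=1):", p_y1, " P(Y=1 | X=1):", p_y1_given_x1)
```

Since the covariance is zero, the correlation is zero as well, yet the conditional and unconditional probabilities differ, so the variables are dependent.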

A correlation between two variables is diluted in the presence of measurement error around estimates of one or both variables, in which case disattenuation provides a more accurate coefficient.
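
One common form of disattenuation is Spearman's correction, which divides the observed correlation by the square root of the product of the two measurement reliabilities. The sketch below assumes that form; the function name `disattenuate` and the numbers are hypothetical.

```python
from math import sqrt

def disattenuate(observed_r, reliability_x, reliability_y):
    """Spearman's correction for attenuation (assumed form):
    observed correlation / sqrt(reliability of x * reliability of y)."""
    if not (0 < reliability_x <= 1 and 0 < reliability_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return observed_r / sqrt(reliability_x * reliability_y)

# Hypothetical values: observed r = 0.36, reliabilities 0.81 and 0.64.
corrected = disattenuate(0.36, 0.81, 0.64)
print(corrected)
```

With perfectly reliable measurements (both reliabilities equal to 1) the correction leaves the coefficient unchanged.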


The sample correlation


If we have a series of ''n'' measurements of ''X'' and ''Y'' written as ''x<sub>i</sub>'' and ''y<sub>i</sub>'' where ''i'' = 1, 2, ..., ''n'', then the Pearson product-moment correlation coefficient can be used to estimate the correlation of ''X'' and ''Y''. The Pearson coefficient is also known as the "sample correlation coefficient". It is especially important if ''X'' and ''Y'' are both normally distributed. The Pearson correlation coefficient is then the best estimate of the correlation of ''X'' and ''Y''. The Pearson correlation coefficient is written:

: r_{xy}=\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{(n-1) s_x s_y},


where \bar{x} and \bar{y} are the sample means of ''X'' and ''Y'', ''s<sub>x</sub>'' and ''s<sub>y</sub>'' are the sample standard deviations of ''X'' and ''Y'', and the sum is from ''i'' = 1 to ''n''. As with the population correlation, we may rewrite this as

: r_{xy}=\frac{n\sum x_iy_i-\sum x_i\sum y_i}{\sqrt{n\sum x_i^2-(\sum x_i)^2}~\sqrt{n\sum y_i^2-(\sum y_i)^2}}.


Again, as is true with the population correlation, the absolute value of the sample correlation must be less than or equal to 1. Though the above formula conveniently suggests a single-pass algorithm for calculating sample correlations, it is notorious for its numerical instability (see below for something more accurate).
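
The two expressions for the sample correlation can be compared directly on a small data set. The following Python sketch is only an illustration; the function names and the five data points are assumptions, and the data are small enough that rounding is not an issue.

```python
from math import sqrt
from statistics import mean, stdev

def r_from_deviations(xs, ys):
    """Sum of products of deviations divided by (n - 1) * s_x * s_y."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / ((n - 1) * stdev(xs) * stdev(ys))

def r_from_raw_sums(xs, ys):
    """Rewritten form using only sums of x, y, x*y, x**2 and y**2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# Hypothetical measurements.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
print(r_from_deviations(xs, ys), r_from_raw_sums(xs, ys))
```

For these values both forms give a correlation of 0.8 up to rounding.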

The square of the sample correlation coefficient, which is also known as the coefficient of determination, is the fraction of the variance in ''y<sub>i</sub>'' that is accounted for by a linear fit of ''x<sub>i</sub>'' to ''y<sub>i</sub>''. This is written


: r_{xy}^2=1-\frac{s_{y|x}^2}{s_y^2},

where ''s''<sub>''y''|''x''</sub><sup>2</sup> is the square of the error of a linear regression of ''x<sub>i</sub>'' on ''y<sub>i</sub>'' by the equation ''y = a + bx'':

: s_{y|x}^2=\frac{1}{n-1}\sum_{i=1}^n (y_i-a-bx_i)^2.

Because the sample correlation coefficient is symmetric in ''x<sub>i</sub>'' and ''y<sub>i</sub>'', the same value is obtained with the roles of the two variables exchanged:

: r_{xy}^2=1-\frac{s_{x|y}^2}{s_x^2}.

The analogous expression for a variable ''z'' fitted linearly on ''x'' and ''y'' is

: r^2=1-\frac{s_{z|xy}^2}{s_z^2}.

The cosine of the angle θ between two data vectors '''x''' and '''y''' evaluates, in two worked cases, to

: \cos \theta = \frac { \bold{x} \cdot \bold{y} } { \left\| \bold{x} \right\| \left\| \bold{y} \right\| } = \frac { 2.93 } { \sqrt { 103 } \sqrt { 0.0983 } } = 0.920814711

: \cos \theta = \frac { \bold{x} \cdot \bold{y} } { \left\| \bold{x} \right\| \left\| \bold{y} \right\| } = \frac { 0.308 } { \sqrt { 30.8 } \sqrt { 0.00308 } } = 1.
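
Both ideas can be checked numerically. The Python sketch below first fits ''y = a + bx'' by least squares and compares one minus the ratio of residual variance to the variance of ''y'' against the squared sample correlation, and then computes the cosine of the angle between two vectors before and after subtracting their means. All names are illustrative. The regression data are made up, and the two vectors `u` and `v` are an assumption: they were picked because their dot product and squared norms are 2.93, 103 and 0.0983.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def cosine(u, v):
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (sqrt(sum(p * p for p in u)) * sqrt(sum(q * q for q in v)))

# Part 1: coefficient of determination from a least-squares line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
n = len(xs)
mx, my = mean(xs), mean(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
s2_residual = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys)) / (n - 1)
s2_y = sum((y - my) ** 2 for y in ys) / (n - 1)
r_squared_from_fit = 1 - s2_residual / s2_y
r = pearson_r(xs, ys)
print(r_squared_from_fit, r ** 2)

# Part 2: cosine of the angle, raw and after centering (assumed vectors).
u = [1.0, 2.0, 3.0, 5.0, 8.0]
v = [0.11, 0.12, 0.13, 0.15, 0.18]
cos_raw = cosine(u, v)
u_centered = [p - mean(u) for p in u]
v_centered = [q - mean(v) for q in v]
cos_centered = cosine(u_centered, v_centered)
print(cos_raw, cos_centered, pearson_r(u, v))
```

In this sketch the cosine of the centered vectors coincides with the sample correlation of `u` and `v`, while the cosine of the raw vectors does not.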


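Correlation between the columns of a data matrix can be removed by a linear transform. Let ''X'' be an ''m''-by-''n'' data matrix with one sample per row, let ''Z''<sub>''r'',''c''</sub> denote an ''r''-by-''c'' matrix with every element equal to 1, and let ''D'' be ''X'' with the column means removed. The transformed data ''T'' are given by

: D = X - \frac{1}{m} Z_{m,m} X
: T = D (D^T D)^{-\frac{1}{2}},

where an exponent of −1/2 represents the matrix square root of the inverse of a matrix. The covariance matrix of ''T'' will be the identity matrix (up to the normalizing constant that depends on the number of samples). If a new data sample ''x'' is a row vector of ''n'' elements, then the same transform can be applied to ''x'' to get the transformed vectors ''d'' and ''t'':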

: d = x - \frac{1}{m} Z_{1,m} X
: t = d (D^T D)^{-\frac{1}{2}}.
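
A minimal pure-Python sketch of this transform for the special case of two variables is given below. The closed-form 2-by-2 inverse square root, the helper names and the four sample rows are all assumptions chosen to keep the example self-contained; a general implementation would use a linear-algebra library instead.

```python
from math import sqrt

def center(rows):
    """Subtract each column mean; also return the means for later reuse."""
    m = len(rows)
    means = [sum(col) / m for col in zip(*rows)]
    return [[v - mu for v, mu in zip(row, means)] for row in rows], means

def inv_sqrt_2x2(a, b, c):
    """Inverse square root of the symmetric positive definite matrix [[a, b], [b, c]]."""
    s = sqrt(a * c - b * b)      # square root of the determinant
    t = sqrt(a + c + 2 * s)      # normaliser of the 2x2 closed form
    # Square root R = (A + s*I) / t, by the Cayley-Hamilton identity.
    r11, r12, r22 = (a + s) / t, b / t, (c + s) / t
    det = r11 * r22 - r12 * r12
    return [[r22 / det, -r12 / det], [-r12 / det, r11 / det]]

def matmul(rows, w):
    """Multiply an m-by-2 matrix by a 2-by-2 matrix."""
    return [[sum(r[k] * w[k][j] for k in range(2)) for j in range(2)] for r in rows]

# Hypothetical data: four samples of two variables.
X = [[2.0, 1.0], [4.0, 3.0], [6.0, 2.0], [8.0, 6.0]]
D, means = center(X)
a = sum(d[0] * d[0] for d in D)   # entries of D^T D
b = sum(d[0] * d[1] for d in D)
c = sum(d[1] * d[1] for d in D)
W = inv_sqrt_2x2(a, b, c)         # (D^T D)^(-1/2)
T = matmul(D, W)

# T^T T should be the 2-by-2 identity up to rounding.
gram = [[sum(t[i] * t[j] for t in T) for j in range(2)] for i in range(2)]
print(gram)

# The same transform applied to a new sample (row vector).
x_new = [5.0, 4.0]
d_new = [v - mu for v, mu in zip(x_new, means)]
t_new = matmul([d_new], W)[0]
print(t_new)
```

The printed matrix is the product of the transposed result with itself, which equals the covariance matrix of the transformed data apart from the normalizing constant.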


COMMON MISCONCEPTIONS ABOUT CORRELATION



Correlation and causality


The conventional dictum that "correlation does not imply causation" means that correlation cannot, by itself, be validly used to infer a causal relationship between the variables. This dictum should not be taken to mean that correlations cannot indicate causal relations. However, the causes underlying the correlation, if any, may be indirect and unknown. Consequently, establishing a correlation between two variables is not a sufficient condition to establish a causal relationship (in either direction).

Here is a simple example: hot weather may cause both crime and ice-cream purchases. Therefore crime is correlated with ice-cream purchases. But crime does not cause ice-cream purchases and ice-cream purchases do not cause crime.
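
The hot-weather example can be imitated with a small simulation. The model below is entirely hypothetical: temperature, crime and ice-cream purchases are standardized made-up quantities, and crime and purchases each depend on temperature plus independent noise, with no direct link between them.

```python
import random
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000
temperature = [rng.gauss(0, 1) for _ in range(n)]
crime_noise = [rng.gauss(0, 0.5) for _ in range(n)]
ice_noise = [rng.gauss(0, 0.5) for _ in range(n)]

# Both quantities are driven by temperature; neither influences the other.
crime = [t + e for t, e in zip(temperature, crime_noise)]
ice_cream = [t + e for t, e in zip(temperature, ice_noise)]

r_raw = pearson(crime, ice_cream)
# With the temperature contribution subtracted, only the noise terms remain.
r_without_temperature = pearson(crime_noise, ice_noise)
print(r_raw, r_without_temperature)
```

Under these assumptions the raw correlation is strongly positive, while the correlation that remains after removing the common cause is close to zero.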

Another example is a study which showed that male coffee drinkers had an increased risk of lung cancer. It does not necessarily follow that drinking coffee can cause lung cancer in males. In fact, further examination revealed a strong correlation between coffee drinking and smoking tobacco in males.

A correlation between age and height in children is fairly causally transparent, but a correlation between mood and health in people is less so. Does improved mood lead to improved health? Or does good health lead to good mood? Or does some other factor underlie both? Or is it pure coincidence? In other words, a correlation can be taken as evidence for a possible causal relationship, but cannot indicate what the causal relationship, if any, might be.


Correlation and linearity


While Pearson correlation indicates the strength of a linear relationship between two variables, its value alone may not be sufficient to evaluate this relationship, especially in the case where the assumption of normality is incorrect.

The image on the right shows scatterplots of Anscombe's quartet, a set of four different pairs of variables created by Francis Anscombe (Anscombe, Francis J. (1973) Graphs in statistical analysis. ''American Statistician'', 27, 17–21). The four ''y'' variables have the same mean (7.5), variance (4.12), correlation (0.81) and regression line (''y'' = 3 + 0.5''x''). However, as can be seen on the plots, the distribution of the variables is very different. The first one (top left) seems to be distributed normally, and corresponds to what one would expect when considering two variables correlated and following the assumption of normality. The second one (top right) is not distributed normally; while an obvious relationship between the two variables can be observed, it is not linear, and the Pearson correlation coefficient is not relevant. In the third case (bottom left), the linear relationship is perfect, except for one outlier which exerts enough influence to lower the correlation coefficient from 1 to 0.81. Finally, the fourth case (bottom right) shows that a single outlier is enough to produce a high correlation coefficient, even though the relationship between the two variables is not linear.

These examples indicate that the correlation coefficient, as a summary statistic, cannot replace the individual examination of the data.
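
The summary statistics of the quartet can be recomputed with a few lines of Python. The data values below are the ones commonly reproduced from Anscombe (1973) and are typed in here from memory of that table, so they should be treated as an assumption and checked against the paper before being relied on.

```python
from math import sqrt
from statistics import mean, variance

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Mean of y, sample variance of y, and correlation for each data set.
summary = {
    name: (mean(ys), variance(ys), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}
for name, (m, v, r) in summary.items():
    print(f"{name}: mean={m:.2f} variance={v:.2f} r={r:.3f}")
```

The four rows of output are nearly identical even though the scatterplots differ, which is the point of the quartet.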


COMPUTING CORRELATION ACCURATELY IN A SINGLE PASS

The following algorithm (in pseudocode) will estimate correlation with good numerical stability:

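sum_sq_x = 0
sum_sq_y = 0
sum_coproduct = 0
mean_x = x[1]
mean_y = y[1]
for i in 2 to N:
    sweep = (i - 1.0) / i
    delta_x = x[i] - mean_x
    delta_y = y[i] - mean_y
    # accumulate deviations from the running means
    sum_sq_x += delta_x * delta_x * sweep
    sum_sq_y += delta_y * delta_y * sweep
    sum_coproduct += delta_x * delta_y * sweep
    # update the running means
    mean_x += delta_x / i
    mean_y += delta_y / i
pop_sd_x = sqrt( sum_sq_x / N )
pop_sd_y = sqrt( sum_sq_y / N )
cov_x_y = sum_coproduct / N
correlation = cov_x_y / (pop_sd_x * pop_sd_y)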

For an enlightening experiment, check the correlation of {900,000,000 + i for i=1...100} with {900,000,000 - i for i=1...100}, perhaps with a few values modified. Poor algorithms will fail.
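
The experiment can be tried with the following Python sketch, which transcribes the running-mean algorithm and places the raw-sums formula next to it for comparison. The function names are made up for this illustration, and the values are converted to floating point on purpose, since exact integer arithmetic would hide the problem.

```python
from math import sqrt

def stable_correlation(xs, ys):
    """Single-pass correlation using running means (as in the pseudocode)."""
    n = len(xs)
    sum_sq_x = sum_sq_y = sum_coproduct = 0.0
    mean_x, mean_y = xs[0], ys[0]
    for i in range(2, n + 1):
        sweep = (i - 1.0) / i
        delta_x = xs[i - 1] - mean_x
        delta_y = ys[i - 1] - mean_y
        sum_sq_x += delta_x * delta_x * sweep
        sum_sq_y += delta_y * delta_y * sweep
        sum_coproduct += delta_x * delta_y * sweep
        mean_x += delta_x / i
        mean_y += delta_y / i
    pop_sd_x = sqrt(sum_sq_x / n)
    pop_sd_y = sqrt(sum_sq_y / n)
    return (sum_coproduct / n) / (pop_sd_x * pop_sd_y)

def raw_sums_correlation(xs, ys):
    """Single-pass formula built from raw sums; prone to cancellation."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    rad_x = n * sum(x * x for x in xs) - sum_x ** 2
    rad_y = n * sum(y * y for y in ys) - sum_y ** 2
    if rad_x <= 0 or rad_y <= 0:
        return float("nan")  # cancellation destroyed the variance estimate
    return (n * sum_xy - sum_x * sum_y) / (sqrt(rad_x) * sqrt(rad_y))

xs = [float(900_000_000 + i) for i in range(1, 101)]
ys = [float(900_000_000 - i) for i in range(1, 101)]

r_stable = stable_correlation(xs, ys)
r_raw_sums = raw_sums_correlation(xs, ys)
print("running means:", r_stable)
print("raw sums:     ", r_raw_sums)
```

The exact answer is −1. The running-mean version should reproduce it closely, whereas the raw-sums version subtracts quantities near 8×10<sup>21</sup> whose difference is only a few million, so its output cannot be trusted in double precision.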

