Linear Regression




In statistics, linear regression is a method of estimating the conditional expected value of one variable ''y'' given the values of some other variable or variables ''x''. The variable of interest, ''y'', is conventionally called the "response variable"; the terms "endogenous variable" and "output variable" are also used. The other variables ''x'' are called the explanatory variables; the terms "exogenous variables", "input variables", and "predictor variables" are also used. The term "independent variables" is sometimes used, but should be avoided because the variables are not necessarily statistically independent. The explanatory and response variables may be scalars or vectors. ''Multiple regression'' includes cases with more than one explanatory variable.

The term ''explanatory'' variable suggests that its value can be chosen at will, and that the ''response'' variable is an effect, i.e., causally dependent on the explanatory variable, as in a stimulus-response model. Although many linear regression models are formulated as models of cause and effect, the direction of causation may just as well go the other way, or there need not be any causal relation at all. For that reason, one may prefer the terms "predictor / response" or "exogenous / endogenous," which do not imply causality.

''Regression'', in general, is the problem of estimating a conditional expected value.

It is often erroneously thought that the reason the technique is called "linear regression" is that the graph of y = \alpha + \beta x is a line. But in fact, if the model is

: y_i = \alpha + \beta x_i + \gamma x_i^2 + \epsilon_i

(in which case we have put the vector (x_i, x_i^2) in the role formerly played by x_i and the vector (\beta, \gamma) in the role formerly played by \beta), then the problem is still one of linear regression, even though the graph is not a straight line. The rationale for this terminology will be explained below.

Linear regression is called "linear" because the relation of the response to the explanatory variables is assumed to be a linear function of some parameters. Regression models which are not a linear function of the parameters are called nonlinear regression models. A neural network is an example of a nonlinear regression model.

Still more generally, regression may be viewed as a special case of density estimation. The joint distribution of the response and explanatory variables can be constructed from the conditional distribution of the response variable and the marginal distribution of the explanatory variables. In some problems, it is convenient to work in the other direction: from the joint distribution, the conditional distribution of the response variable can be derived.
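The point about linearity in the parameters can be illustrated with a minimal sketch. It fits the quadratic model above by ordinary least squares, treating 1, x and x^2 as three columns of a design matrix. The data values and the use of NumPy's `numpy.linalg.lstsq` are assumptions made for illustration; they are not part of the original text.

```python
import numpy as np

# Hypothetical noiseless data generated from y = 1 + 2x + 3x^2.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x + 3.0 * x**2

# Design matrix with columns 1, x, x^2. The curve is not a straight line,
# but the model is linear in the parameters (alpha, beta, gamma).
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares solves the same kind of linear problem
# as it would for a straight-line model.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha, beta, gamma = coef
```

Because the data here were generated without error, the fitted `alpha`, `beta` and `gamma` reproduce the generating values up to rounding.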


HISTORICAL REMARKS


The earliest form of linear regression was the method of least squares, which was published by Legendre in 1805, and by Gauss in 1809. The term "least squares" is from Legendre's term, ''moindres carrés''. However, Gauss claimed that he had known the method since 1795.

Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the sun. Euler had worked on the same problem (1748) without success. Gauss published a further development of the theory of least squares in 1821, including a version of the Gauss-Markov theorem.

The term "reversion" was used in the nineteenth century to describe a biological phenomenon, namely that the progeny of exceptional individuals tend on average to be less exceptional than their parents, and more like their more distant ancestors. Francis Galton studied this phenomenon, and applied the slightly misleading term "regression towards mediocrity" to it (parents of exceptional individuals also tend on average to be less exceptional than their children). For Galton, regression had only this biological meaning, but his work (1877, 1885) was extended by Karl Pearson and Udny Yule to a more general statistical context (1897, 1903). In the work of Pearson and Yule, the joint distribution of the response and explanatory variables is assumed to be Gaussian. This assumption was weakened by R.A. Fisher in his works of 1922 and 1925. Fisher assumed that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this respect, Fisher's assumption is closer to Gauss's formulation of 1821.


STATEMENT OF THE LINEAR REGRESSION MODEL


A linear regression model is typically stated in the form

: y = \alpha + \beta x + \varepsilon. \,

The right-hand side may take other forms, but generally comprises a linear combination of the parameters, here denoted α and β. The term ε represents the unpredicted or unexplained variation in the response variable; it is conventionally called the "error" whether it is really a measurement error or not. The error term is conventionally assumed to have expected value equal to zero, as a nonzero expected value could be absorbed into α. See also errors and residuals in statistics; the difference between an error and a residual is also dealt with below. It is also assumed that ε is independent of x.
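The remark that a nonzero expected error can be absorbed into α can be checked with a small sketch. The data are hypothetical, and the closed-form estimates used are the standard least-squares formulas for a single explanatory variable, which this text does not itself state. The error term is given a mean of 0.5, and its deviations from that mean are chosen to sum to zero and to be uncorrelated with x, so that the result is exact.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

# Hypothetical error term with mean 0.5 rather than zero.
deviations = 0.1 * np.array([1.0, -2.0, 0.0, 2.0, -1.0])
errors = 0.5 + deviations
y = 2.0 + 3.0 * x + errors  # generating values: alpha = 2, beta = 3

# Standard least-squares estimates for one explanatory variable.
x_mean, y_mean = x.mean(), y.mean()
beta_hat = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
alpha_hat = y_mean - beta_hat * x_mean

# Residuals are computed from the fitted line, not from the true one.
residuals = y - (alpha_hat + beta_hat * x)
```

In this construction the fitted intercept is 2.5, not 2: the mean of the error has moved into the estimate of α, while the slope is unaffected. The residuals equal `deviations`, not `errors`, which is one way to see the distinction between an error and a residual.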

An equivalent formulation which explicitly shows the linear regression as a model of conditional expectation is

: \operatorname{E}(y \mid x) = \alpha + \beta x. \,

The statistic ''r''² is a measure of how well a straight line describes the data. Values near zero suggest that the model is ineffective. ''r''² is frequently interpreted as the fraction of the variability in the response explained by the explanatory variable, ''X''.
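As a hedged illustration of the "fraction of variability explained" reading, the sketch below takes the statistic to be the squared sample correlation coefficient (an assumption, since its definition does not appear in this text). It computes the statistic in two ways for a straight-line fit with an intercept: as the squared correlation, and as one minus the ratio of residual to total sum of squares. The data values are hypothetical.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Squared sample correlation between x and y.
r = np.corrcoef(x, y)[0, 1]
r_squared = r**2

# Least-squares straight line; polyfit returns [slope, intercept] for degree 1.
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x

# Fraction of the total variability in y accounted for by the fitted line.
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
explained_fraction = 1.0 - ss_res / ss_tot
```

For a straight-line least-squares fit with an intercept the two quantities agree, which is what justifies the interpretation quoted above.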


MULTIPLE LINEAR REGRESSION


Linear regression can be extended to functions of two or more variables, for example

: Z = a X + b Y + c + \varepsilon. \,

Here ''X'' and ''Y'' are explanatory variables.
The values of the parameters ''a'', ''b'' and ''c'' are estimated by the method of least squares, which minimizes the sum of squares of the residuals

: S_r = \sum_{i=1}^n (a x_i + b y_i + c - z_i)^2. \,

Take the derivatives of S_r with respect to ''a'', ''b'' and ''c'' and set them equal to zero. This leads to a system of three linear equations, the normal equations, whose solution gives the estimates of the parameters ''a'', ''b'' and ''c'':

:
\begin{bmatrix}
\sum_{i=1}^n x_i^2 & \sum_{i=1}^n x_i y_i & \sum_{i=1}^n x_i \\
\sum_{i=1}^n x_i y_i & \sum_{i=1}^n y_i^2 & \sum_{i=1}^n y_i \\
\sum_{i=1}^n x_i & \sum_{i=1}^n y_i & n
\end{bmatrix}
\begin{bmatrix}
a \\
b \\
c
\end{bmatrix}
=
\begin{bmatrix}
\sum_{i=1}^n x_i z_i \\
\sum_{i=1}^n y_i z_i \\
\sum_{i=1}^n z_i
\end{bmatrix}
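The normal equations above can be assembled and solved directly. The following is a minimal sketch with hypothetical data, in which z is generated exactly as 2x - y + 3 so that the recovered parameters are easy to check. `numpy.linalg.solve` is used as one convenient solver and is not prescribed by the text.

```python
import numpy as np

# Hypothetical data: z = 2x - y + 3 exactly, so we expect a = 2, b = -1, c = 3.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
z = 2.0 * x - y + 3.0
n = len(x)

# Coefficient matrix and right-hand side of the normal equations.
A = np.array([
    [np.sum(x * x), np.sum(x * y), np.sum(x)],
    [np.sum(x * y), np.sum(y * y), np.sum(y)],
    [np.sum(x),     np.sum(y),     n],
])
rhs = np.array([np.sum(x * z), np.sum(y * z), np.sum(z)])

# Solve the 3-by-3 linear system for the parameter estimates.
a, b, c = np.linalg.solve(A, rhs)

# Sum of squared residuals S_r at the solution.
S_r = np.sum((a * x + b * y + c - z) ** 2)
```

With real data the system can be badly conditioned when the explanatory variables are nearly collinear; in that case a least-squares routine that works on the data matrix itself, such as `numpy.linalg.lstsq`, is usually the safer choice.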



SCIENTIFIC APPLICATIONS OF REGRESSION


Linear regression is widely used in the biological and behavioral sciences to describe relationships between variables, and it ranks as one of the most important tools used in these disciplines. For example, early evidence relating cigarette smoking to mortality came from studies employing regression. Researchers usually include several variables in their regression analysis in an effort to remove factors that might produce spurious correlations. In the cigarette smoking example, researchers might include socio-economic status in addition to smoking to ensure that any observed effect of smoking on mortality is not due to some effect of education or income. However, it is never possible to include all possible confounding variables in a study employing regression. In the smoking example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, randomized experiments are considered to be more trustworthy than a regression analysis of observational data.
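The confounding argument can be made concrete with a toy simulation. Everything in it is an assumption made for illustration: the variable names, the Gaussian noise, and above all the premise that smoking has no direct effect on mortality. That premise is chosen only so that any estimated effect is visibly spurious. It is not a claim about real data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical confounder (the "gene") that raises both smoking and mortality.
gene = rng.normal(size=n)
smoking = gene + rng.normal(size=n)
# By construction, smoking has NO direct effect on mortality in this toy model.
mortality = gene + rng.normal(size=n)

def ols(columns, response):
    """Least-squares coefficients for an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(response)), *columns])
    coef, *_ = np.linalg.lstsq(X, response, rcond=None)
    return coef

naive = ols([smoking], mortality)           # confounder omitted
adjusted = ols([smoking, gene], mortality)  # confounder included

naive_slope = naive[1]        # spurious: roughly cov/var = 1/2 in this setup
adjusted_slope = adjusted[1]  # roughly zero once the gene is included
```

When the gene is left out, the regression attributes part of its effect to smoking; including it removes the spurious slope. In an observational study the analyst may not know that such a variable exists, which is the limitation described above.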

