Correlation and Regression
Correlation shows the relation between two variables, e.g. X and Y.
Regression takes this one step further. It predicts or estimates a score for an individual on one variable based on his/her/its score on the other variable and on the correlation between the two variables. Usually in a bivariate (two-variable) situation one uses a regression line to predict Y from X. If an individual scores Xi on variable X, what is his/her/its most likely score on Y? One can use the same data to predict X from Y. The equation to do so is analogous, but it is not the same equation.
Linear Regression
Regression is used to predict the score for an individual on one variable, Y, based on his or her score on the other, X.
For linear regression, we compute (and plot) the best-fit straight line between the two variables depicted in a scatterplot. The best-fit line is defined as the line that "minimizes the squared deviations from the line," that is, the sum of the squared vertical (Y) distances between the data points and the line.
The Y value of the regression line for an X value is the predicted score for anyone who scored that X. Most individuals who score a given X do not find their Y value exactly on the line, but the Y on the line is the best prediction because it leads to the smallest average error based on all of the X values in the set.
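To make "minimizes the squared deviations" concrete, here is a minimal sketch in Python. The five (X, Y) pairs are made up for illustration, and the coarse grid of candidate lines is an assumption chosen only to keep the search small; it is not how regression lines are normally computed.

```python
# Made-up scores for five individuals (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def sum_squared_deviations(a, b):
    """Sum of squared vertical distances from the points to the line Y' = a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Candidate lines: constants 0.0 to 5.0 and slopes -1.0 to 2.0, in steps of 0.1.
candidates = [(i / 10, j / 10) for i in range(0, 51) for j in range(-10, 21)]

# The best-fit line is the candidate with the smallest sum of squared deviations.
best_a, best_b = min(candidates, key=lambda ab: sum_squared_deviations(*ab))
print(best_a, best_b, round(sum_squared_deviations(best_a, best_b), 3))
```

With these data the search should settle on a = 2.2 and b = 0.6; every other candidate line on the grid gives a larger sum of squared deviations.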
Regression Line
The form of a linear equation is Y = a + bX. If we know the values of a and b we have the equation. Regression is mathematically closely related to correlation because the statistics used to compute correlation are also used in regression.
In the regression equation, $b = r \frac{s_Y}{s_X}$ and $a = \bar{Y} - b\bar{X}$.
The five parameters of the line, $\bar{X}$, $s_X$, $\bar{Y}$, $s_Y$, and $r$, are the same as in a correlation between the two sets of scores. This relation between correlation and regression is often used to compute the regression line, but there are also other ways to compute a and b.
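As a rough sketch of how the line follows from those five statistics, the Python below uses the standard relations b = r(s_Y / s_X) and a = Ȳ − bX̄. The function name and the summary values are hypothetical, chosen only to give round numbers.

```python
def line_from_statistics(mean_x, sd_x, mean_y, sd_y, r):
    """Return (a, b) for Y' = a + bX from the five summary statistics."""
    b = r * sd_y / sd_x       # slope
    a = mean_y - b * mean_x   # constant (Y intercept)
    return a, b

# Hypothetical summary statistics, not taken from any real data set.
a, b = line_from_statistics(mean_x=50, sd_x=10, mean_y=100, sd_y=15, r=0.6)

# Predicted Y for an individual who scored X = 60.
predicted = a + b * 60
print(round(a, 2), round(b, 2), round(predicted, 2))
```

Note that no raw scores are needed here: the two means, the two standard deviations, and r are enough to write the equation.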
Linear equations
Here is an example of a linear equation, Y = 0.5X + 2, plotted on a rectangular coordinate system. Every point on the plane that satisfies this equation falls on the line. Substitute any value for X in the equation and solve for Y. The X, Y pair will fall exactly on this line. The line could in principle be extended in both directions for as far as necessary; thus X = 2547 gives Y = 1275.5 and X = -30 gives Y = -13, and both of these points fall on the extended line.
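The two substitutions above can be checked with a few lines of Python. The function name is just a label for this sketch.

```python
def y_on_line(x, a=2.0, b=0.5):
    """Y value on the line Y = a + bX (defaults give Y = 0.5X + 2)."""
    return a + b * x

for x in (2547, -30):
    print(x, y_on_line(x))
```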
In the general linear equation, Y = a + bX, b is the slope of the line and a is the constant (the Y intercept).
[Figure: scatter plot]
For the regression line, $a = \bar{Y} - b\bar{X}$ and $b = \frac{SP}{SS_X}$. I think of b as the sum of the crossproducts, SP, divided by the sum
of squares in X, $SS_X$. [These formulae
are used to estimate the Y values when given the X values. To estimate X
values from Y values, reverse the X and Y. Wherever there is an X in a formula
substitute a Y and vice versa.] Let's compute a regression equation:
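As one illustration, here is a minimal Python sketch that computes a regression equation from raw scores using b = SP / SS_X and a = Ȳ − bX̄. The five (X, Y) pairs are made up for this sketch. Calling the same function with X and Y swapped gives the line for predicting X from Y.

```python
def regression_line(xs, ys):
    """Return (a, b) for predicting ys from xs: Y' = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))  # sum of crossproducts
    ss_x = sum((x - mean_x) ** 2 for x in xs)                      # sum of squares in X
    b = sp / ss_x
    a = mean_y - b * mean_x
    return a, b

# Made-up scores for five individuals (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a_yx, b_yx = regression_line(xs, ys)   # predict Y from X
a_xy, b_xy = regression_line(ys, xs)   # predict X from Y: roles reversed
print(round(a_yx, 3), round(b_yx, 3))
print(round(a_xy, 3), round(b_xy, 3))
```

For these data the first line is approximately Y' = 2.2 + 0.6X and the second is approximately X' = -1 + 1.0Y. Solving the first equation for X would give a slope of 1/0.6, not 1.0, which is why the two equations are analogous but not the same.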
Interpretation of regression
What is the value of using regression for prediction?
It allows you to predict an individual's score on one measure much more accurately
than you could otherwise. Regression allows practitioners and researchers
to infer how an individual will score on some measure based on the analysis
of a sample. Thus the interpretation of regression has a home in inferential
statistics.
If you want to predict what some individual's Y
value will be, and you know nothing about him/her/it, your best prediction
is $\bar{Y}$, the mean of the Y distribution. If the relationship between Y and X is
a strong one, and you know the individual's X value, using the regression
line you can predict the Y value much more accurately. The regression equation
gives you a valid basis for predicting different Y values for different individuals.
The predicted scores will fall closer to the actual scores. Mathematically,
the average distance between your predicted values and the actual scores
is much smaller than it would be if you did not use the regression line.
How much smaller is a function of $r^2$. The larger $r^2$
is, the smaller the error and the better the prediction.
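A small Python sketch of that comparison follows. It takes "average distance" to mean the average absolute difference between predicted and actual scores, which is one simple reading and an assumption of this example. The scores are made up, and a = 2.2, b = 0.6 are the least-squares values for these five pairs.

```python
# Made-up scores for five individuals (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6  # least-squares constant and slope for these five pairs

mean_y = sum(ys) / len(ys)

# Knowing nothing about X: predict the mean of Y for everyone.
error_using_mean = sum(abs(y - mean_y) for y in ys) / len(ys)

# Knowing X: predict Y' = a + bX for each individual.
error_using_line = sum(abs(y - (a + b * x)) for x, y in zip(xs, ys)) / len(ys)

print(round(error_using_mean, 3), round(error_using_line, 3))
```

The second number should come out smaller than the first; with a stronger correlation the gap would be wider.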
There are measures associated with regression that
identify how good the prediction is. These are analogues of the
variance and the standard deviation in univariate statistics.
The error variance is essentially the best estimate
of the average of the squared deviations of scores from the regression line.
Its definitional formula is
$s^2_{Y|X} = \frac{\sum (Y - Y')^2}{N - 2}$. The positive square root of the error variance is called
the standard error of estimate,
$s_{Y|X}$. The smaller these measures are, the greater the accuracy of estimate.
The standard error can be used in much the same way that a standard deviation
can be used. If the samples are relatively large and approximately normally
distributed, we can use t distributions (analogous to z distributions) to
figure out what proportion of actual scores fall more than a certain distance
from their predicted values on the regression line.
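As a hedged sketch, the error variance and the standard error of estimate can be computed directly from the deviations around the line. The code assumes the common N − 2 denominator; some texts divide by N instead. The scores are made up, and a = 2.2, b = 0.6 are the least-squares values for these five pairs.

```python
import math

# Made-up scores for five individuals (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6  # least-squares constant and slope for these five pairs

# Deviations of the actual scores from the regression line.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

ss_error = sum(e ** 2 for e in residuals)
error_variance = ss_error / (len(xs) - 2)   # assumes the N - 2 denominator
standard_error = math.sqrt(error_variance)

print(round(error_variance, 3), round(standard_error, 3))
```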
The total variability in Y can be thought of as
having two components. We have been discussing the part you cannot predict
(the error variability) because the actual value does not lie on the regression
line. In addition, part of the variability you can predict by using the regression
equation. Y' changes with different Xs. This variability in the predicted
scores is a part of the variability in Y. We can represent that variability
by
$SS_{regression}$. It is defined similarly to other SSs:
$SS_{regression} = \sum (Y' - \bar{Y})^2$.
There is an important and useful relationship between
the SSs:
$SS_Y = SS_{regression} + SS_{error}$.
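The split of the Y variability into a predictable part and an error part can be verified numerically. This is a minimal sketch with made-up scores; the variable names are labels for this example, and a = 2.2, b = 0.6 are the least-squares values for these five pairs.

```python
# Made-up scores for five individuals (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6  # least-squares constant and slope for these five pairs

mean_y = sum(ys) / len(ys)
predicted = [a + b * x for x in xs]

ss_y = sum((y - mean_y) ** 2 for y in ys)                  # total variability in Y
ss_regression = sum((p - mean_y) ** 2 for p in predicted)  # variability of the predicted scores
ss_error = sum((y - p) ** 2 for y, p in zip(ys, predicted))  # variability around the line

print(round(ss_y, 3), round(ss_regression, 3), round(ss_error, 3))
```

The first number should equal the sum of the other two, up to rounding. The identity holds only for the least-squares line, not for an arbitrary line through the data.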
These formulae help to explain conceptually
what the regression line does. They are almost never used for computation.
There are much easier ways of computing them.
The most common computations use r, or more precisely
$r^2$, in relation to
$SS_Y$.
The variance is basically the average of the squared
deviations from the mean. The error variance is basically the average of
the squared deviations from the regression line. The standard deviation is
the square root of the variance. The standard error is the square root of
the error variance. If we find the equivalent of the sum of squares of the
deviations from the regression line, the rest should be easy.
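$SS_{error}$ and
$SS_{regression}$ are analogous to
$SS_Y$.
$SS_{error}$ is the sum of squares around the regression line.
$SS_{regression}$ is the sum of squares of the regression line.
$SS_{regression} = r^2 SS_Y$.
$SS_{error} = (1 - r^2) SS_Y$, and
$s_{Y|X} = \sqrt{SS_{error} / (N - 2)}$.
$r^2 = SS_{regression} / SS_Y$, but this formula is not used very much, although
$r^2$ is often used as the proportion of the variability in Y accounted for
by X.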
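The shortcut route through r can be sketched as follows: given only SS_Y, r, and N, the two component sums of squares and the standard error of estimate follow from r². The function name is hypothetical, the N − 2 denominator is an assumption, and the input values come from a made-up data set of five pairs (SP = 6, SS_X = 10, SS_Y = 6).

```python
import math

def components_from_r(ss_y, r, n):
    """Return (SS_regression, SS_error, standard error of estimate) from SS_Y, r, and N."""
    r_squared = r ** 2
    ss_regression = r_squared * ss_y
    ss_error = (1 - r_squared) * ss_y
    standard_error = math.sqrt(ss_error / (n - 2))  # assumes the N - 2 denominator
    return ss_regression, ss_error, standard_error

# Values from a made-up data set of five pairs: SP = 6, SS_X = 10, SS_Y = 6.
r = 6 / math.sqrt(10 * 6)   # r = SP / sqrt(SS_X * SS_Y)
ss_regression, ss_error, standard_error = components_from_r(ss_y=6, r=r, n=5)
print(round(ss_regression, 3), round(ss_error, 3), round(standard_error, 3))
```

No predicted scores or deviations are computed here, which is why this route is usually the easier one by hand.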
Our text uses a direct formula (p. 142) to compute
$s_{Y|X}$ from the sums of squares of X and Y and the crossproduct
of X and Y.
Regression and correlation
There is one more relationship
that should be mentioned. The correlation coefficient, r, can be used as
the slope of a regression line. If you convert all of the scores in both X
and Y to their respective z scores and compute a regression line on them,
the result is simple and neat,
$z_{Y'} = r z_X$. In z score units the slope is r and the line goes through (0,0), so the
constant is zero. By a little algebra you can derive the regression line from
this formula. Also in z score units $r^2$
is both the proportion of the variance accounted for and a measure
of the variability of
$z_{Y'}$. In analogous fashion $1 - r^2$ is
a measure of the variability around the regression line.
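A quick numerical check of the z score result, as a sketch: convert made-up X and Y scores to z scores, fit the least-squares line to the z scores, and compare its slope with r computed from the raw scores. The helper names are hypothetical, and the sample standard deviation is used for the z scores (the result is the same with the population version as long as both variables use the same one).

```python
import math
import statistics

def z_scores(values):
    """Convert raw scores to z scores using the sample standard deviation."""
    m = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - m) / sd for v in values]

def regression_line(xs, ys):
    """Return (a, b) for Y' = a + bX using b = SP / SS_X."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    b = sp / ss_x
    return mean_y - b * mean_x, b

# Made-up scores for five individuals (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Correlation from the raw scores: r = SP / sqrt(SS_X * SS_Y).
mx, my = statistics.mean(xs), statistics.mean(ys)
sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
r = sp / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

# Regression line fitted to the z scores.
constant, slope = regression_line(z_scores(xs), z_scores(ys))
print(round(r, 4), round(slope, 4), round(constant, 4))
```

The slope of the z score line should match r, and the constant should be zero up to floating-point rounding.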
Summary on regression
We have seen how to calculate regression from data
and from five basic statistics: $\bar{X}$, $s_X$, $\bar{Y}$, $s_Y$, and $r$.
We have seen how regression can be used for prediction.
Using regression as a frame we have seen how $r^2$
can be used as a measure of variability in Y accounted for by X and $1 - r^2$
as a measure of "error" variability.
We have seen that Y variability can be separated
into that accounted for by X and that not accounted for by X.
We can relate b with r, and we have seen that r
is the slope of the regression line if X and Y are transformed to z-scores.
We can find the standard error of estimate using
the standard deviation or sum of squares in Y and the correlation between
X and Y.
Here is a regression game: the effect of adding one additional data point to a regression line. It's from Webster West at the University of South Carolina. Look at the regression equation and how it changes when you put an additional point in different places. How small a correlation can you get? What is the largest it can be?