Psy 207B Introduction to Statistics 
Regression

Erwin Segal
Homework on homework page
Correlation and Regression
Correlation shows the relation between two variables, e.g. X and Y
Regression takes this one step further. It predicts or estimates a score for an individual on one variable based on his/her/its score on the other variable, and on the correlation between the two variables. Usually in a bivariate (two-variable) situation one uses a regression line to predict Y from X. If an individual scores Xi on variable X, what is his/her/its most likely score on Y? One can use the same data to predict X from Y. The equation to do so is analogous, but it is not the same equation.
Linear Regression
Regression is used to predict the score for an individual on one variable, Y, based on his or her score on the other, X.
For linear regression, we compute (and plot) the best-fit straight line between the two variables depicted in a scatterplot. The best-fit line is defined as the line that minimizes the sum of the squared deviations of the Y values from the line.
The Y value of the regression line for an X value is the predicted score for anyone who scored that X. Most individuals who score a given X do not find their Y value exactly on the line, but the Y on the line is the best prediction because it leads to the smallest average squared error over all of the (X, Y) pairs in the set.
Regression Line
The form of a linear equation is Y = a + bX. If we know the values of a and b, we have the equation. Regression is mathematically closely related to correlation because the statistics used to compute a correlation are also used in regression.
In the regression equation: $b = r\,\frac{s_Y}{s_X}$, and $a = \bar{Y} - b\bar{X}$.
The five parameters of the line, $\bar{X}$, $\bar{Y}$, $s_X$, $s_Y$, and $r$, are the same as in a correlation between the two sets of scores. This relation between correlation and regression is often used to compute the regression line, but there are also other ways to compute a and b.
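As a rough illustration, the slope and intercept can be computed from the five statistics, assuming the usual relations b = r(s_Y / s_X) and a = mean(Y) - b * mean(X). The scores below are hypothetical, invented only for this sketch, and the variable names are not from the course materials.

```python
import math

# Hypothetical scores, invented for illustration; they are not course data.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)

# The five statistics: two means, two standard deviations, and r.
mean_x = sum(X) / n
mean_y = sum(Y) / n
ss_x = sum((x - mean_x) ** 2 for x in X)
ss_y = sum((y - mean_y) ** 2 for y in Y)
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
s_x = math.sqrt(ss_x / (n - 1))  # sample standard deviation of X
s_y = math.sqrt(ss_y / (n - 1))  # sample standard deviation of Y
r = sp / math.sqrt(ss_x * ss_y)  # Pearson correlation

# Slope and intercept from the five statistics.
b = r * s_y / s_x
a = mean_y - b * mean_x

print(f"Y' = {a:.2f} + {b:.2f}X")  # prints Y' = 2.20 + 0.60X
```

Using N - 1 in both standard deviations is a choice made here; the ratio s_Y / s_X comes out the same if N is used in both.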
Linear equations
Here is an example of a linear equation, Y = 0.5X + 2, plotted on a rectangular coordinate system. Every point on the plane that satisfies this equation falls on the line. Substitute any value for X in the equation and solve for Y. The X, Y pair will fall exactly on this line. The line could in principle be extended in both directions for as far as necessary; thus X = 2547 gives Y = 1275.5 and X = -30 gives Y = -13. Both of these points fall on the extended line.
In the general linear equation,  Y=a+bX,
a is called the intercept; it is the value of Y when X is zero. If we solve the above equation for Y when X = 0, we get 2, the value of the constant in the equation.
b is called the slope. It is the amount that Y changes as the value of X goes up one unit, e.g. from 13 to 14, or from .65 to 1.65.
In a linear equation the change in Y is the same for two equal changes in X anywhere on the line. The change in Y divided by the change in X is equal to a constant, and that constant is b. Using this notion, $b = \frac{Y_2 - Y_1}{X_2 - X_1}$.

If we solve the above equation for Y when X = 2, we find Y = 3. Also we know that if X = 0, Y = 2. We thus have two points on the line, (2,3) and (0,2). If we find the difference between the two Y values and divide that by the difference between the two X values we have (3-2)/(2-0) = 1/2 or .5, which equals b in the above example.
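The arithmetic for the example line Y = 0.5X + 2 can be checked with a short sketch. The function names below are made up for illustration.

```python
def y_on_line(x, a=2.0, b=0.5):
    """Return Y for the line Y = a + bX (the defaults give Y = 0.5X + 2)."""
    return a + b * x


def slope_between(x1, x2):
    """Change in Y divided by change in X between two points on the line."""
    return (y_on_line(x2) - y_on_line(x1)) / (x2 - x1)


print(y_on_line(0))          # the intercept a
print(y_on_line(2547))       # 1275.5
print(y_on_line(-30))        # -13.0
print(slope_between(0, 2))   # (3 - 2) / (2 - 0)
print(slope_between(13, 14)) # a one-unit step elsewhere on the line
```

Whichever pair of X values is chosen, the ratio comes out as b, which is what makes the equation linear.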
Scatter plot
[Figure: scatterplot showing deviations from the regression line]
Let's discuss the graph of a regression line and the criterion for identifying one. The (X,Y) pairs when plotted are not likely to fall exactly on any straight line on the graph. That is, no straight line will go through all of the points. For regression we want to find the straight line that fits better than any other.

How do we determine "best fit"? There are several possibilities. We will explore only two, based on an analogy with the mean and variance. The first idea is to find a line where the sum of the deviations from the line of all the Y values for each (X,Y) pair equals zero. Mathematically, $\sum (Y - Y') = 0$. (Y' is the Y value of the line where the X of the pair crosses it.) The sum does equal zero for the regression line, but the sum of the deviation scores equals zero for more than one line, and some of these lines do not look anything like a best fit. Instead, we sum all of the squares of the deviation scores, $\sum (Y - Y')^2$, and find the line that gives the smallest possible value based on these data. This is the regression line.
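A small sketch can show why a zero sum of deviations is not enough. The data are hypothetical, and the three candidate lines were picked by hand for illustration: the least-squares line for these data (a = 2.2, b = 0.6), a flat line at the mean of Y, and a downhill line through the point of means.

```python
# Hypothetical scores, invented for illustration; they are not course data.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]


def sum_dev(a, b):
    """Sum of deviations of each Y from the line Y' = a + bX."""
    return sum(y - (a + b * x) for x, y in zip(X, Y))


def sum_sq_dev(a, b):
    """Sum of squared deviations of each Y from the line Y' = a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))


candidates = {
    "least-squares line": (2.2, 0.6),
    "flat line at mean of Y": (4.0, 0.0),
    "downhill line": (7.0, -1.0),
}

for name, (a, b) in candidates.items():
    print(f"{name}: sum of deviations = {sum_dev(a, b):.2f}, "
          f"sum of squares = {sum_sq_dev(a, b):.2f}")
```

All three lines pass through the point of means, so their deviations sum to zero (up to rounding), yet their sums of squared deviations differ widely. Only the squared criterion singles out one line.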
The set of formulae given above based on the correlation coefficient and the means and standard deviations of the two measures can be used to compute the regression line.
Y = a + bX, with $b = r\,\frac{s_Y}{s_X}$ and $a = \bar{Y} - b\bar{X}$.
The constant a is usually computed from b and the means of the two sets of scores. b, however, has its own derivational history and formula. One definition of b is $b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$. I think of it as the sum of the crossproducts, SP, divided by the sum of squares in X, $SS_X$. [These formulae are used to estimate the Y values when given the X values. To estimate X values from Y values, reverse the X and Y. Wherever there is an X in a formula substitute a Y and vice versa.]
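The two components of b are not that difficult to compute. You learned the formula for the sum of squares, SS, when you computed variances and standard deviations, $SS_X = \sum X^2 - \frac{(\sum X)^2}{N}$. The sum of the products, SP, is analogous to this: you simply put in a Y instead of the square in X, $SP = \sum XY - \frac{(\sum X)(\sum Y)}{N}$. Compare $\sum (X - \bar{X})(X - \bar{X})$ to $\sum (X - \bar{X})(Y - \bar{Y})$, and $\sum X^2 - \frac{(\sum X)^2}{N}$ to $\sum XY - \frac{(\sum X)(\sum Y)}{N}$.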
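The parallel between SS and SP may be easier to see in code. This sketch computes each one twice, once from deviation scores and once from raw sums, on hypothetical data. The two forms used are the standard definitional and computational formulas, which is an assumption about what the notes intend.

```python
# Hypothetical scores, invented for illustration; they are not course data.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Definitional (deviation-score) forms.
ss_x_def = sum((x - mean_x) * (x - mean_x) for x in X)
sp_def = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))

# Computational (raw-score) forms: SP swaps one X for a Y in each term.
ss_x_comp = sum(x * x for x in X) - sum(X) * sum(X) / n
sp_comp = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / n

b = sp_comp / ss_x_comp  # slope as SP divided by SS_X

print(ss_x_def, ss_x_comp)
print(sp_def, sp_comp)
print(b)
```

Both routes should agree apart from floating-point rounding. The raw-score form is usually the easier one for hand computation.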

Let's compute a regression equation:
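The worked example that belonged here is not available, so the following is a stand-in sketch on hypothetical data, not the original example. It assumes b = SP / SS_X and a = mean(Y) - b * mean(X). The helper name `fit_line` is made up for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    b = sp / ss_x
    a = mean_y - b * mean_x
    return a, b


# Hypothetical scores, invented for illustration; they are not course data.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

a, b = fit_line(X, Y)            # predict Y from X
predicted_y = a + b * 6          # predicted Y for someone who scored X = 6

a_rev, b_rev = fit_line(Y, X)    # predict X from Y by swapping the roles

print(f"Y' = {a:.2f} + {b:.2f}X")
print(f"Predicted Y at X = 6: {predicted_y:.2f}")
print(f"X' = {a_rev:.2f} + {b_rev:.2f}Y")
```

Swapping the arguments gives the line for predicting X from Y. For these data it has a different slope and intercept, which illustrates that the two regression equations are analogous but not the same.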

Interpretation of regression
What is the value of using regression for prediction? It allows you to predict an individual's score on one measure much more accurately than you could otherwise. Regression allows practitioners and researchers to infer how an individual will score on some measure based on the analysis of a sample. Thus the interpretation of regression has a home in inferential statistics.
If you want to predict what some individual's Y value will be, and you know nothing about him/her/it, your best prediction is $\bar{Y}$, the mean of the Y distribution. If the relationship between Y and X is a strong one, and you know the individual's X value, using the regression line you can predict the Y value much more accurately. The regression equation gives you a valid basis for predicting different Y values for different individuals. The predicted scores will fall closer to the actual scores. Mathematically, the average distance between your predicted values and the actual scores is much smaller than it would be if you did not use the regression line. How much smaller is a function of $r^2$. The larger $r^2$ is, the smaller the error and the better the prediction.
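To make the gain in accuracy concrete, this sketch compares two ways of predicting Y on hypothetical data: always guessing the mean of Y, and using the least-squares line. It measures error as the mean squared deviation, which is a choice made here because it ties directly to r squared. Other distance measures would give different numbers.

```python
# Hypothetical scores, invented for illustration; they are not course data.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

ss_x = sum((x - mean_x) ** 2 for x in X)
ss_y = sum((y - mean_y) ** 2 for y in Y)
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))

b = sp / ss_x
a = mean_y - b * mean_x
r_squared = sp ** 2 / (ss_x * ss_y)

# Mean squared error when every prediction is the mean of Y.
mse_mean = sum((y - mean_y) ** 2 for y in Y) / n
# Mean squared error when predictions come from the regression line.
mse_line = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y)) / n

print(f"r squared       = {r_squared:.2f}")
print(f"MSE using mean  = {mse_mean:.2f}")
print(f"MSE using line  = {mse_line:.2f}")
print(f"ratio line/mean = {mse_line / mse_mean:.2f}")
```

For a least-squares line the ratio of the two errors equals 1 minus r squared, so a larger r squared means a larger reduction in squared error.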
There are measures associated with regression that identify how good the prediction is. These are based on analogies of the variance and the standard deviation in univariate statistics.
The error variance is essentially the best estimate of the average of the squared deviations of scores from the regression line. Its definitional formula is $s_{y|x}^2 = \frac{\sum (Y - Y')^2}{N - 2}$. The positive square root of the error variance $s_{y|x}^2$ is called the standard error of estimate, $s_{y|x} = \sqrt{s_{y|x}^2}$. The smaller these measures are, the greater the accuracy of estimate. The standard error can be used in much the same way that a standard deviation can be used. If the samples are relatively large and approximately normally distributed, we can use t distributions (analogous to z distributions) to figure out how many of our predictions fall more than a certain distance from the mean.
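A minimal sketch of the error variance and the standard error of estimate, on hypothetical data. It assumes N - 2 degrees of freedom in the denominator, the usual convention when both a and b are estimated from the sample. If the course text divides by something else, the numbers will differ.

```python
import math

# Hypothetical scores, invented for illustration; they are not course data.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Least-squares slope and intercept.
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
ss_x = sum((x - mean_x) ** 2 for x in X)
b = sp / ss_x
a = mean_y - b * mean_x

# Squared deviations of actual scores from the predicted scores Y'.
ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))

error_variance = ss_error / (n - 2)        # assumed N - 2 denominator
standard_error = math.sqrt(error_variance)

print(f"error variance             = {error_variance:.3f}")
print(f"standard error of estimate = {standard_error:.3f}")
```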
The total variability in Y can be thought of as having two components. We have been discussing the part you cannot predict (the error variability) because the actual value does not lie on the regression line. In addition, part of the variability you can predict by using the regression equation. Y' changes with different Xs. This variability in the predicted scores is a part of the variability in Y. We can represent that variability by $SS_{regression}$. It is defined similarly to other SSs: $SS_{regression} = \sum (Y' - \bar{Y})^2$.
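There is an important and useful relationship between the SSs.
$\sum (Y - \bar{Y})^2 = \sum (Y' - \bar{Y})^2 + \sum (Y - Y')^2$, that is, $SS_Y = SS_{regression} + SS_{error}$.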
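The split of the Y variability into a predictable part and an error part can be verified numerically. The sketch below uses hypothetical data, and the names `ss_regression` and `ss_error` are labels chosen here for the two components.

```python
# Hypothetical scores, invented for illustration; they are not course data.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Least-squares slope and intercept.
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
ss_x = sum((x - mean_x) ** 2 for x in X)
b = sp / ss_x
a = mean_y - b * mean_x
predicted = [a + b * x for x in X]

ss_y = sum((y - mean_y) ** 2 for y in Y)                   # total
ss_regression = sum((p - mean_y) ** 2 for p in predicted)  # predictable part
ss_error = sum((y - p) ** 2 for y, p in zip(Y, predicted)) # unpredictable part

print(f"SS_Y = {ss_y:.2f}")
print(f"SS_regression + SS_error = {ss_regression + ss_error:.2f}")
```

The two printed values should match apart from rounding. The identity holds for the least-squares line, not for an arbitrary line.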
These formulae help to explain conceptually what the regression line does. They are almost never used for computation. There are much easier ways of computing them.
The most common computations use r, or more precisely $r^2$, in relation to $SS_Y$.
The variance is basically the average of the squared deviations from the mean. The error variance is basically the average of the squared deviations from the regression line. The standard deviation is the square root of the variance. The standard error is the square root of the error variance. If we find the equivalent of the sum of squares of the deviations from the regression line, the rest should be easy. $SS_{error}$ and $SS_{regression}$ are analogous to $SS_Y$. $SS_{error}$ is the sum of squares around the regression line. $SS_{regression}$ is the sum of squares of the regression line. $SS_{regression} = r^2 SS_Y$, and $SS_{error} = (1 - r^2) SS_Y$, but this formula is not used very much although $r^2$ is often used as the proportion of the variability in Y accounted for by X.
Our text uses a direct formula (p 142) to compute $s_{y|x}$ from the sums of squares of X and Y and the crossproduct of X and Y.
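The textbook formula itself is not reproduced here. As a hedged sketch, the standard error of estimate can be reached by two shortcut routes that avoid computing each predicted score: one through r squared and SS_Y, and one through SS_X, SS_Y, and SP. Both expressions below are standard identities with an assumed N - 2 denominator. Whether the second matches the formula on p 142 is an assumption.

```python
import math

# Hypothetical scores, invented for illustration; they are not course data.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

ss_x = sum((x - mean_x) ** 2 for x in X)
ss_y = sum((y - mean_y) ** 2 for y in Y)
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
r_squared = sp ** 2 / (ss_x * ss_y)

# Route 1: error sum of squares as (1 - r^2) * SS_Y.
se_from_r = math.sqrt((1 - r_squared) * ss_y / (n - 2))

# Route 2: error sum of squares as SS_Y - SP^2 / SS_X.
se_direct = math.sqrt((ss_y - sp ** 2 / ss_x) / (n - 2))

print(f"via r squared : {se_from_r:.4f}")
print(f"via SS and SP : {se_direct:.4f}")
```

The two routes are algebraically the same because r squared equals SP squared divided by the product of SS_X and SS_Y.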

Regression and correlation.
There is one more relationship that should be mentioned. The correlation coefficient, r, can be used as the slope of a regression line. If you convert all of the scores in both X and Y to their respective z scores and compute a regression line on them, the result is simple and neat, $z_{Y'} = r z_X$. In z score units the slope is r and the line goes through (0,0), so the constant is zero. By a little algebra you can derive the regression line from this formula. Also in z score units $r^2$ is both the proportion of the variance accounted for and a measure of the variability of $z_{Y'}$. In analogous fashion $1 - r^2$ is a measure of the variability around the regression line.
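The z-score version can be checked with a short sketch on hypothetical data. It converts both sets of scores to z scores using sample standard deviations (an assumption; using N in both gives the same slope), fits a least-squares line to the z scores, and compares the slope with r. The helper name `z_scores` is made up for illustration.

```python
import math


def z_scores(values):
    """Convert raw scores to z scores using the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


# Hypothetical scores, invented for illustration; they are not course data.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)

zx = z_scores(X)
zy = z_scores(Y)

# Least-squares line on the z scores (their means are zero).
mean_zx = sum(zx) / n
mean_zy = sum(zy) / n
sp_z = sum((u - mean_zx) * (v - mean_zy) for u, v in zip(zx, zy))
ss_zx = sum((u - mean_zx) ** 2 for u in zx)
slope = sp_z / ss_zx
intercept = mean_zy - slope * mean_zx

# Pearson r computed from the raw scores, for comparison.
mean_x = sum(X) / n
mean_y = sum(Y) / n
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
ss_x = sum((x - mean_x) ** 2 for x in X)
ss_y = sum((y - mean_y) ** 2 for y in Y)
r = sp / math.sqrt(ss_x * ss_y)

# Variance of the predicted z scores, using the same N - 1 convention.
predicted_z = [slope * u for u in zx]
var_predicted = sum(p ** 2 for p in predicted_z) / (n - 1)

print(f"slope = {slope:.4f}, r = {r:.4f}, intercept = {intercept:.4f}")
print(f"variance of predicted z = {var_predicted:.4f}, r squared = {r * r:.4f}")
```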

Summary on regression
We have seen how to calculate regression from data and from five basic statistics, $\bar{X}$, $\bar{Y}$, $s_X$, $s_Y$, and $r$.
We have seen how regression can be used for prediction.
Using regression as a frame, we have seen how $r^2$ can be used as a measure of variability in Y accounted for by X and $1 - r^2$ as a measure of "error" variability.
We have seen that Y variability can be separated into that accounted for by X and that not accounted for by X.
We can relate b with r, and we have seen that r is the slope of the regression line if X and Y are transformed to z-scores.
We can find the standard error of estimation using the standard deviation or sum of squares in Y and the correlation between X and Y.

Here is a regression game, "The effect of adding one additional data point to a regression line." It's from Webster West at the University of South Carolina. Look at the regression equation and how it changes when you put an additional point in different places. How small a correlation can you get? What is the largest it can be?
