Psy 207    Introduction to Statistics
Correlation and covariance

Erwin Segal

Homework on Homework Page  
One very important issue in statistics concerns the relationships between two or more measures on the same individual. In some sense, most uses of statistics are concerned in some way with this issue.
Some correlations that have been studied or are in the news recently include:
What is the relationship between smoking and lung cancer?
What is the relationship between SAT scores and grades in college?
What is the relationship between education and income?
What is the relationship between fossil fuel emissions and global temperature?

With census-type data, questions we could ask include:
What is the relation between number of immigrants and average income?
How do regions of the country differ in the percent of adults that graduated college?
How does crime rate correlate with average age of the community?

This and the next topic in this course will explore direct measures of the relationship between measures. The primary statistic that is used is the Pearson Product Moment Correlation Coefficient, often called Pearson r, or simply r. Pearson r measures the strength of the linear relation between two quantitative measures on each of a set of individuals. Every individual has two measures (X, Y) ascribed to it.
Correlation is a measure of how these two measures covary across individuals. Do high scores in one variable go with high scores in the other? If so, how strong is the relationship? If not, how are the two scores for one individual related?
This relationship can be depicted in a scatterplot, or a bivariate frequency distribution. This is plotted on a standard rectangular or Cartesian coordinate system, with one variable represented on the abscissa or X-axis and one on the ordinate or Y-axis. Unlike a univariate histogram or frequency polygon, both dimensions represent scores. An individual is represented as a point on the plane, vertically above (or below) its value on the X dimension and horizontally to the left (or right) of its value on the Y dimension. How the points are laid out on the plane is the relationship that r tries to describe.

Pearson r is a pure measure of the strength of the linear relation between two variables. Pearson r is a function of the "best fit straight line" that one can draw through the points. Best fit is defined in a particular way (the line for which the sum of the squared deviations from the line is at a minimum) that is widely used in statistics. It is analogous to the way that a variance is identified in univariate statistics. Pearson r is different from a variance in several ways.
1. r is unit free, i.e. it has no units, whereas the unit for a variance is squared score units. The value of r is unaffected by the units of the measures. This makes it easy to compare relationships across different measures and conditions and gives r some added features.
2. A variance can never be a negative number, whereas r can be positive, negative, or zero.
3. Variances can have any magnitude whatsoever, whereas r always has a value in the range $-1 \le r \le +1$.
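The three differences listed above can be checked numerically. The following is a minimal sketch, not part of the original notes: the height and weight values are made-up illustration data, and `pearson_r` and `variance` are hypothetical helper names. It computes r from deviation scores, one standard and equivalent form of the coefficient.

```python
import math

def pearson_r(xs, ys):
    """Pearson r computed from deviation scores around each mean."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

def variance(xs):
    """Population variance: mean squared deviation from the mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

# Made-up illustration data (an assumption, not from the course).
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [115, 120, 140, 155, 160, 185]

# The same individuals measured in different units.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.4536 for w in weights_lb]

r_imperial = pearson_r(heights_in, weights_lb)
r_metric = pearson_r(heights_cm, weights_kg)

# Reversing the direction of one variable flips the sign of r.
r_negative = pearson_r(heights_in, [-w for w in weights_lb])

print("r (inches, pounds):", round(r_imperial, 4))
print("r (cm, kg):        ", round(r_metric, 4))
print("r (reversed Y):    ", round(r_negative, 4))
print("variance in inches^2:", round(variance(heights_in), 2))
print("variance in cm^2:    ", round(variance(heights_cm), 2))
```

Changing units leaves r the same but multiplies the variance by the square of the conversion factor, and r stays within -1 and +1 whatever the scale of the scores.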

One definition of r is based on z scores. If the X and Y measures for each individual in the population are converted to z scores, so that both variables have a mean of zero and a standard deviation of 1, then r is defined as the average of the products of the z scores: $r = \frac{\sum z_X z_Y}{N}$, where N is the number of individuals. If we computed z scores based on sample estimates of population variances, we have to "correct back" and use the formula that is in the book: $r = \frac{\sum z_X z_Y}{N - 1}$. Although this definitional formula makes sense and can be used, it is tremendously labor intensive and it hides some important relationships. We will introduce some mathematically equivalent formulae. These are easier to compute and have components that are usable in other contexts.
If you substitute the definition of z for the X and Y variables, you get this equation: $r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N \sigma_X \sigma_Y}$. The numerator of this r, i.e., $\sum (X - \bar{X})(Y - \bar{Y})$, is known as the sum of the cross products, SP. It is analogous to the sum of squares, SS, $SS_X = \sum (X - \bar{X})^2$. Substituting computational formulae for the two standard deviations and SP, the computational formula for r emerges. The numerator is SP, $SP = \sum XY - \frac{(\sum X)(\sum Y)}{N}$, and the denominator consists of the square root of $SS_X$ times $SS_Y$;
thus $r = \frac{SP}{\sqrt{SS_X \, SS_Y}}$. This is a fine computing formula for r, with

$SS_X = \sum X^2 - \frac{(\sum X)^2}{N}$ and $SS_Y = \sum Y^2 - \frac{(\sum Y)^2}{N}$. Note the similarity of SP with SS. SP uses one X and one Y where SS uses two Xs or two Ys.
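The equivalence of the z-score definition and the SP/SS computing formula can be verified on a small data set. This is a sketch under stated assumptions: the five (X, Y) pairs are invented for illustration, and the function names are hypothetical. It is not the textbook's own procedure.

```python
import math

def r_from_z_scores(xs, ys, sample=False):
    """Average product of z scores.

    sample=False: population SDs (divide by N), then average over N.
    sample=True:  sample SDs (divide by N - 1), then "correct back"
                  by dividing the sum of products by N - 1.
    """
    n = len(xs)
    divisor = n - 1 if sample else n
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / divisor)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / divisor)
    z_x = [(x - mean_x) / sd_x for x in xs]
    z_y = [(y - mean_y) / sd_y for y in ys]
    return sum(a * b for a, b in zip(z_x, z_y)) / divisor

def r_computational(xs, ys):
    """r = SP / sqrt(SS_X * SS_Y), using raw-score computing formulae."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sp = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_y = sum(y * y for y in ys) - sum_y ** 2 / n
    return sp / math.sqrt(ss_x * ss_y)

# Invented illustration data: SP = 8, SS_X = 10, SS_Y = 10, so r = 0.8.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

r_pop = r_from_z_scores(xs, ys)
r_samp = r_from_z_scores(xs, ys, sample=True)
r_comp = r_computational(xs, ys)

print(round(r_pop, 6), round(r_samp, 6), round(r_comp, 6))
```

All three routes give the same r. The computing formula needs only the five sums $\sum X$, $\sum Y$, $\sum XY$, $\sum X^2$, and $\sum Y^2$, which is why it is so much less work by hand.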
 



Properties of r

• A measure of the linear relationship between two measures on each individual in a set.
• If the possible range of measures is restricted, the computed |r| will tend to be smaller.
• $r^2$ (called the coefficient of determination) measures the proportion of common variance. It represents the reduction in variance of Y as a function of X (or of X as a function of Y).
• If you know $X_i$ you can predict $Y_i$ with more accuracy using the best fit straight line (provided r is not zero).
• $1 - r^2$ is the coefficient of nondetermination. It shows the proportion (percent) of variance in X which is not related to Y.
• r shows a mathematical relationship between two variables; not a causal one. Even if the correlation is high, cause could go in either direction, both directions, or some unmeasured factor might affect both measures.
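Two of the properties above, the meaning of $r^2$ and the effect of a restricted range, can be illustrated with a short sketch. The data are invented (Y equals X plus an alternating +1/-1 disturbance), the helper names are hypothetical, and the least-squares slope b = SP / SS_X is the standard result for the best fit line rather than something stated in these notes.

```python
import math

def sums(xs, ys):
    """Return SP, SS_X, SS_Y computed from deviation scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp, ss_x, ss_y

def pearson_r(xs, ys):
    sp, ss_x, ss_y = sums(xs, ys)
    return sp / math.sqrt(ss_x * ss_y)

def residual_ss(xs, ys):
    """Sum of squared deviations of Y from the least-squares line."""
    n = len(xs)
    sp, ss_x, _ = sums(xs, ys)
    slope = sp / ss_x                      # standard least-squares slope
    intercept = sum(ys) / n - slope * sum(xs) / n
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Invented data: X = 1..10, Y = X plus an alternating +1 / -1 disturbance.
xs = list(range(1, 11))
ys = [x + (1 if i % 2 == 0 else -1) for i, x in enumerate(xs)]

r_full = pearson_r(xs, ys)
_, _, ss_y = sums(xs, ys)
ss_res = residual_ss(xs, ys)

# Proportion of Y variance removed by predicting from X, and what is left.
determination = 1 - ss_res / ss_y
nondetermination = ss_res / ss_y

# Restrict the range: keep only individuals with X between 4 and 7.
kept = [(x, y) for x, y in zip(xs, ys) if 4 <= x <= 7]
r_restricted = pearson_r([x for x, _ in kept], [y for _, y in kept])

print("r (full range):", round(r_full, 3))
print("r squared:     ", round(r_full ** 2, 3))
print("1 - SS_res/SS_Y:", round(determination, 3))
print("r (X in 4..7): ", round(r_restricted, 3))
```

In this example the proportional reduction in the variability of Y around the line equals $r^2$, the remainder equals $1 - r^2$, and the correlation computed on the restricted range is smaller than on the full range. Range restriction lowers |r| as a tendency, not in every possible data set.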
