Erwin Segal
Homework on Homework Page
One very important issue in statistics concerns the relationships between two or more measures on the same individual. In some sense, most uses of statistics are concerned in some way with this issue.
With census-type data, questions we could ask include:
What is the relation between number of immigrants and average
income?
How do regions of the country differ in the percent of adults
that graduated college?
How does crime rate correlate with average age of the community?
This and the next topic in this course will explore
direct measures of the relationship between measures. The primary statistic
that is used is the Pearson Product Moment Correlation Coefficient, often
called Pearson r, or simply
r. Pearson r measures the strength of the linear relation between
two quantitative measures on each of a set of individuals. Every individual
has two measures (X, Y) ascribed to it.
Correlation is a measure of how these two measures covary
across individuals. Do high scores in one variable go with high scores in
the other? If so, how strong is the relationship? If not, how are the two
scores for one individual related?
This relationship can be depicted in a
scatterplot, or a bivariate
frequency distribution. This is plotted on a standard rectangular or Cartesian
Coordinate System, with one variable represented on the abscissa or X-axis
and one on the ordinate or Y-axis. Unlike a univariate histogram or frequency
polygon, both dimensions represent scores. An individual is represented as
a point on the plane, vertically above (or below) its value on the x dimension
and horizontally to the left (or right) of its value on the y dimension.
How the points are laid out on the plane is the relationship that r tries to describe.
One definition of r is based on z scores. If the x and
y measures for each individual in the population are converted to z scores,
so that both variables have a mean of zero and a standard deviation of 1,
then r is defined as the average of the product of the z scores:

$$r = \frac{\sum z_x z_y}{N}$$

If the z scores are computed from sample estimates of the population variances (dividing by n - 1 rather than N),
we have to "correct back" by dividing the sum of the products by n - 1 as well, and use the formula that is in the book:

$$r = \frac{\sum z_x z_y}{n - 1}$$
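The z-score definition can be checked numerically. What follows is a minimal Python sketch, not a formula taken from the book: the function names and the five (X, Y) pairs are invented for illustration. It computes r once from population z scores averaged over N, and once from sample z scores with the sum divided by n - 1. The two routes should agree.

```python
from statistics import mean, pstdev, stdev


def z_scores(values, sd_func):
    """Convert raw scores to z scores using the given standard deviation function."""
    m = mean(values)
    s = sd_func(values)
    return [(v - m) / s for v in values]


def r_from_population_z(xs, ys):
    """r as the average product of z scores (standard deviations divide by N)."""
    zx = z_scores(xs, pstdev)
    zy = z_scores(ys, pstdev)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


def r_from_sample_z(xs, ys):
    """r from sample z scores (standard deviations divide by n - 1), corrected back."""
    zx = z_scores(xs, stdev)
    zy = z_scores(ys, stdev)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


# Hypothetical data: five individuals, each with an X and a Y score.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print(round(r_from_population_z(xs, ys), 4))  # 0.7746
print(round(r_from_sample_z(xs, ys), 4))      # 0.7746
```

Even for five individuals this takes two means, two standard deviations, ten z scores, and five products, which is why the definitional formula is rarely used for hand computation.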
Although this definitional formula makes sense, and can be used, it is tremendously
labor intensive and it hides some important relationships. We will introduce
some mathematically equivalent formulae. These are easier to compute and
have components that are useable in other contexts. If you substitute the
definition of z,

$$z = \frac{X - \mu}{\sigma},$$

for the x and y variables, you get this equation:

$$r = \frac{\sum (X - \mu_x)(Y - \mu_y)}{N \sigma_x \sigma_y}$$

The numerator of this r, i.e.,

$$SP = \sum (X - \mu_x)(Y - \mu_y),$$

is known as the sum of the cross products, SP. It is analogous to the sum
of squares, SS,

$$SS = \sum (X - \mu)^2$$

Substituting computational formulae for the two standard deviations and
SP, the computational formula for r emerges. Since $\sigma = \sqrt{SS/N}$, the N's cancel:
the numerator is SP, and the denominator consists of the square root of SSx times SSy.
Thus

$$r = \frac{SP}{\sqrt{SS_x \, SS_y}}$$

This is a fine computing formula for r, with

$$SP = \sum XY - \frac{\sum X \sum Y}{N}, \qquad SS_x = \sum X^2 - \frac{(\sum X)^2}{N}, \qquad SS_y = \sum Y^2 - \frac{(\sum Y)^2}{N}$$

Note the similarity of SP with SS. SP uses one X and one
Y where SS uses two Xs or two Ys.
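The computing formula, r = SP / sqrt(SSx * SSy), can be sketched in a few lines of Python. This is only an illustration under assumed names: `sum_of_products`, `sum_of_squares`, and the small data set are hypothetical, not part of the course materials.

```python
from math import sqrt


def sum_of_products(xs, ys):
    """SP = sum(XY) - sum(X) * sum(Y) / N."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n


def sum_of_squares(xs):
    """SS = sum(X^2) - (sum(X))^2 / N."""
    n = len(xs)
    return sum(x * x for x in xs) - sum(xs) ** 2 / n


def pearson_r(xs, ys):
    """r = SP / sqrt(SSx * SSy)."""
    return sum_of_products(xs, ys) / sqrt(sum_of_squares(xs) * sum_of_squares(ys))


# Hypothetical data: five individuals, each with an X and a Y score.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print(sum_of_products(xs, ys))      # 6.0   (66 - 15 * 20 / 5)
print(sum_of_squares(xs))           # 10.0  (55 - 225 / 5)
print(sum_of_squares(ys))           # 6.0   (86 - 400 / 5)
print(round(pearson_r(xs, ys), 4))  # 0.7746
```

The analogy between SP and SS shows up directly in the code: calling `sum_of_products(xs, xs)` returns the same value as `sum_of_squares(xs)`, because SS is simply SP with the same variable used twice.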
Properties of r
This is a game
for beginning to learn something about what the different sizes of correlation
coefficients (r) mean.