Psy 207B      Introduction to Statistics

Erwin Segal

Inferential Statistics and Sampling Distributions

        The purpose of inferential statistics is to ask questions and make decisions about the world that extend beyond the data at hand. Such questions are often answered by considering the possible outcomes of experiments under different possible conditions of the world. For example, if smoking is a cause of lung cancer, then a greater proportion of people who smoke than of those who do not should develop lung cancer. If smoking is not a cause of lung cancer, then the proportion of smokers who have lung cancer should not differ from the proportion of non-smokers who have the disease. By collecting data and using principles of probability, researchers can decide which of these possibilities they believe to be true. The data collected are considered to be a sample taken from a much larger population. The population has the "true" properties that exist in the world. The sample may or may not represent the population accurately.
        In the example introduced in Chapter 10 of the text, if marijuana aids the appetite of AIDS patients, then patients should eat more after taking a THC pill than after taking a placebo. Statistical analyses of data will let us answer such questions as "Does marijuana aid the appetite of AIDS patients?" rationally.

Three Important Concepts
        Three important concepts lie beneath most inferential statistics: population, sample, and sampling distribution.
1. Populations are the sets of individuals that the statistician is interested in learning about. They (in some sense) exist in the real world independent of any direct actions by the researcher, although the researcher is interested in possible measures on them. Some examples are heights of adults in the world; appetites of people with late-stage AIDS; lung capacity of NFL linemen; probability of heads in a flip of a coin; voting preference of eligible voters in New York; and average number of cigarettes smoked daily by someone one month before the detection of lung cancer. Each of these variables in its respective population could be represented in a frequency distribution, or frequency density function. The population distribution can be thought of as having a mean, represented by μ (the Greek letter mu), and a standard deviation, represented by σ (the Greek letter sigma). μ and σ are often called population parameters.
        There is a major problem that often interferes with the use of population parameters to help us make statistical decisions. We very seldom know exactly what the population parameters actually are. One task of inferential statistics is estimating what some of these parameters are. Another important task is to guess what some population parameters might be and then see whether these parameters are consistent with the data collected.

2. Samples are the source of data actually collected by the statistician. A sample is a set of individuals taken from the population, and the variable of interest is actually measured on each of them. For statistics one almost always needs the sample to be a random sample from the population. A random sample is defined as one in which each individual in the population has an equal chance of being selected. A sample has a certain size, represented by N, and other statistics are computed from it, often the mean, X̄, and standard deviation, s.

3. Sampling Distributions: 'Sampling distribution' is one of the hardest concepts to understand in statistics. It is the probability distribution of a statistic, where that statistic is computed from each of an infinite number of random samples generated in the same way as the sample we have taken from the population. For example, if a random sample with N = 10 is taken from a population with μ = 100 and σ = 10, the sampling distribution of means would be the probability distribution of means computed from an infinite number of random samples with N = 10 taken from that population. If the properties of the population are known, then the properties of the sampling distribution of a statistic (X̄'s, s's, proportions of hits, etc.) can be computed.
        One critical issue is that we do not know whether the population we are interested in has the parameters that we are using to generate the sampling distribution. The sampling distribution tells us only that IF these are the parameters of the population, THEN this is the probability distribution of the statistic in question.
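        The idea of "an infinite number of random samples" can be approximated by simulation. The sketch below is only an illustration: it assumes the population is normal (the example above specifies only its mean of 100 and standard deviation of 10), it uses 20,000 samples in place of infinitely many, and the function name simulate_sample_means is invented for this example.

```python
import random
import statistics

def simulate_sample_means(mu=100.0, sigma=10.0, n=10, num_samples=20000, seed=0):
    """Approximate the sampling distribution of the mean.

    Draws num_samples random samples of size n from a normal population
    (normality is an assumption of this sketch) and records each sample mean.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(num_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means

means = simulate_sample_means()
mean_of_means = statistics.fmean(means)   # should be close to the population mean, 100
sd_of_means = statistics.pstdev(means)    # should be close to 10 / sqrt(10), about 3.16
print(round(mean_of_means, 2), round(sd_of_means, 2))
```

        The list of means plays the role of the sampling distribution: IF the population really has a mean of 100 and a standard deviation of 10, THEN sample means with N = 10 should scatter around 100 in roughly this way.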
        Hypothesis testing, parameter estimation, and power of tests all flow from the concept of sampling distributions. In Hypothesis Testing, our decisions are informed by the probability (or likelihood) of getting outcomes similar to those we actually obtained if the hypotheses we make about the population are true. If our sample has some property, such as a mean, that is not likely to occur in the sampling distribution it is tested against, we conclude that the population from which the sampling distribution is generated is NOT the population from which we have a random sample. We thus reject the hypothesis that is being tested, and decide that some alternative hypothesis is correct.
        In Parameter Estimation, we ask which are the least likely sampling distributions in which our actual sample statistic would still have a high enough probability of occurring randomly for us to say that it may have come from that population. These least likely sampling distributions set the outside bounds of our parameter estimate. We know that our population parameter most likely, to a specified probability, falls within these bounds. We claim that the true population parameter lies within these bounds.
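        As a rough sketch of how such bounds might be computed for a mean, the code below assumes a normal sampling distribution and a known population standard deviation. The sample values (a mean of 104 from N = 10, with a standard deviation of 10) are hypothetical, and interval_for_mean is a name invented for this illustration.

```python
import math

def interval_for_mean(sample_mean, sigma, n, z_crit=1.96):
    """Return (lower, upper) bounds for the population mean.

    Assumes a normal sampling distribution and a known population
    standard deviation; z_crit = 1.96 corresponds to 95% bounds.
    """
    standard_error = sigma / math.sqrt(n)
    margin = z_crit * standard_error
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample: mean of 104 from N = 10, sigma = 10 assumed known.
lower, upper = interval_for_mean(104.0, 10.0, 10)
print(f"95% bounds: {lower:.2f} to {upper:.2f}")

# At the lower bound, the observed mean sits exactly z_crit standard errors
# above the hypothesized population mean: the edge of "likely enough".
z_at_lower = (104.0 - lower) / (10.0 / math.sqrt(10))
```

        The two bounds are the means of the "least likely" sampling distributions described above: a population mean any lower than the lower bound, or any higher than the upper bound, would make the observed sample mean too improbable.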
        In evaluating the Power of a test, we ask how likely we are to get a random sample that is far enough in the tails of our tested distribution (the null hypothesis) to decide that it is false when in reality it is false, i.e., when the sampling distribution we are REALLY sampling from is NOT the sampling distribution we are testing. We know that the more powerful the test, the more likely we are to reject FALSE (null) hypotheses.
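        A power calculation along these lines can be sketched as follows. The sketch assumes a two-tailed z test with a known population standard deviation; the "true" mean of 105 is a hypothetical value chosen only for illustration, and power_two_tailed_z is an invented name.

```python
from statistics import NormalDist

def power_two_tailed_z(mu0, mu_true, sigma, n, alpha=0.05):
    """Probability of rejecting H0: mu = mu0 when the population mean is mu_true.

    Assumes a normal sampling distribution of the mean with known sigma.
    """
    se = sigma / n ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Rejection region: sample means far enough in either tail of the H0 distribution.
    lower_cut = mu0 - z_crit * se
    upper_cut = mu0 + z_crit * se
    # Sampling distribution we are REALLY sampling from.
    true_dist = NormalDist(mu_true, se)
    return true_dist.cdf(lower_cut) + (1 - true_dist.cdf(upper_cut))

# Hypothetical case: H0 says the mean is 100, but the true mean is 105 (sigma = 10).
power_n10 = power_two_tailed_z(100, 105, 10, 10)
power_n40 = power_two_tailed_z(100, 105, 10, 40)
print(round(power_n10, 3), round(power_n40, 3))
```

        Under these assumptions the larger sample gives the more powerful test, because its narrower sampling distribution puts more of the true distribution inside the rejection region.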
        These are the things that we do in inferential statistics.

 Populations and samples

Sampling Distributions
Let us look at the mathematical relationship between the properties of sampling distributions and properties of the populations they derive from.
Sampling distributions of the Mean

Consider a population, F, the distribution of IQ’s. It's a normal distribution with a mean, μ = 100, and standard deviation, σ = 15.

Let's randomly select an individual from F, and compute its mean. This is the same experiment as randomly selecting an individual and reporting her measure. What would you expect this "mean" to be? It is expected to be 100. That is because on the average, if this "experiment" is run many, many times and the distribution of means (scores) is plotted, it would have μ = 100. (This is our assumption in designing the problem.) The distribution would also have a standard deviation or, if we think of each measure as a sample of one, a standard error of the mean, σ/√1 = 15. This means that if you guessed the score as 100, you would miss your guess on the average, roughly speaking, by about 15 IQ points. Thus, a sampling distribution of the mean with n = 1 gives a distribution which looks like the distribution of the population, but can now be thought of as a probability distribution. What if for an experiment we randomly sampled 2 people and computed X̄? You would again predict X̄ to be 100, because the mean of the sampling distribution of X̄ equals μ = 100.
With n = 2, on the average your prediction would be closer than with n = 1. Since the standard error of the sampling distribution of means is σ/√n = 15/√2, you would miss your estimate on the average by only about 10.61. This sampling distribution has the same mean as the population, but a smaller standard error (standard deviation).

If we randomly sample 10 people in our experiment, we would again expect X̄ to be 100. But on the average we would be off by only 15/√10 = 4.74.
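The standard errors quoted for the IQ example can be checked with a few lines of code. This is a minimal sketch; the function name standard_error is chosen here for illustration.

```python
import math

SIGMA = 15.0  # population standard deviation of IQ in the example

def standard_error(sigma, n):
    """Standard error of the mean for random samples of size n."""
    return sigma / math.sqrt(n)

for n in (1, 2, 10):
    print(f"n = {n:2d}: standard error = {standard_error(SIGMA, n):.2f}")
# prints 15.00, 10.61 and 4.74 for n = 1, 2 and 10
```

Because n appears under a square root, quadrupling the sample size only halves the standard error.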

A sampling distribution is a probability distribution. If it is normally distributed, about 68% of the means of random samples will fall within one standard error of the population mean, and about 95% of the sample means will fall within about 2 standard errors of the population mean. Only about 5 out of every 100 samples would have a mean more than 2 (actually 1.96) standard errors away from the population mean.
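These percentages follow from the standard normal curve. A short sketch, assuming the sampling distribution is exactly normal and using Python's statistics.NormalDist (the helper name proportion_within is invented here):

```python
from statistics import NormalDist

def proportion_within(k):
    """Proportion of a normal distribution within k standard errors of its mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

within_1 = proportion_within(1.0)     # about .68
within_196 = proportion_within(1.96)  # about .95
within_2 = proportion_within(2.0)     # slightly more than .95
print(round(within_1, 4), round(within_196, 4), round(within_2, 4))
```

The value 1.96 is the cutoff that leaves almost exactly 5% of sample means in the two tails combined, which is why it is used in place of the rounder figure of 2.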

Hypothesis testing 1

One can ask whether a random sample is a sample from a hypothesized sampling distribution by considering probabilities of events. If the sample statistic falls at an unlikely point in the distribution, we can conclude that it did not come from that sampling distribution.

How can we test whether H0 is in error?
Example of Hypothesis testing

We want to test whether a coin is truly fair. The null hypothesis, H0, is that a sample of coin flips with this coin would come from a sampling distribution with p(Heads) = .5. Let us test with α = .05.

Looking up the obtained z score of 1.60 in the z tables, we see that a z score this large or larger has a probability of .0548. Since we had no reason to think that heads was more likely than tails, we use a two-tailed test. That is, we consider how far from a 50-50 split our data are, regardless of direction. There is a 5.48% chance of a z score as large as 1.60, and an equal chance of a z score as small as -1.60, if H0 is true. Thus we could have a score this deviant over 10% of the time by chance alone (2 × .0548 = .1096), which is greater than .05. This is not considered far enough from the expected value to reject H0.
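The table lookup and the two-tailed decision can be reproduced in a few lines. The sketch takes only the z score of 1.60 and the .05 criterion from the example; it uses Python's statistics.NormalDist in place of a printed z table, and the variable names are chosen for illustration.

```python
from statistics import NormalDist

z_obs = 1.60   # z score reported in the example
alpha = 0.05   # significance criterion used for the test

# Probability of a z score this large or larger if H0 is true (upper tail only).
one_tail = 1 - NormalDist().cdf(z_obs)

# Two-tailed test: equally extreme results in either direction count.
two_tail = 2 * one_tail

reject_h0 = two_tail < alpha
print(round(one_tail, 4), round(two_tail, 4), reject_h0)
# prints 0.0548 0.1096 False
```

Since the two-tailed probability is larger than .05, the sketch reaches the same decision as the text: H0 is not rejected.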