Much of statistics is concerned with the problem of obtaining information about a population from information about a sample. One very vivid application is currently in the news: polls attempt to determine the way a population will vote by examining the voting patterns within a sample.

The idea of generalizing from a sample to a population is not hard to grasp in a loose and informal way, since we do this all the time. After a few visits to a store, for example, we notice that the produce is not fresh. So we assume that the store generally has bad produce. This is a generalization from a sample (the vegetables we have examined) to a population (all the vegetables the store sells). But there are many ways to go wrong or to misunderstand the meaning of the data obtained from a sample.

How do statisticians conceive of the process of drawing a conclusion about a population from a sample? How do they describe the information that is learned from a sample and quantify how informative it is? How much data do we need in order to reach a conclusion that is secure enough to print in a newspaper? Or on which to base medical decisions? These are the questions that we will address this week.

The simplest example arises when one uses a sample to infer a population proportion. We can give a fairly complete account of the mathematical ideas that are used in this situation, based on the binomial distribution. My aim is to enable you to understand the internal mathematical "clockwork" of how the statistical theory works.

**Read:** Chapter 8, sections 1, 2 and 3. For the time being, do not worry about passages that contain references to the "normal distribution" or the "Central Limit Theorem". (Last sentence on page 328, last paragraph on p. 330, first paragraph on p. 332.) Also, do not worry for the time being about the examples in section 3.2.

**Review questions:** pages 335 and 351.

**Problems:** p. 336: 1--8, 11, 12, 13, 14. p. 351: 1--12, 13, 16, 21, 22.

**In-class:** p. 337: 20.

**EXTRA CREDIT:** Find an article in the *New York Times* that describes a poll. *The New York Times* provides readers with a very careful explanation of margin of error and level of confidence; find their explanation either in an issue of the paper or on the paper's web site, and report on it. Compare with the information provided by other papers.

**Parameters and statistics:**

- **population mean:** the average value of a variable, where the reference class is a population of interest. E.g., the average height of all persons holding a Louisiana driver's license. This is a parameter.
- **sample mean:** the average value of a variable, where the reference class is a sample from the population. This is a statistic. It is also a variable that has as its reference class all possible samples.
- **population proportion:** the proportion of a population with a given property. E.g., the proportion of registered voters in East Baton Rouge who are Republican. This is a parameter.
- **sample proportion:** the proportion of a sample with the property. This is a statistic. It is also a variable that has as its reference class all possible samples.

**Sampling distribution:** the distribution of a variable whose reference class consists of all samples (of some fixed size) drawn from some population.

*Example:* Consider the population of all LSU students, and consider drawing samples of size 100. The variable is the average height of the people in the sample. (Here we are looking at the distribution of the sample mean.)

*Example:* Use the same population and the same sample size, but now consider the variable "percent male". This is again something that can be measured in each sample. The sampling distribution tells us the relative frequency of each possible sample percent in the reference class of all samples.

**Margin of error:** a bound that we can confidently place on the difference between an estimate of something and the true value.

**Level of confidence:** a measure of how confident we are in a given margin of error.

We will concentrate on estimating **population proportions** by sampling.

- If we were to take many samples (of a given size) from a population that
was 40% democratic (say), then few samples would have exactly 40% democrats.
Most would be close to 40%, but they would differ by varying small amounts.

- If many random samples of size 100 are drawn from a large population (of democrats and non-democrats), then we can expect better than 95% of the samples to have a statistic (proportion of democrats in sample) within one tenth of the true value of the parameter (proportion of democrats in the population). (For example, if the true proportion is 40%, then better than 19 out of 20 samples of size 100 will have a proportion between 30% and 50%.)
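The figure quoted for samples of size 100 can be checked directly against the binomial distribution. The following is a minimal sketch, assuming the population is large enough that each draw can be treated as independent; the function name `prob_within` and its parameters are illustrative and do not come from the text.

```python
from math import comb

def prob_within(n, p, margin):
    """Probability that the sample proportion k/n lies within `margin` of p,
    for a random sample of size n from a large population with proportion p."""
    total = 0.0
    for k in range(n + 1):
        # A small tolerance keeps boundary cases (e.g. 30/100 when p = 0.40 and
        # margin = 0.10) from being dropped by floating-point rounding.
        if abs(k / n - p) <= margin + 1e-12:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total

# Samples of size 100 from a population that is 40% democrats:
# chance that the sample proportion lands between 30% and 50%.
print(prob_within(100, 0.4, 0.1))
```

The printed value comes out above 0.95, in line with the "better than 19 out of 20" statement. Trying other values of `p` shows how little the answer depends on the true proportion.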

The meaning of margin of error and level of confidence.

- What you know about a population when you have a sample of size 100 is similar
to what you know about the contents of a jar of gum balls if you have the
following information:

- there are only two colors of gum ball in the jar (yellow and purple, say);
- there are 19 gum balls of one color and 1 of the other;
- you have reached in and drawn a ball at random; it is yellow.

If you make it your policy under such situations to bet that yellow is the predominant color, in the long run you will be right 19 out of 20 times. Similarly, when I say that a certain survey method has a margin of error of plus or minus E at a level of confidence of x%, what I mean is that when that method is used over and over, in x% of all cases the true value of the parameter will be within E of the statistic.
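The "used over and over" reading can be imitated on a computer. The sketch below is only an illustration under stated assumptions: it pretends we know the true proportion (which a real pollster does not), repeats the same survey many times, and counts how often the statistic lands within E of the parameter. The function name `coverage`, the seed, and the number of trials are all hypothetical choices.

```python
import random

def coverage(n, p, margin, trials, seed=0):
    """Fraction of simulated surveys whose sample proportion is within
    `margin` of the true proportion p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # One survey: n independent respondents, each having the
        # characteristic with probability p.
        count = sum(1 for _ in range(n) if rng.random() < p)
        # Small tolerance guards against floating-point rounding at the boundary.
        if abs(count / n - p) <= margin + 1e-12:
            hits += 1
    return hits / trials

# Repeat a survey of 100 people 2000 times, with true proportion 40% and E = 0.10.
print(coverage(n=100, p=0.4, margin=0.1, trials=2000))
```

The fraction printed plays the role of x% in the statement above: it describes the long-run success rate of the method, not the correctness of any single survey.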

The margin of error and level of confidence depend on
the **sample size** (and NOT on population size):

- The size of the population being studied---provided it is much bigger than the samples and provided that the sample is truly random---does not matter. (In practice, one of the greatest challenges to the researcher is to get a truly random sample.)
- The margin of error at 95% confidence is about equal to or smaller than the square root of the reciprocal of the sample size. Thus, samples of 400 have a margin of error of less than around 1/20 at 95% confidence.
- To halve the margin of error at a given confidence level, quadruple the sample size.
- The margin of error and the level of confidence are tied together. A better (i.e., narrower) margin of error may be traded for a lesser level of confidence, or a higher level of confidence may be obtained by tolerating a larger margin of error.
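The square-root rule in the list above is easy to turn into arithmetic. The sketch below treats 1/√n as a rough upper bound on the margin of error at 95% confidence, exactly as stated; the function names `margin_95` and `sample_size_for` are made up for this illustration.

```python
import math

def margin_95(n):
    """Rule-of-thumb bound on the margin of error at 95% confidence: 1/sqrt(n)."""
    return 1 / math.sqrt(n)

def sample_size_for(margin):
    """Smallest n whose rule-of-thumb margin 1/sqrt(n) does not exceed `margin`."""
    # Subtract a tiny amount before rounding up, so that floating-point error
    # in 1 / margin**2 cannot push an exact answer up by one.
    return math.ceil(1 / margin**2 - 1e-9)

# Quadrupling the sample size halves the margin of error.
for n in (100, 400, 1600):
    print(n, margin_95(n))

# Sample size suggested by the rule for a margin of 3 percentage points.
print(sample_size_for(0.03))
```

Under this rule a margin of 3 percentage points calls for a little over 1,100 respondents, regardless of how large the population is.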

The underlying idea that explains how we can determine the reliability of statistics
is the notion of sampling distribution. In order to talk about this, I introduce
a new term: by a "**p**-population",
I mean a very large population that has proportion **p** of some
characteristic that is of interest, e.g., democrat.

- The binomial distribution tells us EXACTLY how likely it is for a random sample of size **n** from a **p**-population to have exactly **k** members with the characteristic of interest.
- Exact values for margin of error and level of confidence of statistics on population proportions are derived from the binomial distribution.
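Written out, the binomial formula referred to here is the standard one. In the notation below, X (the number of sample members with the characteristic) and E (a chosen margin of error) are symbols introduced for this illustration; n, p and k are as in the text, and the population is assumed large enough that the draws are effectively independent.

$$
P(X = k) = \binom{n}{k}\, p^{k} (1-p)^{n-k}, \qquad k = 0, 1, \dots, n
$$

$$
\text{level of confidence for margin } E \;=\; \sum_{k \,:\, |k/n - p| \le E} \binom{n}{k}\, p^{k} (1-p)^{n-k}
$$

The second line simply adds up the probabilities of all sample counts k whose proportion k/n falls within E of p. Its value depends on p as well as on n and E.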

- Explain the vocabulary above, and illustrate with examples.
- Explain what it means when a reporter or researcher says that a poll has a margin of error of 3 percentage points (say) at a level of confidence 95% (say).
- Use a table to determine the levels of confidence and margins of error that can be obtained with various sample sizes when attempting to determine population proportions.
- Use the square root law to estimate the sample size needed to get a given margin of error at better than 95% confidence. (See text, page 350.)

**Assessments:**

- A jar of colored beads may be an analogy for more meaningful situations that you might encounter (e.g., in the news). List some examples and draw the analogy explicitly. For example: someone wants to predict the outcome of an election by means of an exit poll. All the people who voted are analogous to all the beads in the jar. The color of bead is analogous to the vote, e.g., *color* **:** *bead* **: :** *candidate voted for* **:** *voter*. The people who are questioned in the poll are analogous to the sample.
- Suppose a large population is 40% red. Imagine that you have drawn a sample of size 20 from this population. Describe what you think a typical sample might be like.
- Suppose that you have drawn a sample of size 20 from a population of unknown proportion red, and that your sample is 40% red. What do you think you can deduce about the population?
- A random sample of size 100 from a population of voters is 52% Republican. What do you think the true proportion of Republicans in the population is?
- Do you know anything *more* than just that the true proportion is near 52%?
- Imagine a large bin with pieces of paper, or a jar filled with colored beads. Describe what we would do in order to estimate the sampling distribution empirically.
- If we draw 1000 samples, each of size 400, from a population that is 30% red, then how many samples will have a statistic of exactly 30% (the population proportion that you decided to work with)? What will the greatest deviation from p be?
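One way to explore the last two questions by computer rather than with a physical jar is sketched below. It is an illustration only: the function name `simulate_sample_counts`, the seed, and the use of independent random draws to stand in for sampling from a large population are all assumptions, and a different seed will give somewhat different numbers.

```python
import random
from collections import Counter

def simulate_sample_counts(num_samples, n, p, seed=0):
    """Draw `num_samples` samples of size n from a large population with
    proportion p red; return the number of red items in each sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(1 for _ in range(n) if rng.random() < p)
            for _ in range(num_samples)]

n, p = 400, 0.3
counts = simulate_sample_counts(num_samples=1000, n=n, p=p)

# Empirical sampling distribution: red count -> number of samples with that count.
freq = Counter(counts)

exactly_p = freq[round(n * p)]                 # samples that are exactly 30% red
max_dev = max(abs(c / n - p) for c in counts)  # greatest deviation from p

print("samples with exactly 30% red:", exactly_p)
print("greatest deviation from p:", max_dev)
```

Dividing each entry of `freq` by the number of samples gives relative frequencies, which is the empirical estimate of the sampling distribution.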