Learning Math: Data Analysis, Statistics, and Probability
Statistics As Problem Solving Part D: Bias in Sampling (20 minutes)
In This Part: Population and Sample
In data analysis, we use graphs, tables, and numerical summaries to study the variation present in our data. Often, we want to extend our interpretation to a larger group beyond the particular group studied. Such generalizations are only valid, however, if the data we examine are representative of that larger group. If not, our interpretation may misrepresent the larger group! See Note 4.
The entire group that we want information about is called the population. We can gain information about this group by examining a portion of the population, called a sample.
To gain useful information, the sample must be representative of the population. A representative sample is one in which the relevant characteristics of the sample members are generally the same as the characteristics of the population.
There are several good reasons that we use samples to study populations; chief among them are feasibility and cost. For instance, in a nationwide political survey of the population of all voters in the United States, it would be difficult, if not impossible, to poll every voter. It would also be quite expensive. Statistical theory shows that a survey of about 1,000 carefully selected voters is enough to estimate the opinions of the millions of people in the population of voters, typically to within a few percentage points.
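To see why roughly 1,000 voters can be enough, here is a minimal sketch of the usual approximate 95% margin of error for a proportion. It assumes a simple random sample from a very large population; the function name `margin_of_error` and the default values are illustrative choices, not part of the course materials.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a sample proportion.

    Assumes a simple random sample of size n from a very large population.
    p = 0.5 is the worst case (it gives the widest margin).
    """
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 1_000, 10_000):
    moe = margin_of_error(n)
    print(f"n = {n:>6}: about +/- {moe * 100:.1f} percentage points")
```

Under these assumptions the margin for 1,000 voters is about 3 percentage points, and it depends on the sample size rather than on the size of the population. Note that taking ten times as many voters shrinks the margin only by a factor of about three.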
Another problem in answering questions about a population arises when we want to inspect or test products. For example, testing an airbag to see if it works properly means that we have to destroy it. We certainly can’t test every airbag, but testing a carefully selected sample of airbags will tell us what we need to know about all the airbags in the population.
Think of a statistical question and a population. How could you determine a representative sample of that population? What would be a sample that is not representative?
A population might be the students at a certain school, the members of the Republican party, or all the soda cans shipped to the nearest convenience store this year. A representative sample should share the relevant characteristics of the population.
How we select a sample is extremely important. Improper or biased sample selection can produce misleading conclusions. Sample selection is biased if it systematically favors certain outcomes. If we select only Democrats to participate in a political survey, the outcome will reflect Democrats’ opinions, but not other political parties’. If we personally select a sample of students we know and like for a school survey, we have just eliminated the differing opinions of those whom we do not know and like. We need to select our sample in an unbiased fashion.
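The effect of biased selection can be illustrated with a small simulation. This is only a sketch: the population, the party labels "A" and "B", and the support rates are all made up for illustration and do not describe any real electorate.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 voters as (party, supports_measure) pairs.
# Assumed for illustration: party A supports the measure about 80% of the
# time, party B about 30% of the time.
population = (
    [("A", rng.random() < 0.80) for _ in range(5_000)]
    + [("B", rng.random() < 0.30) for _ in range(5_000)]
)

def support_rate(voters):
    """Fraction of the given voters who support the measure."""
    return sum(supports for _, supports in voters) / len(voters)

# Biased selection: only members of party A can be chosen.
party_a_only = [voter for voter in population if voter[0] == "A"]
biased_sample = rng.sample(party_a_only, 500)

# Unbiased selection: every voter has the same chance of being chosen.
random_sample = rng.sample(population, 500)

print(f"whole population: {support_rate(population):.1%}")
print(f"party A only:     {support_rate(biased_sample):.1%}")
print(f"random sample:    {support_rate(random_sample):.1%}")
```

In this made-up setting, the sample drawn only from party A reports far more support than the population as a whole, while the random sample lands close to the population figure.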
In This Part: Random Sampling
Random sampling is a way to reduce bias in sample selection. For example, to pick a random sample of 20 people out of a population of 1,000, you might put all 1,000 names in a hat, mix them well, and then draw 20 of them. Because every member of the population has an equal chance of being selected, no group is systematically favored. See Note 5.
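The names-in-a-hat procedure can be sketched in a few lines of Python. The placeholder names and the helper `draw_from_hat` are hypothetical; `random.sample` from the standard library draws without replacement, so nobody is picked twice.

```python
import random

def draw_from_hat(names, k, seed=None):
    """Draw k distinct names so that every name is equally likely to be chosen."""
    rng = random.Random(seed)  # pass a seed only if you need a repeatable draw
    return rng.sample(names, k)

# Hypothetical population of 1,000 people, labeled by number.
names = [f"Person {i}" for i in range(1, 1001)]

sample = draw_from_hat(names, 20, seed=42)
print(sample)
```

Leaving `seed` as `None` gives a different draw each time, which is closer to what happens with a real hat.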
In this Interactive Activity, you will have the opportunity to see if you can personally select a sample that is representative of a particular population.
Here are 60 circles. Can you select five circles that best represent the size of all the circles? (Ideally, the average size of your five circles would equal the average size of all the circles.)
Look at the picture for no longer than 20 seconds. Mark the five circles you choose. Use the scale on the picture to measure the diameter of those five circles. Find the average diameter of your sample.
The average diameter of all 60 circles is 1 unit. How close to that is your sample?
(Note that a computer, selecting any five of the 60 circles randomly, might generate average diameters ranging from as small as 0.5 units to as large as 2.2 units.)
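The following sketch shows how a computer might repeat the random selection many times. The 60 diameters are hypothetical stand-ins chosen so that their average is exactly 1 unit; they are not the diameters of the circles in the picture, so the exact range printed will differ from the one quoted above.

```python
import random
from statistics import mean

# Hypothetical diameters (in units) for 60 circles, chosen so the mean is 1.
# These are NOT the diameters from the activity's picture.
diameters = [0.5] * 30 + [1.0] * 20 + [2.0] * 6 + [3.25] * 4

def random_sample_means(values, sample_size=5, trials=10_000, seed=1):
    """Return the average of each of `trials` simple random samples."""
    rng = random.Random(seed)
    return [mean(rng.sample(values, sample_size)) for _ in range(trials)]

sample_means = random_sample_means(diameters)

print(f"population mean:         {mean(diameters):.2f}")
print(f"average of sample means: {mean(sample_means):.2f}")
print(f"smallest sample mean:    {min(sample_means):.2f}")
print(f"largest sample mean:     {max(sample_means):.2f}")
```

Individual random samples of five can miss the true average by a wide margin, but across many samples the misses go in both directions and average out near 1 unit. That is the sense in which random selection is unbiased.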
Can you think of any circumstances in which it would be difficult or impossible to select a simple random sample?
You may have noticed that each of the problems you looked at in this session began with a question. Providing answers to questions like these is the goal of statistics. But sometimes, the variation in our data makes it difficult to answer statistical questions.
In order to identify any patterns present in the variation, we must analyze our data by organizing and summarizing it. Once this analysis is complete, we can interpret the data to answer our questions. In later sessions, we will look at the analysis and interpretation components in more detail.
It’s also important to remember that when you conduct a statistical investigation, the question you pose is designed to investigate a group (“the population”). The results of an investigation involving a sample are frequently used to draw conclusions about the entire population. If an attempt is made to include every individual from the population in a sample, then the investigation is called a census.
Why is a census still considered a sample?
A voter poll taken during the 1936 presidential election provides a good example of the danger of biased sampling. The magazine Literary Digest sent a survey to 10 million Americans to determine how they would vote in the upcoming election between Democrat Franklin Roosevelt and Republican Alf Landon. More than two million Americans responded to this poll, and 60% supported Landon. The magazine published these findings, suggesting that Landon was guaranteed to win the election.
Despite the findings of the poll, however, Roosevelt defeated Landon in one of the largest landslide presidential elections ever. What happened? The sample used in the Literary Digest poll (collected through magazine subscription lists, lists of car owners, and telephone directories) was not representative. Not all Americans at this time owned cars, had telephones, or subscribed to magazines. Moreover, Democrats were much less likely to own a car or have a telephone, and thus were less likely to be included in the sample. As a result, the poll overstated Landon’s support and failed to predict the outcome of the election, despite its very large number of responses.
Good sampling practices rely on some form of random selection in order to remove the bias caused by human involvement in the selection process. The Interactive Activity in Part D is intended to demonstrate how human selection can produce biased results. You are asked to select a sample of five circles from a population of 60 circles in order to estimate the size of the circles in the entire population. You will then compare the accuracy of your sample with the accuracy of a random sample. A bias should appear: Most people tend to pick a sample that greatly overestimates the size of the circles.
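One way to picture this bias is to model a person whose eye is drawn to the larger circles. The sketch below is only a toy model under loud assumptions: the diameters are hypothetical (average 1 unit, not the circles in the picture), and the rule "chance of being picked is proportional to the circle's area" is an illustrative guess about human behavior, not a finding from the course.

```python
import random
from statistics import mean

# Hypothetical diameters for 60 circles with a mean of exactly 1 unit.
diameters = [0.5] * 30 + [1.0] * 20 + [2.0] * 6 + [3.25] * 4

def eye_catching_sample(values, k, rng):
    """Pick k distinct circles, favoring big ones (weight = diameter squared)."""
    remaining = list(values)
    chosen = []
    for _ in range(k):
        weights = [d ** 2 for d in remaining]
        index = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(index))  # remove it so it is not picked twice
    return chosen

rng = random.Random(2)
biased_means = [mean(eye_catching_sample(diameters, 5, rng)) for _ in range(2_000)]
fair_means = [mean(rng.sample(diameters, 5)) for _ in range(2_000)]

print(f"true average diameter:          {mean(diameters):.2f}")
print(f"typical size-biased estimate:   {mean(biased_means):.2f}")
print(f"typical random-sample estimate: {mean(fair_means):.2f}")
```

Under this toy model the size-biased picks overestimate the average diameter again and again, while the random samples center on the true value. The difference is systematic, which is what distinguishes bias from ordinary sample-to-sample variation.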
One such question is “Are girls better math students than boys?” Consider this question for the population of a certain school. A representative sample would be a selection of girls across grade and ability levels and a selection of boys across grade and ability levels. An unrepresentative sample might include only one grade level or one ability level. For example, comparing only the girls and boys in the most challenging math course at the school would rely on a very unrepresentative sample.
It is difficult to select a simple random sample if full information about the population is not available. It would be extremely difficult to select a simple random sample of the world’s ant population, for instance, since it would be impractical (if not impossible) to obtain enough information about the population to set up the random sample.
A census is still considered a sample because there is no guarantee that the attempt to include everyone has been successful. For example, every 10 years, the U.S. population census misses between 1% and 3% of the individuals in the population, and accidentally counts some people more than once. A full census for all but the smallest populations would be impossible to complete successfully.