Learning Math: Data Analysis, Statistics, and Probability
Random Sampling and Estimation Part B: Selecting the Sample (30 minutes)
In This Part: Fair Sampling
You may have noticed that your estimates of the total penguin population vary quite a bit, depending on both the sample size and which sub-regions were sampled. The decision about how to select a sample is therefore a critical one in statistics. It is important that each part of the population be treated fairly. If you are fair in the selection, then you should obtain a representative sample and thus a fairer estimation procedure.
In earlier sessions, you looked at notions of fairness and randomness and noticed that people have a difficult time being fair or random. So what methods can you use to accomplish fair sampling?
How might you select 10 sub-regions from the 100 total sub-regions so that you would be most likely to have a “representative” sample for estimating the size of the penguin population in the entire region? You can use the empty chart below to explore your ideas.
To select 10 sub-regions from the 100 total sub-regions in a “fair” way requires that each of the 100 sub-regions has the same chance of being selected. You can accomplish this with random selection. How might you select 10 sub-regions in a random fashion?
In this video segment, groups of participants devise methods for collecting a random sample of penguins. Watch this segment after you have completed Problem B1 and compare your method with that of the onscreen participants. Do these methods ensure that the samples will be random?
In This Part: A Fair Sampling Method
There are many different ways to randomly select 10 sub-regions. Many of these methods involve initially numbering the 100 sub-regions. In this section, we will use the numbering system below, which numbers the sub-regions from 00 through 99:
Locating number positions is easier if we put digits on the outside borders as shown. Each number in the grid corresponds to a red and blue number combination; the red number is the first digit, and the blue number is the second digit:
Think of a way to pick 10 numbers between 00 and 99 at random. (You may prefer to select each digit individually, or to select the entire two-digit number at once.) Then use your method to generate the 10 random numbers.
You may wish to use a random-number-generating device, such as a calculator, a 10-sided die, or computer software, to generate the random numbers.
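If you choose computer software as your random-number-generating device, a minimal sketch in Python might look like the following. The function name `pick_sub_regions` and its parameters are illustrative assumptions, not part of the course materials. Note that `random.sample` never returns the same number twice, so this version always yields 10 distinct sub-regions.

```python
import random

def pick_sub_regions(k=10, total=100, seed=None):
    """Return k distinct sub-region labels ('00'..'99'), each equally likely."""
    rng = random.Random(seed)               # seed is optional; pass one to repeat a run
    numbers = rng.sample(range(total), k)   # k different numbers from 0..total-1
    return [f"{n:02d}" for n in numbers]    # two-digit labels such as '05'

print(pick_sub_regions())
```

Each run prints a different list of 10 labels unless you pass a fixed `seed`.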
One possible method for solving Problem B2 is to use two 10-sided dice, one red and one blue. Each toss of the pair then determines a sub-region: the red die gives the first digit and the blue die gives the second.
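The two-dice method can be imitated in software. The sketch below assumes each die shows the digits 0 through 9 with equal chance; the name `roll_sub_region` is hypothetical and chosen only for illustration.

```python
import random

def roll_sub_region(rng):
    """Simulate one toss of a red and a blue 10-sided die (digits 0-9)."""
    red = rng.randrange(10)    # first digit
    blue = rng.randrange(10)   # second digit
    return f"{red}{blue}"      # e.g. red 0, blue 5 gives '05'

rng = random.Random()
tosses = [roll_sub_region(rng) for _ in range(10)]
print(tosses)  # the same label can appear more than once
```

Because each toss is independent of the ones before it, nothing prevents the same sub-region from coming up twice.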
You might notice that the random selection process will sometimes produce duplicates. There is a greater than one-third chance (about 37%) that 10 numbers picked at random between 00 and 99 will include at least one duplicate, and about an 87% chance that 20 such numbers will include at least one duplicate.
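These duplicate probabilities can be checked with a short calculation. The chance that k picks are all different is (100/100) x (99/100) x ... x ((100 - k + 1)/100), and the chance of at least one duplicate is 1 minus that product. The function name `prob_duplicate` below is an assumption made for this sketch.

```python
from math import prod

def prob_duplicate(k, total=100):
    """Chance that k independent picks from `total` equally likely numbers repeat at least once."""
    p_all_distinct = prod((total - i) / total for i in range(k))
    return 1 - p_all_distinct

print(round(prob_duplicate(10), 3))  # chance of a duplicate among 10 picks
print(round(prob_duplicate(20), 3))  # chance of a duplicate among 20 picks
```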
For instance, you might find that seven tosses of the dice produced these sub-region choices:
19 22 39 50 34 05 39
If we do not want duplicates, we can skip them until we get 10 distinct numbers, for example:
19 22 39 50 34 05 75 62 87 13
This is called sampling without replacement, since each time we choose a sub-region we remove it from the list of sub-regions we can choose on the next toss of the dice. In some experiments, it may be impractical or impossible to exclude duplicates from the random selection process. If duplicates are allowed, it is called sampling with replacement.
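The skipping rule can be written as a small routine. The sketch below replays the tosses from the example above; it assumes the four tosses after the repeated 39 were 75, 62, 87, and 13, in that order, which is what the list of 10 distinct numbers suggests. The name `first_distinct` is hypothetical.

```python
def first_distinct(tosses, k=10):
    """Keep tosses in order, skipping repeats, until k distinct numbers are found."""
    chosen = []
    for n in tosses:
        if n not in chosen:      # a repeat is skipped (sampling without replacement)
            chosen.append(n)
        if len(chosen) == k:
            break
    return chosen

tosses = [19, 22, 39, 50, 34, 5, 39, 75, 62, 87, 13]   # 05 is written as 5
print(first_distinct(tosses))
# [19, 22, 39, 50, 34, 5, 75, 62, 87, 13]
```

Keeping every toss, repeats included, would instead give a sample drawn with replacement.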
The 10 distinct numbers (19, 22, 39, 50, 34, 05, 75, 62, 87, 13) correspond to these 10 sub-regions:
Here is a look at the number of penguins in each of the 10 sub-regions we selected:
The estimate of the total number of penguins for the entire region based on this random sample of 10 sub-regions is as follows:
100 x [(5 + 6 + 6 + 7 + 5 + 2 + 1 + 5 + 5 + 3)/10] = 100 x (45/10) = 450
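The same arithmetic can be expressed as a short function: take the mean count per sampled sub-region and scale it up to all 100 sub-regions. The name `estimate_total` is an assumption for this sketch; the counts are the ones used in the calculation above.

```python
def estimate_total(sample_counts, total_sub_regions=100):
    """Scale the mean count per sampled sub-region up to the whole region."""
    return total_sub_regions * sum(sample_counts) / len(sample_counts)

counts = [5, 6, 6, 7, 5, 2, 1, 5, 5, 3]   # penguins in the 10 sampled sub-regions
print(estimate_total(counts))              # 450.0
```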
Use the random sample you found in Problem B2 to estimate the total number of penguins in the region. Find your 10 random sub-regions in the chart below:
Did you expect your estimate from Problem B3 to equal your estimate from Problem B2? Why or why not? What explains this variation? If the sample size were increased to 20 sub-regions, would you expect the variation in the estimates to increase or decrease? Why?
In This Part: Variation in Estimates
A computer can perform random sampling and estimation much more quickly than you can by hand. Here are three more random samples of 10 sub-regions generated by a computer.
Take time to develop your own ideas. There are many different ways to randomly select 10 sub-regions. Developing a method of selection will help you clarify the concept as well as provide a tool for the practice of sampling. After you have considered your own methods, you can then investigate the specific methods introduced in Part B.
Answers will vary, as there are many possible ways to do this. One possibility is to take the 100 pictures of the sub-regions, shuffle them thoroughly, then look at the first 10. Another is to assign each sub-region a number from 00 to 99, and use the last two digits of the daily lottery number for each of the last 10 days. A commonly used method for choosing the numbers is to use a random-number-generating device, such as a calculator, a die, or computer software.
With a calculator that produces a random decimal between 0 and 1, the first two decimal digits will range from 00 to 99, and each of the 100 values is equally likely. If a number has already appeared, the repeat is rejected, so that 10 different sub-regions are selected. Another idea is to use a 10-sided die or spinner and to generate each two-digit number from two tosses or spins (so that 10 random numbers take 20 tosses or spins).
Answers will vary, depending on which sub-regions you selected in Problem B2. As an example, the random sequence (96, 74, 61, 21, 49, 37, 82, 35, 18, 68) determines this sample of 10 sub-regions:
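The calculator method can be imitated in software. The sketch below assumes the calculator returns a decimal that is uniformly distributed between 0 and 1, which Python's `random.random()` stands in for here. The helper names `two_digits` and `calculator_sample` are hypothetical.

```python
import random

def two_digits(x):
    """First two decimal digits of a number x between 0 and 1, as an integer 0-99."""
    return int(x * 100)

def calculator_sample(k=10, rng=None):
    """Draw random decimals, keep their first two digits, and reject repeats."""
    if rng is None:
        rng = random.Random()
    chosen = []
    while len(chosen) < k:
        n = two_digits(rng.random())
        if n not in chosen:      # a number that already appeared is rejected
            chosen.append(n)
    return chosen

print([f"{n:02d}" for n in calculator_sample()])
```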
The estimate of the total number of penguins is
100 x [(5 + 4 + 4 + 6 + 4 + 5 + 6 + 5 + 3 + 7)/10] = 100 x (49/10) = 490.
While it is possible for the two estimates to be equal, it is unlikely, because the penguin counts vary from one sub-region to the next. If the number of sub-regions in the sample increases to 20, the variation in the estimates should be reduced. The estimates should tend to be closer to the actual value, but it is no more likely that they will be equal.
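A simulation can illustrate how sample size affects the variation in the estimates. The sketch below does not use the actual penguin chart: the `population` list is made-up data (100 hypothetical counts between 1 and 8) used only to show the pattern, and the function names are assumptions. It draws many random samples of each size and measures how spread out the resulting estimates are.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the run can be repeated

# HYPOTHETICAL data: made-up counts for 100 sub-regions, not the course's chart.
population = [rng.randint(1, 8) for _ in range(100)]

def estimate_total(sample_counts, total_sub_regions=100):
    """Scale the mean count per sampled sub-region up to the whole region."""
    return total_sub_regions * sum(sample_counts) / len(sample_counts)

def spread_of_estimates(sample_size, trials=2000):
    """Standard deviation of the estimates from many samples drawn without replacement."""
    estimates = [
        estimate_total(rng.sample(population, sample_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(estimates)

print("spread with 10 sub-regions:", round(spread_of_estimates(10), 1))
print("spread with 20 sub-regions:", round(spread_of_estimates(20), 1))
```

With this made-up population, the spread for samples of 20 should come out noticeably smaller than the spread for samples of 10.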
Answers will vary. To determine how many penguins there are in the region, you might calculate the mean or median of the set of five estimates.