Learning Math: Data Analysis, Statistics, and Probability
Random Sampling and Estimation Part C: Investigating Variation in Estimates (45 minutes)
In This Part: Using a Stem and Leaf Plot
In Part B, you obtained several different estimates for the total number of penguins in this region based on the different samples you chose. See Note 5 below.
We can use a stem and leaf plot to help us organize the estimates and to determine any patterns that exist in the distribution of the estimates.
Using the same method you used to generate your own estimates, 100 estimates of the penguin population count were produced from independently selected random samples of size 10. Here are these 100 estimates in a stem and leaf plot, where the intervals are of size 50:
Note that in this stem and leaf plot, the spacing on the stems is 50. For example, the stem marked “3L” displays all the estimates between 300 and 349, while the stem marked “3H” displays all the estimates between 350 and 399. Also, since each sample covers 10 of the 100 sub-regions, each estimate is the sample’s penguin count multiplied by 10, so all the estimates are multiples of 10.
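The stem layout described above can be sketched in a few lines of Python. This is only an illustration: the function name and the short list of estimates are hypothetical (they are not the 100 estimates from this session), and the leaf is assumed to be the tens digit, which is enough when every estimate is a multiple of 10.

```python
from collections import defaultdict

def split_stem_and_leaf(estimates):
    """Group three-digit estimates into stems of width 50.

    The hundreds digit is the stem; "L" holds values ending in 00-49 and
    "H" holds values ending in 50-99. The leaf is the tens digit.
    """
    rows = defaultdict(list)
    for value in sorted(estimates):
        hundreds, remainder = divmod(value, 100)
        half = "L" if remainder < 50 else "H"
        rows[(hundreds, half)].append(remainder // 10)
    # Order stems by hundreds digit, with "L" before "H" within each digit.
    ordered = sorted(rows, key=lambda key: (key[0], key[1] == "H"))
    return [
        f"{hundreds}{half} | {' '.join(str(leaf) for leaf in rows[(hundreds, half)])}"
        for hundreds, half in ordered
    ]

# Hypothetical estimates, NOT the 100 values from the session.
sample_estimates = [360, 390, 420, 450, 470, 480, 500, 500, 510, 530, 560, 610]
for line in split_stem_and_leaf(sample_estimates):
    print(line)
```

With these made-up values, 360 and 390 land on the "3H" stem and 420 lands on "4L", matching the rule that "L" covers the lower half of each hundred and "H" the upper half.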
a. Based on the stem and leaf plot of these 100 estimates, make a guess for the actual number of penguins in the region.
b. Give an interval of values in which you are fairly certain the actual number of penguins in the region lies. (This interval should include the guess you made in the question above!)
The interval should include most of the data in the stem and leaf plot. For example, “between 200 and 400” would be a very poor interval of values, since it leaves out most of the estimates.
In This Part: Judging the Quality of Estimates
Here is the entire region we’ve been studying. If you actually count all the penguins, you’d find that there are exactly 500. Notice that the number of penguins varies from sub-region to sub-region. Some of the squares in the grid contain as few as one penguin, and some contain as many as nine. On average, each of the 100 sub-regions contains five penguins.
Now that you know the actual total number of penguins, let’s examine the stem and leaf plot of the 100 estimates from sample size 10:
a. What is the best estimate? For how many samples did this estimate occur?
b. What are the six worst estimates?
c. What percentage of the estimates are 50 or fewer penguins away from the actual total?
d. What percentage of the estimates are 100 or fewer penguins away from the actual total?
Since there are 100 estimates, each percentage is the same as the number of estimates you count within the given distance.
In This Part: Intervals
The six worst estimates are shown in pink on the stem and leaf plot:
These six estimates are the most different from 500. Specifically, they differ from 500 by more than 100; they are either less than 400 or greater than 600.
The other 94 estimates differ from 500 by 100 or less. These 94 estimates fall between 400 and 600, inclusive.
We’ll refer to these inclusive ranges of values as intervals, and use such intervals to classify the estimates:
• Ninety-four of 100 (94/100) estimates fall between 400 and 600 (inclusive).
• Six of 100 (6/100) estimates fall outside of this interval.
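The classification above amounts to counting how many estimates lie within a given distance of the actual value. Here is a minimal Python sketch of that count. The function name and the ten estimates are hypothetical and are not the session's data; the interval is treated as inclusive, as in the text.

```python
from fractions import Fraction

def proportion_within(estimates, center, radius):
    """Fraction of estimates in [center - radius, center + radius], inclusive."""
    inside = sum(1 for e in estimates if abs(e - center) <= radius)
    return Fraction(inside, len(estimates))

# Hypothetical estimates, NOT the 100 values from the session.
estimates = [360, 410, 450, 480, 500, 500, 520, 560, 600, 620]
inside = proportion_within(estimates, center=500, radius=100)
outside = 1 - inside
print(inside, outside)  # 4/5 1/5
```

The two proportions always add up to 1, which is why the share outside an interval can be found by subtraction.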
a. What proportion of the estimates are 75 or fewer penguins away from the actual value?
b. What proportion of the estimates are more than 75 penguins away from the actual value?
Use this stem and leaf plot to complete the table below:
In This Part: Describing Intervals
These six intervals provide a description of how widely the estimates vary from sample to sample, and how close the estimates are to the actual value of 500. See Note 6 below.
The interval from 350 to 650 is the largest interval in the table above; its interval range is 300. This tells us two things:
• All (100/100) of the estimates are between 350 and 650, a range of 300.
• These estimates fall within 150 (the interval radius) of 500.
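The interval 475 to 525 is the smallest interval in the table. This tells us two things:
• Only 37 of 100 (37/100) estimates are between 475 and 525, a range of 50.
• These estimates fall within 25 (the interval radius) of 500.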
a. Explain why it is useful for the proportion of estimates in an interval to be high.
b. Explain why it is useful for the interval range to be small.
c. What happens to the proportion of estimates in the interval as the interval range decreases?
In This Part: Probabilities
We have worked with a stem and leaf plot of the distribution of estimates of a population based on 100 random samples of size 10. The display is reasonably bell-shaped, with estimates occurring on both sides of 500 (the actual total number of penguins). There is a concentration of estimates around 500, with fewer estimates occurring as you move farther away from 500. See Note 7 below.
We can think of these estimates as “typical” of what you would get if you were to select another 100 samples of size 10. That is, you would generate a similar (but not exactly the same) distribution. The stem and leaf plot would also be similar, and you would expect about the same proportions of estimates to fall into the intervals we identified earlier.
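One way to see what "similar, but not exactly the same" means is to simulate the process. The sketch below rests on several assumptions that are not from the session: the region is a made-up list of 100 sub-region counts (each between 1 and 9, totaling 500), samples are drawn without replacement, and all function names and seeds are hypothetical.

```python
import random

# Hypothetical region: ten copies each of the counts 1..9 (sum 450)
# plus ten extra 5s (sum 50), giving 100 sub-regions and 500 penguins.
REGION = [count for count in range(1, 10) for _ in range(10)] + [5] * 10

def estimate_total(region, sample_size, rng):
    """Scale the penguin count in a random sample up to the whole region."""
    sample = rng.sample(region, sample_size)  # drawn without replacement
    # Exact here because 100 sub-regions is a multiple of the sample size 10.
    return sum(sample) * len(region) // sample_size

def simulate(num_samples, sample_size, seed):
    """Produce one batch of estimates from independent random samples."""
    rng = random.Random(seed)
    return [estimate_total(REGION, sample_size, rng) for _ in range(num_samples)]

def share_within(estimates, center, radius):
    """Proportion of estimates within `radius` of `center`, inclusive."""
    return sum(abs(e - center) <= radius for e in estimates) / len(estimates)

# Two separate batches of 100 estimates from samples of size 10.
for seed in (1, 2):
    batch = simulate(100, 10, seed)
    print(seed, share_within(batch, 500, 50), share_within(batch, 500, 100))
```

Running it with different seeds should give proportions that are close to one another without matching exactly, which is the sense in which one batch of 100 estimates is "typical" of another.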
Under normal circumstances, if you were asked to estimate the size of a population, you wouldn’t already know the population size; otherwise, you wouldn’t need to estimate it! Also, you would not repeatedly select samples as we did in this session. In practice, you take only one sample and base your estimate on its results.
How can you predict how accurate that one sample is likely to be? For our problem of counting penguins, we can use probability to make that prediction, using the “typical” distribution we found for the 100 samples:
Let’s say that your one sample yielded an estimate of 360. This is not a very good estimate, since the actual population size is 500. But since only two of our 100 samples produced this estimate, the probability of coming up with that estimate is only about 2/100.
On the other hand, your sample might generate an estimate of 500, right on target! Your probability for this is approximately 8/100, because eight of the samples produced an estimate of 500.
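The two probabilities above are relative frequencies: the number of samples that produced a given estimate, divided by the total number of samples. A minimal sketch follows, using only the two counts stated in the text; the function and variable names are hypothetical, and counts for other estimates are left out because they are not given here.

```python
TOTAL_SAMPLES = 100

# Counts stated in the text; all other estimates are omitted.
observed_counts = {360: 2, 500: 8}

def empirical_probability(value, counts=observed_counts, total=TOTAL_SAMPLES):
    """Relative frequency of an estimate among the 100 samples.

    Raises KeyError for estimates whose counts are not recorded above.
    """
    return counts[value] / total

print(empirical_probability(360))  # 0.02
print(empirical_probability(500))  # 0.08
```

These are approximations based on 100 samples, so a different batch of 100 samples would give slightly different values.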
Here is the table of intervals from Problem C5:
The presentation used here is based on 100 estimates of the size of the penguin population, which were produced from independently selected random samples of size 10. These are “typical” of what you would expect to get if another 100 samples of size 10 were selected: You would obtain a similar (but not exactly the same) pattern exhibited in the stem and leaf plot of estimates.
Though statistics textbooks might be more likely to use a “continuous” model to illustrate the idea of sampling distributions, this is a somewhat more concrete and accessible way to demonstrate the same concepts.
The use of intervals demonstrated in this session is a very important statistical idea. It is the conceptual basis for confidence interval estimation. More advanced texts will use continuous models, such as the normal distribution, as approximate descriptions of sampling distributions, and then develop interval ideas based on these models. The intent here is to provide an understanding of the concepts in a less formal and perhaps more readily understandable setting.
The normal distribution curve is symmetric and bell-shaped. It is characterized by the mean and the standard deviation (see below). The mean is located at the center of the distribution curve, and the standard deviation determines the width of that curve. Approximately 68% of the data values fall within one standard deviation of the mean, and approximately 95% of the data values fall within two standard deviations of the mean.
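The 68% and 95% figures can be checked with Python's standard library. This sketch uses the standard normal curve (mean 0, standard deviation 1); the function name is hypothetical.

```python
from statistics import NormalDist

def share_within_sd(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

print(round(share_within_sd(1), 4))  # 0.6827
print(round(share_within_sd(2), 4))  # 0.9545
```

The same shares hold for any mean and standard deviation, because measuring distance in standard deviations rescales every normal curve to this one.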
a. A good estimate might be the median of the 100 estimates, which is in position (100 + 1)/2 = 50.5. This means that the median is the average of the 50th and 51st values in the ordered list. Both the 50th and 51st values are 500, so 500 penguins is a good estimate, based on the median.
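The position rule used in this solution can be sketched as follows. The helper name and the short list of values are hypothetical; only the rule (n + 1)/2 comes from the solution above.

```python
from statistics import median

def median_positions(n):
    """1-based positions of the value(s) averaged to get the median of n ordered values."""
    middle = (n + 1) / 2
    if middle.is_integer():
        return [int(middle)]
    # For an even count, the median averages the two values around the middle.
    return [int(middle - 0.5), int(middle + 0.5)]

print(median_positions(100))  # [50, 51]

# Hypothetical ordered values, NOT the session's estimates.
values = [450, 480, 500, 500, 520, 560]
first, second = median_positions(len(values))
by_position = (values[first - 1] + values[second - 1]) / 2
print(by_position == median(values))  # True
```

When the two middle values are equal, as with the 50th and 51st estimates here, their average is simply that shared value.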
b. It seems very likely that the actual number is between 360 and 620, since all 100 estimates fall in this range. A tighter range is 450 to 550, which includes 69 of the 100 estimates.
a. The best estimate is 500, which is exactly right. Our sampling found this estimate eight of 100 times.
b. The six worst estimates are 360, 360, 390, 610, 610, and 620. These are the only six estimates that are more than 100 penguins away from the actual value.
c. These are the estimates between 450 and 550 (inclusive); 69% (69/100) of the estimates are within this range.
d. These are the estimates between 400 and 600 (inclusive); since only six estimates are more than 100 penguins away, 94% (94/100) of the estimates are within this range.
a. These are the estimates between 425 and 575 penguins (inclusive); the proportion is 84/100, since 84 of the 100 estimates are within this range.
b. The proportion is 16/100 (obtained as 1 – 84/100).
Here is the completed table:
a. When the proportion of estimates in an interval is high, it is a strong suggestion that the actual population value lies somewhere in that range.
b. A small interval gives greater precision to the estimates. If we can say that the actual value lies between 475 and 525, it is more meaningful than saying that the actual value lies, say, between 400 and 600.
c. As the interval range decreases, the proportion of estimates in that interval decreases. Thus, there is an important tradeoff: A wide interval will contain more estimates but will be less meaningful, whereas a small interval will be more meaningful but will contain fewer estimates.
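The tradeoff can be seen by tabulating nested intervals around the actual value. The sketch below is an illustration under assumptions: the function name and the ten estimates are hypothetical and are not the session's data. It uses the session's terms, where the interval range is twice the interval radius.

```python
def interval_summary(estimates, center, radii):
    """For each radius, report (low, high, range, share of estimates inside)."""
    rows = []
    for radius in sorted(radii, reverse=True):
        low, high = center - radius, center + radius
        inside = sum(1 for e in estimates if low <= e <= high)
        rows.append((low, high, high - low, inside / len(estimates)))
    return rows

# Hypothetical estimates, NOT the 100 values from the session.
estimates = [360, 410, 450, 480, 500, 500, 520, 560, 600, 620]
for low, high, interval_range, share in interval_summary(estimates, 500, [150, 100, 50, 25]):
    print(f"{low} to {high}: range {interval_range}, share {share:.2f}")
```

Because each smaller interval sits inside the larger ones, the share of estimates can only stay the same or drop as the range shrinks.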
a. The expected probability is 0.84, or 84%, since 84 of the 100 estimates fall in this interval.
b. The probability is 37%, since 37 of the estimates fall in the smallest interval (475 to 525).
c. It is very likely — 94% of the estimates fall in the interval within 100 penguins of the actual total (400 to 600).