Learning Math: Data Analysis, Statistics, and Probability
Random Sampling and Estimation Part D: The Effect of Sample Size (30 minutes)
In This Part: Sample Size 20
All of our estimates thus far have been based on a sample size of 10 randomly selected sub-regions out of 100. In this part, we will examine the effects of changing the sample size to 20 sub-regions.
Here is a sequence of 20 random numbers selected by sampling without replacement:
81 48 66 94 87 60 51 30 92 97 00 41 27 12 38 64 93 79 50 59
Here is the corresponding sample of 20 sub-regions:
As before, we estimate the total number of penguins in the region by finding the mean of our sample and then multiplying by 100 (the number of sub-regions):
100 x [(5 + 6 + 5 + 6 + 3 + 7 + 4 + 5 + 5 + 7 + 5 + 5 + 4 + 4 + 5 + 6 + 7 + 4 + 5 + 4)/20] = 510
This estimate is very accurate (it is within 10 of the actual number of penguins). Let’s now investigate the effect that increasing the sample size has on the accuracy of our estimation procedure.
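The arithmetic above can be checked with a short Python sketch. The 20 counts are copied from the calculation in the text; the function name `estimate_total` is only an illustrative choice, not something from the course materials.

```python
# Penguin counts in the 20 sampled sub-regions, copied from the text.
counts = [5, 6, 5, 6, 3, 7, 4, 5, 5, 7, 5, 5, 4, 4, 5, 6, 7, 4, 5, 4]

NUM_SUBREGIONS = 100  # the region is divided into 100 sub-regions


def estimate_total(sample_counts, num_subregions=NUM_SUBREGIONS):
    """Scale the sample mean up to the whole region.

    Multiplying before dividing keeps the intermediate value an exact
    integer, which avoids a small floating-point rounding error.
    """
    return num_subregions * sum(sample_counts) / len(sample_counts)


estimate = estimate_total(counts)
print(estimate)  # 510.0
```

The sample total is 102 penguins in 20 sub-regions, so the sample mean is 5.1 and the scaled-up estimate is 510.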
In This Part: Comparing Sample Sizes 10 and 20
In order to investigate whether samples of 20 sub-regions are more likely to produce better estimates than samples of 10 sub-regions, you will need to consider repeated sampling results for samples of size 20.
Here is the stem and leaf plot for 100 estimates of sample size 10:
Here is the stem and leaf plot for 100 estimates of sample size 20:
Compare the two distributions above. In particular, look at how many estimates for each fall in the interval 450 to < 550 (i.e., the 4H and 5L stems). What does this suggest about the effect of sample size on the accuracy of estimation?
Which distribution has more estimates “closer” to the actual answer of 500?
Now let’s revisit our table of intervals.
a. Use the 100 estimates from samples of size 20 to determine the proportion of estimates in each of the intervals.
b. Compare the proportions within the six intervals for the two different sample sizes. What does this suggest about the effect of sample size on the accuracy of the estimation procedure?
In summary, as the sample size increases, the distribution of the estimates becomes more concentrated. Consequently, a larger sample size generally improves the accuracy of the estimation procedure.
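The repeated-sampling comparison can also be explored by simulation. The sketch below is a rough illustration, not the course's procedure: the 100 sub-region counts used in the lesson are not reproduced in this text, so the `population` list is a hypothetical stand-in that happens to total 500 penguins, and the function names are assumptions made for the example.

```python
import random

# HYPOTHETICAL population: the lesson's actual sub-region counts are not
# available here, so we invent 100 counts that total 500 penguins.
population = [3] * 10 + [4] * 20 + [5] * 40 + [6] * 20 + [7] * 10


def estimate_total(sample_size, rng):
    """Sample sub-regions without replacement and scale the mean by 100."""
    sample = rng.sample(population, sample_size)
    return len(population) * sum(sample) / sample_size


def proportion_close(sample_size, trials=100, seed=0):
    """Proportion of estimates that fall in the interval 450 to < 550."""
    rng = random.Random(seed)
    estimates = [estimate_total(sample_size, rng) for _ in range(trials)]
    return sum(450 <= e < 550 for e in estimates) / trials


for n in (10, 20):
    print("sample size", n, "proportion in 450 to < 550:", proportion_close(n))
```

With only 100 trials the proportions vary from run to run, just as the lesson's stem and leaf plots would. Raising `trials` gives a steadier comparison between the two sample sizes.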
In This Part: Box Plot Comparisons
In the previous discussion, you investigated how increasing the sample size does two things:
• Decreases the sample-to-sample variation in the estimates
• Produces a higher proportion of estimates closer to the actual population size
We can also use another familiar method to explore this phenomenon: the Five-Number Summary and box plot.
Here is the stem and leaf plot for the 100 estimates from samples of size 10:
Use the stem and leaf plot to determine the Five-Number Summary for these estimates. These questions may help you along:
a. What is the position of the median, and which two values are used to calculate it?
b. If there are 50 values in each half, how are the quartiles calculated?
c. Complete the Five-Number Summary table:
Generate the Five-Number Summary for this stem and leaf plot of the 100 estimates based on samples of size 20:
Since the number of estimates is the same as Problem D3’s, the quartiles and median will be in the same positions. Count the values in increasing order to find them.
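The counting procedure for the Five-Number Summary can be sketched in Python. This is a minimal illustration under one assumption: the quartiles are taken as the medians of the lower and upper halves, with the middle value left out of both halves when the number of values is odd. The data in the usage example (the integers 1 to 100) is stand-in data, not the lesson's estimates.

```python
def median_of(sorted_values):
    """Median of an already sorted list, i.e. the value at position (n + 1) / 2."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    # Even n: average the two middle values (the 50th and 51st when n = 100).
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]        # values below the median position
    upper_half = ordered[(n + 1) // 2:]   # values above the median position
    return (
        ordered[0],
        median_of(lower_half),
        median_of(ordered),
        median_of(upper_half),
        ordered[-1],
    )


# Stand-in data with 100 values, so the positions match the problems above.
print(five_number_summary(range(1, 101)))  # (1, 25.5, 50.5, 75.5, 100)
```

With 100 values, each half holds 50 values, so each quartile is the average of the 25th and 26th values in its half.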
Create two box plots for the Five-Number Summaries you generated in Problems D3 and D4, placing them side by side on the same scale to make them easier to compare.
What do the box plots suggest about the effect of sample size on the accuracy of the estimates? In particular, how do the box plots illustrate the following:
a. How much the estimates vary from sample to sample
b. How close the estimates are to the actual value of 500
In this video segment, the participants discuss what percentages of their data fell in particular interval ranges for samples of size 10 and 20. Professor Kader then introduces the Central Limit Theorem to further discuss the connection between probability and statistics. What is the give-and-take between selecting an interval range and a sample size when designing a statistical investigation? How would you use this information to plan a statistical investigation? How can you be more precise when choosing a sample size? How can you be more accurate?
There are more estimates from the distribution for sample size 20 that fall in the 4H and 5L stems (i.e., in the range 450-549). This suggests that the estimates from 20 sub-regions are more accurate.
a. Here is the completed table:
b. Each interval contains a higher proportion of the estimates when samples of 20 sub-regions are used. For instance, the interval 450 to < 550 contains 83 of the 100 estimates from samples of size 20, compared to 69 of the 100 estimates from samples of size 10. In other words, a higher proportion of the estimates falls within 50 penguins of the actual population size (500) when samples of size 20 are used. This suggests that the increased sample size has a significant effect on the accuracy of the estimates.
a. The median is in position (100 + 1)/2 = 50.5, so it is the average of the 50th and 51st values in the ordered list. Each of these values is 500.
b. The quartiles will be at position (50 + 1)/2 = 25.5, so they are the average of the 25th and 26th values in their respective halves.
c. Here is the completed table:
Here is the completed table:
Here are the completed box plots:
a. The sample-to-sample variation goes down as the sample size increases. This is exhibited by the shrinking box portion of the graphs.
b. The estimates are closer to the actual value as the sample size increases. Both the range and the interquartile range decrease significantly when moving from sample size 10 to sample size 20.
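The two measures of spread mentioned above, the range and the interquartile range, can be read directly off a Five-Number Summary. The sketch below shows the subtraction involved. The two summaries are invented for illustration only and are not the lesson's completed tables; the function name `spread` is likewise an assumption.

```python
def spread(summary):
    """Return (range, interquartile range) for a Five-Number Summary.

    summary is a tuple (minimum, Q1, median, Q3, maximum).
    """
    minimum, q1, _median, q3, maximum = summary
    return maximum - minimum, q3 - q1


# INVENTED values for illustration; NOT the lesson's completed tables.
size_10 = (400, 470, 500, 530, 600)
size_20 = (440, 480, 500, 520, 560)

for label, summary in (("n = 10", size_10), ("n = 20", size_20)):
    value_range, iqr = spread(summary)
    print(label, "range:", value_range, "IQR:", iqr)
```

On a box plot, the range is the distance between the ends of the whiskers and the interquartile range is the length of the box, so smaller values for both correspond to a visibly narrower plot.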