# Random Sampling and Estimation Part D: The Effect of Sample Size (30 minutes)

In This Part: Sample Size 20
All of our estimates thus far have been based on a sample size of 10 randomly selected sub-regions out of 100. In this part, we will examine the effects of changing the sample size to 20 sub-regions.

Here is a sequence of 20 random numbers selected by sampling without replacement:

81 48 66 94 87 60 51 30 92 97 00 41 27 12 38 64 93 79 50 59
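A sequence like this one can be produced with a short script. The sketch below is only an illustration: the function name `draw_subregions` and the seed are assumptions, and it will not reproduce the exact numbers listed above.

```python
import random

def draw_subregions(n_subregions=100, sample_size=20, seed=None):
    """Return sample_size distinct sub-region labels from 0..n_subregions-1."""
    rng = random.Random(seed)
    # random.sample never repeats an element, so this is sampling
    # without replacement.
    return rng.sample(range(n_subregions), sample_size)

labels = draw_subregions(seed=1)  # the seed is an arbitrary choice
# Print with two digits so that sub-region 0 appears as "00".
print(" ".join(f"{label:02d}" for label in labels))
```

Because no label can repeat, every one of the 100 sub-regions has the same chance of appearing in the sample.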

Here is the corresponding sample of 20 sub-regions:

As before, we estimate the total number of penguins in the region by finding the mean of our sample, and then multiplying by 100 (the number of sub-regions):

100 x [(5 + 6 + 5 + 6 + 3 + 7 + 4 + 5 + 5 + 7 + 5 + 5 + 4 + 4 + 5 + 6 + 7 + 4 + 5 + 4)/20] = 510
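The same arithmetic can be checked in a few lines of Python. The counts are the 20 values from the calculation above; the function name `estimate_total` is an assumption made for this sketch.

```python
# Penguin counts for the 20 sampled sub-regions, as listed above.
counts = [5, 6, 5, 6, 3, 7, 4, 5, 5, 7, 5, 5, 4, 4, 5, 6, 7, 4, 5, 4]
N_SUBREGIONS = 100

def estimate_total(sample_counts, n_subregions=N_SUBREGIONS):
    """Estimate the population total as n_subregions times the sample mean."""
    # Multiply before dividing so the result is not affected by the
    # floating-point rounding of the intermediate mean (5.1).
    return n_subregions * sum(sample_counts) / len(sample_counts)

estimate = estimate_total(counts)
print(estimate)  # 510.0
```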

This estimate is very accurate (it is within 10 of the actual number of penguins). Let’s now investigate the effect that increasing the sample size has on the accuracy of our estimation procedure.

In This Part: Comparing Sample Sizes 10 and 20
In order to investigate whether samples of 20 sub-regions are more likely to produce better estimates than samples of 10 sub-regions, you will need to consider repeated sampling results for samples of size 20.
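Repeated sampling of this kind can also be simulated. The sketch below assumes a made-up population of 100 sub-regions whose counts total 500; these are not the course's actual counts, and the names `POPULATION` and `simulate_estimates` are hypothetical.

```python
import random
import statistics

# Hypothetical population: penguin counts for 100 sub-regions, total 500.
# These counts are invented for illustration; they are not the course's data.
POPULATION = [3] * 10 + [4] * 20 + [5] * 40 + [6] * 20 + [7] * 10

def simulate_estimates(population, sample_size, n_estimates=100, seed=0):
    """Draw n_estimates samples without replacement and return the estimates."""
    rng = random.Random(seed)
    n = len(population)
    estimates = []
    for _ in range(n_estimates):
        sample = rng.sample(population, sample_size)  # without replacement
        estimates.append(n * sum(sample) / sample_size)
    return estimates

estimates_10 = simulate_estimates(POPULATION, 10)
estimates_20 = simulate_estimates(POPULATION, 20)

# Compare the sample-to-sample spread of the two sets of estimates.
print("SD, size 10:", statistics.stdev(estimates_10))
print("SD, size 20:", statistics.stdev(estimates_20))
```

With a population like this one, the estimates from samples of size 20 would typically show the smaller standard deviation, although any single run of 100 estimates is itself subject to chance.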

Here is the stem and leaf plot for 100 estimates of sample size 10:

Here is the stem and leaf plot for 100 estimates of sample size 20:

Problem D1
Compare the two distributions above. In particular, look at how many estimates in each distribution fall in the interval 450 to < 550 (i.e., the 4H and 5L stems). What does this suggest about the effect of sample size on the accuracy of estimation?

Which distribution has more estimates “closer” to the actual answer of 500?

Problem D2
a. Use the 100 estimates from samples of size 20 to determine the proportion of estimates in each of the intervals.

b. Compare the proportions within the six intervals for the two different sample sizes. What does this suggest about the effect of sample size on the accuracy of the estimation procedure?

In summary, as the sample size increases, the distribution of the estimates becomes more concentrated. Consequently, a larger sample size generally improves the accuracy of the estimation procedure.
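The proportions asked for in Problem D2 can be tallied with a short helper. This is a minimal sketch: the function name `proportion_in_interval` and the ten example estimates are assumptions for illustration, not the 100 estimates from the stem and leaf plots.

```python
def proportion_in_interval(estimates, low, high):
    """Return the proportion of estimates e with low <= e < high."""
    inside = sum(1 for e in estimates if low <= e < high)
    return inside / len(estimates)

# Hypothetical estimates, for illustration only.
example_estimates = [430, 455, 470, 490, 500, 505, 520, 545, 560, 610]

# Seven of the ten values fall in the interval 450 to < 550.
print(proportion_in_interval(example_estimates, 450, 550))  # 0.7
```

Running the same helper on the estimates for each sample size, interval by interval, gives the two columns of proportions to compare.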

In This Part: Box Plot Comparisons
In the previous discussion, you investigated how increasing the sample size does two things:
• Decreases the sample-to-sample variation in the estimates
• Produces a higher proportion of estimates closer to the actual population size

We can also use another familiar method to explore this phenomenon: the Five-Number Summary and box plot.

Problem D3
Here is the stem and leaf plot for the 100 estimates from samples of size 10:

Use the stem and leaf plot to determine the Five-Number Summary for these estimates. These questions may help you along:
a. What is the position of the median, and which two values are used to calculate it?
b. If there are 50 values in each half, how are the quartiles calculated?
c. Complete the Five-Number Summary table:

Problem D4
Generate the Five-Number Summary for this stem and leaf plot of the 100 estimates based on samples of size 20:

Since the number of estimates is the same as in Problem D3, the quartiles and median will be in the same positions. Count the values in increasing order to find them.
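The counting procedure can be sketched in code. The function below assumes the convention used in these problems: the median splits the ordered list into two halves, and each quartile is the median of its half (with an odd number of values, the middle value is left out of both halves, which is an assumption of this sketch). The name `five_number_summary` and the test data 1 to 100 are illustrative only.

```python
import statistics

def five_number_summary(values):
    """Return min, Q1, median, Q3, and max using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]      # the smallest n // 2 values
    upper = ordered[n - half:]  # the largest n // 2 values
    return {
        "min": ordered[0],
        "Q1": statistics.median(lower),
        "median": statistics.median(ordered),
        "Q3": statistics.median(upper),
        "max": ordered[-1],
    }

# Illustration with the integers 1..100 (not the course's estimates):
# the median averages the 50th and 51st values, and each quartile
# averages the 25th and 26th values of its half.
summary = five_number_summary(range(1, 101))
print(summary)
```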

Problem D5
Create two box plots for the Five-Number Summaries you generated in Problems D3 and D4, placing them side by side on the same scale to make them easier to compare.

Problem D6
What do the box plots suggest about the effect of sample size on the accuracy of the estimates? In particular, how do the box plots illustrate the following:
a. How much the estimates vary from sample to sample
b. How close the estimates are to the actual value of 500

Video Segment
In this video segment, the participants discuss what percentages of their data fell in particular interval ranges for samples of size 10 and 20. Professor Kader then introduces the Central Limit Theorem to further discuss the connection between probability and statistics. What is the give-and-take between selecting an interval range and sample size when designing a statistical investigation? How would you use this information to plan a statistical investigation? How can you be more precise when choosing a sample size? How can you be more accurate?

### Solutions

Problem D1
There are more estimates from the distribution for sample size 20 that fall in the 4H and 5L stems (i.e., in the range 450-549). This suggests that the estimates from 20 sub-regions are more accurate.

Problem D2
a. Here is the completed table:

b. Each interval contains a higher proportion of the estimates from samples of 20 sub-regions. For instance, the interval 450 to < 550 contains 83/100 of the estimates from samples of size 20, compared to 69/100 of the estimates from samples of size 10. A higher proportion of the estimates falls within 50 penguins of the actual population size (500) when samples of size 20 are used. This suggests that increasing the sample size noticeably improves the accuracy of the estimates.

Problem D3
a. The median is in position (100 + 1)/2 = 50.5, so it is the average of the 50th and 51st values in the ordered list. Each of these values is 500.
b. The quartiles will be at position (50 + 1)/2 = 25.5, so they are the average of the 25th and 26th values in their respective halves.
c. Here is the completed table:

Problem D4

Here is the completed table:

Problem D5
Here are the completed box plots:

Problem D6
a. The sample-to-sample variation goes down as the sample size increases. This is exhibited by the shrinking box portion of the graphs.
b. The estimates are closer to the actual value as the sample size increases. Both the range and the interquartile range decrease significantly from sample size 10 to sample size 20.
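The two measures of spread mentioned here follow directly from a Five-Number Summary. The sketch below uses made-up summary values purely for illustration (they are not the values from the completed tables), and the function name `spread_measures` is an assumption.

```python
def spread_measures(minimum, q1, q3, maximum):
    """Return the range and interquartile range from Five-Number Summary values."""
    return {"range": maximum - minimum, "iqr": q3 - q1}

# Hypothetical Five-Number Summary values, for illustration only.
example = spread_measures(minimum=300, q1=450, q3=550, maximum=700)
print(example)  # {'range': 400, 'iqr': 100}
```

The interquartile range corresponds to the length of the box in a box plot, and the range to the distance between the ends of the whiskers.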