## Learning Math: Data Analysis, Statistics, and Probability

# Random Sampling and Estimation Part B: Selecting the Sample (30 minutes)

**In This Part: Fair Sampling**

You may have noticed that your estimates for the total penguin population vary quite a bit depending on both the sample size and which sub-regions were sampled. The decision about how to select a sample is therefore a critical one in statistics. It is important that each part of the population be treated fairly. If you are fair in the selection, then you should obtain a representative sample and thus a fairer estimation procedure.

In earlier sessions, you looked at notions of fairness and randomness and noticed that people have a difficult time being fair or random. So what methods can you use to accomplish fair sampling? See Note 4 below.

**Problem B1**

How might you select 10 sub-regions from the 100 total sub-regions so that you would be most likely to have a “representative” sample for estimating the size of the penguin population in the entire region? You can use the empty chart below to explore your ideas.

To select 10 sub-regions from the 100 total sub-regions in a “fair” way requires that each of the 100 sub-regions has the same chance of being selected. You can accomplish this with random selection. How might you select 10 sub-regions in a random fashion?

**Video Segment**

In this video segment, groups of participants devise methods for collecting a random sample of penguins. Watch this segment after you have completed Problem B1 and compare your method with that of the onscreen participants. Do these methods ensure that the samples will be random?

**In This Part: A Fair Sampling Method**

There are many different ways to randomly select 10 sub-regions. Many of these methods involve initially numbering the 100 sub-regions. In this section, we will use the numbering system below, which numbers the sub-regions from 00 through 99:

Locating number positions is easier if we put digits on the outside borders as shown. Each number in the grid corresponds to a red and blue number combination; the red number is the first digit, and the blue number is the second digit:

**Problem B2**

Think of a way to pick 10 numbers between 00 and 99 at random. (You may prefer to select each digit individually, or to select the entire two-digit number at once.) Then use your method to generate the 10 random numbers.

You may wish to use a random-number-generating device, such as a calculator, a 10-sided die, or computer software, to generate the random numbers.
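If you choose computer software, one possible approach is sketched below in Python. It is only an illustration, not the course's prescribed method: the function name `pick_sub_regions` and its parameters are made up for this example, and it relies on the standard library's `random.sample`, which never returns the same item twice.

```python
import random

def pick_sub_regions(k=10, total=100, seed=None):
    """Return k distinct sub-region labels ('00'..'99') chosen at random.

    Hypothetical helper for illustration; pass a seed to repeat a selection.
    """
    rng = random.Random(seed)
    chosen = rng.sample(range(total), k)  # every sub-region is equally likely
    return [f"{n:02d}" for n in chosen]   # format 5 as '05', and so on

print(pick_sub_regions())
```

Each run prints a different list of 10 two-digit labels unless you supply a seed.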

One possible method for solving Problem B2 is to use two 10-sided dice, one red and one blue. The sub-region can then be determined by the two dice (in the order red, and then blue).
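The two-dice method can also be imitated in software. The following Python sketch assumes each die has faces 0 through 9; the helper name `roll_two_dice` is invented for this example.

```python
import random

def roll_two_dice(rng):
    """Simulate one toss of a red and a blue 10-sided die (faces 0-9)."""
    red = rng.randrange(10)   # first digit
    blue = rng.randrange(10)  # second digit
    return f"{red}{blue}"     # read in the order red, then blue

rng = random.Random()  # use random.Random(42), for example, for repeatable tosses
tosses = [roll_two_dice(rng) for _ in range(10)]
print(tosses)  # the same number may appear more than once
```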

You might notice that the random selection process will sometimes produce duplicates. There is a greater than one-third chance (about 37%) that 10 numbers picked at random between 00 and 99 will include at least one duplicate, and about an 87% chance that 20 such numbers will include at least one duplicate.
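These chances can be checked by first finding the probability that all the numbers are different and then subtracting it from 1. The Python sketch below does this under the assumption that every pick is independent and that each of the 100 numbers is equally likely; `prob_duplicate` is a name chosen for this example.

```python
from math import prod

def prob_duplicate(draws, total=100):
    """Chance that `draws` independent, equally likely picks contain a repeat."""
    # The i-th pick (counting from 0) avoids the earlier ones with
    # probability (total - i) / total.
    p_all_distinct = prod((total - i) / total for i in range(draws))
    return 1 - p_all_distinct

print(round(prob_duplicate(10), 3))  # about 0.372
print(round(prob_duplicate(20), 3))  # about 0.87
```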

For instance, you might find that seven tosses of the dice produced these sub-region choices:

19 22 39 50 34 05 39

If we do not want duplicates, we can skip them until we get 10 distinct numbers, for example:

19 22 39 50 34 05 75 62 87 13

This is called sampling without replacement, since each time we choose a sub-region we remove it from the list of sub-regions we can choose on the next toss of the dice. In some experiments, it may be impractical or impossible to exclude duplicates from the random selection process. If duplicates are allowed, the process is called sampling with replacement.
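The difference between the two kinds of sampling can be sketched in Python as follows. This is one illustration among many; the function names are invented here, and the dice are again assumed to have faces 0 through 9.

```python
import random

def toss(rng):
    """One toss of two 10-sided dice, read as a two-digit number."""
    return f"{rng.randrange(10)}{rng.randrange(10)}"

def sample_without_replacement(k, rng):
    """Keep tossing, skipping repeats, until k distinct numbers appear."""
    chosen = []
    while len(chosen) < k:
        number = toss(rng)
        if number not in chosen:  # a duplicate is skipped
            chosen.append(number)
    return chosen

def sample_with_replacement(k, rng):
    """Toss exactly k times and keep every result, repeats included."""
    return [toss(rng) for _ in range(k)]

rng = random.Random()
print(sample_without_replacement(10, rng))  # always 10 different numbers
print(sample_with_replacement(10, rng))     # may contain repeats
```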

The 10 distinct numbers (19, 22, 39, 50, 34, 05, 75, 62, 87, 13) correspond to these 10 sub-regions:

Here is a look at the number of penguins in each of the 10 sub-regions we selected:

The estimate of the total number of penguins for the entire region based on this random sample of 10 sub-regions is as follows:

100 x [(5 + 6 + 6 + 7 + 5 + 2 + 1 + 5 + 5 + 3)/10] = 100 x (45/10) = 450
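The same arithmetic, the mean count per sampled sub-region multiplied by the 100 sub-regions, can be written as a short Python function. The name `estimate_total` is an assumption made for this sketch; the counts are the ones from the sample above.

```python
def estimate_total(sample_counts, total_sub_regions=100):
    """Scale the sample mean up to the whole region."""
    # Multiply before dividing so whole-number results stay exact.
    return total_sub_regions * sum(sample_counts) / len(sample_counts)

counts = [5, 6, 6, 7, 5, 2, 1, 5, 5, 3]  # penguins in the 10 sampled sub-regions
print(estimate_total(counts))  # 450.0
```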

**Problem B3**

Use the random sample you found in Problem B2 to estimate the total number of penguins in the region. Find your 10 random sub-regions in the chart below:

**Problem B4**

Did you expect your estimate from Problem B3 to equal the estimate of 450 from the example above? Why or why not? What explains this variation? If the sample size were increased to 20 sub-regions, would you expect the variation in the estimates to increase or decrease? Why?

**In This Part: Variation in Estimates**

A computer can perform random sampling and estimation much more quickly than you can by hand. Here are three more random samples of 10 sub-regions generated by a computer.

### Notes

**Note 4**

Take time to develop your own ideas. There are many different ways to randomly select 10 sub-regions. Developing a method of selection will help you clarify the concept as well as provide a tool for the practice of sampling. After you have considered your own methods, you can then investigate the specific methods introduced in Part B.

### Solutions

**Problem B1**

Answers will vary, as there are many possible ways to do this. One possibility is to take the 100 pictures of the sub-regions, shuffle them thoroughly, then look at the first 10. Another is to assign each sub-region a number from 00 to 99, and use the last two digits of the daily lottery number for each of the last 10 days. A commonly used method for choosing the numbers is to use a random-number-generating device, such as a calculator, a die, or computer software.

**Problem B2**

With a calculator, the first two decimal digits of the random number will range from 00 to 99, and each of the 100 values is equally likely. If a number appears more than once, it is rejected, so that 10 different sub-regions are selected. Another idea is to use a 10-sided die or spinner and to generate two random digits by two tosses or spins (and get your 10 random numbers by 20 tosses or spins).

**Problem B3**

Answers will vary, depending on which sub-regions you selected in Problem B2. As an example, the random sequence (96, 74, 61, 21, 49, 37, 82, 35, 18, 68) determines this sample of 10 sub-regions:

The estimate of the total number of penguins is

100 x [(5 + 4 + 4 + 6 + 4 + 5 + 6 + 5 + 3 + 7)/10] = 100 x (49/10) = 490.

**Problem B4**

While it is possible for the two estimates to be equal, it is pretty unlikely, due to the variation in the individual sub-regions. If the number of sub-regions in the sample increases to 20, the variation in the estimates should be reduced. The estimates should be closer to the actual value, but it is no more likely that they will be equal.
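A small simulation can illustrate why larger samples give less variable estimates. The Python sketch below is an illustration only: the actual penguin counts for the 100 sub-regions are not reproduced here, so it invents a hypothetical population of counts between 0 and 9. The function names, the seed, and the number of repetitions are likewise assumptions made for this example.

```python
import random
import statistics

rng = random.Random(2024)

# HYPOTHETICAL population: invented counts for 100 sub-regions,
# not the real penguin data from this session.
population = [rng.randint(0, 9) for _ in range(100)]

def estimate_total(sample_size):
    """Estimate the regional total from one random sample without replacement."""
    sample = rng.sample(population, sample_size)
    return 100 * sum(sample) / sample_size

def spread_of_estimates(sample_size, repetitions=2000):
    """Standard deviation of many repeated estimates for one sample size."""
    estimates = [estimate_total(sample_size) for _ in range(repetitions)]
    return statistics.pstdev(estimates)

print(spread_of_estimates(10))  # spread with samples of 10 sub-regions
print(spread_of_estimates(20))  # spread with samples of 20 sub-regions
```

With this made-up population, the second number printed should come out noticeably smaller than the first, which matches the reasoning above: samples of 20 produce estimates that cluster more tightly around the actual total.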

**Problem B5**

Answers will vary. To determine how many penguins there are in the region, you might calculate the mean or median of the set of five estimates.