Learning Math: Data Analysis, Statistics, and Probability
Designing Experiments Part B: Comparative Observational Studies (35 minutes)
In This Part: A New Raisin Question
Let’s begin our investigation of comparative observational studies by returning to the raisin problem from Session 2 and the homework in Session 4. In this session, you will again compare two different brands of raisins: When the weights of the boxes are the same, how does the number of raisins in each box compare between the two brands?
1. Ask a Question
How does the number of raisins in boxes of Brand C and Brand D compare?
2. Collect Appropriate Data
We counted the raisins in 28 boxes of Brand C and 36 boxes of Brand D. Here are the ordered raisin counts for boxes of Brand C and Brand D raisins:
3. Analyze the Data
Here are the mean and median counts for each brand:
According to these data, Brand D typically has a few more raisins than Brand C. On average, Brand D has two more raisins than Brand C, and the median number of Brand D raisins (29) is one more than the median number of Brand C raisins (28).
Based on the means and medians, you might conclude that the number of raisins in a box is about the same for both brands. Although it is useful to look at the means and medians, there are other aspects of the distribution you might want to consider.
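A comparison of means and medians like the one above can be sketched in a few lines of Python. The counts below are hypothetical values made up for illustration; they are not the Brand C and Brand D data, and the names brand_x, brand_y, and center_summary are placeholders.

```python
from statistics import mean, median

# Hypothetical ordered raisin counts, invented for illustration only.
# These are NOT the course's Brand C and Brand D data.
brand_x = [25, 26, 26, 27, 28, 28, 29, 30]
brand_y = [23, 25, 27, 28, 29, 30, 31, 33, 35]


def center_summary(counts):
    """Return the mean and median of a list of counts."""
    return {"mean": mean(counts), "median": median(counts)}


summary_x = center_summary(brand_x)
summary_y = center_summary(brand_y)

# Differences in the two measures of center (brand_y minus brand_x).
mean_gap = summary_y["mean"] - summary_x["mean"]
median_gap = summary_y["median"] - summary_x["median"]

print("Brand X:", summary_x)
print("Brand Y:", summary_y)
print("Difference in means:", mean_gap)
print("Difference in medians:", median_gap)
```

Two single-number gaps like these say where the centers sit, but nothing about how spread out each set of counts is.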
Why is this raisin study observational as opposed to experimental?
In This Part: Using Five-Number Summaries and Box Plots
Comparing two sets of measurements is not quite as simple as comparing two numbers. Because we are comparing a set of 28 measurements for Brand C with a set of 36 measurements for Brand D, any comparison must be based on percentages and not absolute frequencies. A comparison of the Five-Number Summaries is useful, since these quantities divide the ordered data into four groups, with approximately 25% of the data in each group. Here are the Five-Number Summaries for these data:
Here are the comparative box plots for these data:
You might start by comparing the actual values in the Five-Number Summaries. This will tell you where one set of measurements is located relative to the other set:
Note that with the exception of the minimum values, all summary measures for Brand D are higher than for Brand C. This suggests that boxes of Brand D tend to have more raisins than boxes of Brand C. In fact, since the third quartile for Brand D is greater than the maximum for Brand C, more than 25% of the boxes of Brand D have more raisins than any boxes of Brand C.
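The reasoning above can be checked with a short Python sketch. It assumes the quartiles are found as the medians of the lower and upper halves of the ordered data (leaving out the middle value when the number of boxes is odd); other quartile conventions give slightly different values. The counts are hypothetical and are not the Brand C and Brand D data.

```python
from statistics import median


def five_number_summary(counts):
    """Return (Min, Q1, Med, Q3, Max) for a list of at least two counts.

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the ordered data, excluding the middle value when n is odd.
    """
    data = sorted(counts)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return (data[0], median(lower), median(data), median(upper), data[-1])


# Hypothetical ordered raisin counts, invented for illustration only.
brand_x = [25, 26, 26, 27, 28, 28, 29, 30]
brand_y = [23, 25, 27, 28, 29, 30, 31, 33, 35]

summary_x = five_number_summary(brand_x)
summary_y = five_number_summary(brand_y)

# If Q3 of one brand exceeds the Max of the other, roughly the top
# quarter of its boxes hold more raisins than any box of the other brand.
q3_y = summary_y[3]
max_x = summary_x[4]
share_above = sum(1 for c in brand_y if c > max_x) / len(brand_y)

print("Brand X summary:", summary_x)
print("Brand Y summary:", summary_y)
print("Q3 of Y exceeds Max of X:", q3_y > max_x)
print("Share of Y boxes above Max of X:", round(share_above, 2))
```

Working with a share rather than a raw count keeps the comparison fair when the two brands have different numbers of boxes.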
In This Part: The Interquartile Range
Your comparison of the two sets of measurements should also consider the degree of variation within each set. This comparison can be based on the range of all the data (Max – Min) as well as on the range of the middle half of the data (which is called the Q-Spread or Interquartile Range, or simply the IQR), which is found by subtracting Q1 from Q3.
Let’s look at the box plots again, and then calculate the range and the IQR:
Consequently, although Brand D tends to have more raisins per box than Brand C, the smaller range and IQR for Brand C tell us that Brand C is more consistent than Brand D. Since the weights of boxes are the same, this would also suggest that the sizes of the raisins vary less for Brand C.
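Both measures of variation follow directly from a Five-Number Summary. The Python sketch below uses two hypothetical summaries chosen for illustration, not the summaries for Brands C and D; the FiveNumberSummary type and the spread function are placeholder names.

```python
from typing import NamedTuple


class FiveNumberSummary(NamedTuple):
    minimum: float
    q1: float
    median: float
    q3: float
    maximum: float


def spread(summary):
    """Return the range (Max - Min) and the IQR (Q3 - Q1)."""
    return {
        "range": summary.maximum - summary.minimum,
        "iqr": summary.q3 - summary.q1,
    }


# Hypothetical summaries, invented for illustration only.
brand_x = FiveNumberSummary(25, 26, 27.5, 28.5, 30)
brand_y = FiveNumberSummary(23, 26, 29, 32, 35)

spread_x = spread(brand_x)
spread_y = spread(brand_y)

# A brand is treated as more consistent here only if both its range
# and its IQR are smaller; otherwise the comparison is left undecided.
if spread_x["range"] < spread_y["range"] and spread_x["iqr"] < spread_y["iqr"]:
    more_consistent = "X"
elif spread_y["range"] < spread_x["range"] and spread_y["iqr"] < spread_x["iqr"]:
    more_consistent = "Y"
else:
    more_consistent = "undecided"

print("Brand X:", spread_x)
print("Brand Y:", spread_y)
print("More consistent brand:", more_consistent)
```

Requiring both measures to agree is a deliberately cautious choice: the range depends only on the two extreme boxes, while the IQR describes the middle half of the data.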
Brand A raisins come in boxes of the same weight as Brands C and D. Here are the ordered counts for 30 boxes of Brand A raisins:
Compare the counts for Brands A and D. Make sure you consider where the data are located and the degree of variation. (You may have already determined the Five-Number Summary in Session 4.)
Brand B raisins come in boxes of the same weight as Brands A, C, and D. Here are the ordered counts for 27 boxes of Brand B raisins:
Compare the counts for three brands: A, B, and C. Make sure you consider where the data are located and the degree of variation. (You may have already determined the Five-Number Summary in Session 4.)
Most people do not have difficulty comparing a single number with a single number, for example, noting that the median of one set of counts is greater than the median of another, or comparing one upper quartile with another. Some people, though, may have difficulty in comparing the distribution of one set of counts with the distribution of another.
To compare one Five-Number Summary with another in the proper way requires a composite comparison of five numbers to five numbers; you must think beyond single-number comparisons. The box plots help to clarify this comparison, especially the variation within a group as indicated by the range and the interquartile range.
The raisin studies are observational because they observe the objects (raisins) as they are. There is no treatment deliberately imposed on any group of raisins, so there is no “cause and effect” to study.
Here are the Five-Number Summaries and box plots for each brand:
The box plots indicate that the two sets of counts, for Brands A and D, are very similar. The location indicators are all about the same: The Mins and Q1s are exactly the same, and the Meds, Q3s, and Maxes differ by 0.5, 1, and 1, respectively, which are not large differences relative to the magnitudes of the numbers we are comparing.
The degree of variation is similar for the two brands. The ranges for Brands A and D are 16 and 15, respectively, and the IQRs are 5 and 6.
Here are the Five-Number Summaries for each brand:
The Five-Number Summaries for Brands A, B, and C suggest that Brand B has the fewest raisins in general. It has the smallest median (26), the smallest minimum (17), and the smallest maximum (30). Brand C has the least total variation and the highest minimum (25). Brand A has the most raisins in general, having the largest median (29.5) and by far the largest maximum (39); it also has the greatest variation.