Learning Math: Data Analysis, Statistics, and Probability
Describing Distributions Part B: Histograms (30 minutes)
In This Part: Constructing a Histogram
Like the line plot we explored in Session 2, the stem and leaf plot is a useful device for illustrating variation in data for small data sets (up to about 100 values). For larger data sets, though, the stem and leaf plot is not a practical way to organize data. Instead, you might want to use a histogram.
Let’s start with the stem and leaf plot for a new data set: 52 estimates collected in answer to the question “How long is a minute?”:
If the stem and leaf plot is rotated 90° counterclockwise, it looks like this:
To create a histogram for this data, first replace each “leaf” (second digit) with a dot:
While a histogram is similar to a line plot, the numbers along the horizontal axis mean something different. In a line plot, each number represents a single data value. In the plot above, the numbers across the bottom are the stems from the original stem and leaf plot, so each number represents an entire interval of values.
For instance, the “3” denotes the stem for all values in the 30s — that is, the interval (range) of values from 30 up to (but not including) 40. For the purposes of a histogram, it is useful to label this interval “30 to less than 40” (30 to < 40) to remind us that 30 is included but 40 is not.
The “4” denotes the stem for all values in the 40s — that is, the interval (range) of values from 40 up to (but not including) 50. Again, it is helpful to label this interval “40 to < 50” to remind us that 40 is included but 50 is not.
If we re-label the horizontal axis to show these intervals (groups) of data, the graph below is produced. Again, this graph is similar to a line plot except that the horizontal axis indicates intervals of data values instead of individual data values:
A grouped frequency table can be determined from this display in the following manner:
- There are four dots over the first group in the interval 30 to < 40. This group has frequency 4.
- There is one dot over the second group in the interval 40 to < 50. This group has frequency 1.
If we continue this process for the other groups, we produce the following grouped frequency table:
Remember that this table describes ranges of data values rather than specific data values. For instance, we can see that there are seven responses in the interval 70 to < 80, but we have no idea what the actual values are for those responses.
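The counting procedure above can be sketched in Python. This is only an illustrative sketch, not part of the course material: the function name `grouped_frequency` is hypothetical, and the values in `sample` are made up for illustration rather than taken from the 52 estimates.

```python
from collections import Counter

def grouped_frequency(values, width=10):
    """Count how many values fall in each interval [start, start + width)."""
    # Integer division finds the lower bound of the interval holding each value,
    # e.g. 37 // 10 * 10 == 30, so 37 lands in "30 to < 40".
    counts = Counter((v // width) * width for v in values)
    lowest, highest = min(counts), max(counts)
    # Walk every interval from lowest to highest so empty groups show up as 0.
    return {
        f"{start} to < {start + width}": counts.get(start, 0)
        for start in range(lowest, highest + width, width)
    }

# Hypothetical estimates (in seconds), made up for illustration only.
sample = [31, 38, 45, 52, 55, 58, 60, 60, 63, 67, 71, 84]

for interval, frequency in grouped_frequency(sample).items():
    print(f"{interval}: {frequency}")
```

Because each interval includes its lower bound but not its upper bound, a value of exactly 40 is counted in "40 to < 50", never in "30 to < 40".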
In This Part: Completing the Histogram
You are now ready to complete the histogram based on the line plot you created:
- Draw a rectangle over each interval on the horizontal axis with a height corresponding to the frequency of that interval:
(Note that the frequency of each interval is still indicated by the number of dots within each rectangle.)
- Remove the dots, shade the rectangles, and add a vertical scale to indicate the frequency of each interval on the horizontal scale:
You have just created a frequency histogram!
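As a rough illustration of the same idea, the sketch below turns a grouped frequency table into a plain-text histogram. It is an assumption-laden sketch: the function name `text_histogram` and the frequencies in `table` are hypothetical, and the bars run horizontally (one row per interval) rather than vertically as in the drawn histogram.

```python
def text_histogram(frequencies):
    """Return one row per interval: label, a bar of '#' marks, and the count."""
    # Pad every label to the same width so the bars line up.
    label_width = max(len(label) for label in frequencies)
    rows = []
    for label, count in frequencies.items():
        # The bar length plays the role of the rectangle's height.
        rows.append(f"{label.ljust(label_width)} | {'#' * count} ({count})")
    return rows

# Hypothetical grouped frequency table, made up for illustration only.
table = {"30 to < 40": 2, "40 to < 50": 1, "50 to < 60": 3, "60 to < 70": 4}

for row in text_histogram(table):
    print(row)
```

As with the drawn histogram, only the count in each interval is visible; the individual data values are not recoverable from the bars.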
What advantages does a histogram have over a stem and leaf plot? What are the disadvantages of a histogram?
Just as with groupings in a stem and leaf plot, you can change the size of the intervals in a histogram depending on the situation. Your goal is to organize your representation so that you can present the data in the most meaningful way.
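To see how the choice of interval size changes the picture, the sketch below groups one data set three ways. The helper name `group_by_width` and the values in `sample` are hypothetical and made up for illustration; they are not the course's data.

```python
from collections import Counter

def group_by_width(values, width):
    """Map each interval's lower bound to the number of values in it."""
    counts = Counter((v // width) * width for v in values)
    # Sort by lower bound so the groups read left to right like a histogram.
    return dict(sorted(counts.items()))

# Hypothetical estimates (in seconds), made up for illustration only.
sample = [31, 38, 45, 52, 55, 58, 60, 60, 63, 67, 71, 84]

for width in (5, 10, 20):
    print(f"width {width}: {group_by_width(sample, width)}")
```

Narrow intervals keep more detail but spread the data thinly across many groups, while wide intervals give a smoother shape at the cost of hiding where values sit inside each group.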
In This Part: Interpreting a Histogram
The histogram and grouped frequency table you just created offer different ways to present your data (the time estimates) and provide different ways to answer our original question, “How well do people judge when a minute has elapsed?”
Using only the histogram and grouped frequency table, give two descriptive statements that provide an answer to this question. (Since the goal is to estimate when a minute has elapsed, it would make sense to again consider how close the estimates are to the correct response, which is 60 seconds.)
a. According to the histogram and grouped frequency table, how many people’s estimates were outside the interval from 50 to less than 70 seconds? That is, how many estimates were less than 50 seconds or 70 seconds or more?
b. How many estimates were within the interval from 50 to less than 70 seconds?
c. How many estimates were outside the interval from 40 to less than 80 seconds?
d. In Problem A5, only nine people’s estimates were more than 10 seconds away from one minute. Does your answer to question (a) of this problem imply that the people in this group were not as good at estimating a minute’s time? If so, why? If not, how could you make a fairer comparison between the two sets?
The second data set comes from a group of 52 time estimates. How many were in the first group?
You will take an evolutionary approach to developing the histogram. The objective of this activity is for you to see the relationship between the line plot, the stem and leaf plot, and the histogram. The line plot shows the frequencies of the rows, but not the actual data values. The stem and leaf plot contains more detailed information than the histogram in that all of the data values are shown. And finally, the relative frequency histogram shows the relative sizes of the frequencies for each interval, although it does not explicitly show those frequencies.
A histogram offers a better graphical perspective on an entire data set. One disadvantage is that the actual data values cannot be determined from a histogram, only the number of values within intervals.
There are many descriptive statements that could provide an answer to this question. Here are some things you may have noted:
- All estimates are between 30 seconds and 100 seconds. The range is 70 seconds, which indicates a lot of variation in the estimates.
- There is a concentration of estimates between 50 seconds and 70 seconds. Thirty-five of the 52 estimates (or 35 / 52 = 67.3%) fall within this interval. The range of this interval is only 20 seconds.
- You may have noticed that because the histogram does not indicate individual pieces of data, we cannot look for a single number that represents the data.
a. There are five estimates below 50 seconds and 12 estimates of 70 seconds or higher. In total, 17 of 52 estimates were outside the interval from 50 to less than 70 seconds.
b. Since 17 estimates were outside this interval, the remaining 35 of 52 estimates were within the interval.
c. There are four estimates below 40 seconds and five estimates of 80 seconds or higher. In total, nine of 52 estimates were outside this interval.
d. No. The answer to question (a) suggests that this group was roughly in line with the original group, because the original group had only 26 responses. The proportion for this group, 17 / 52 = 32.7%, is only slightly better than the proportion for the original group, which was 9 / 26 = 34.6%. Effective comparisons between groups of different sizes must be relative comparisons.
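The arithmetic in answers (a) through (d) can be checked with a short sketch. The counts come from the solutions above; the 50s and 60s are kept as one combined group because only their total (35) is stated, and the helper name `count_outside` and the upper bound of the last group are assumptions made for illustration.

```python
# Grouped counts taken from the solutions above, keyed by (start, end) with
# start included and end excluded.
counts = {
    (30, 40): 4,
    (40, 50): 1,
    (50, 70): 35,   # 50s and 60s combined; the text gives only their total
    (70, 80): 7,
    (80, 101): 5,   # assumed upper bound: all estimates are at most 100 seconds
}

def count_outside(counts, low, high):
    """Count estimates in groups lying entirely below `low` or at/above `high`.

    Only meaningful when `low` and `high` coincide with group boundaries.
    """
    return sum(n for (start, end), n in counts.items() if end <= low or start >= high)

total = sum(counts.values())
outside_50_70 = count_outside(counts, 50, 70)   # question (a)
inside_50_70 = total - outside_50_70            # question (b)
outside_40_80 = count_outside(counts, 40, 80)   # question (c)

# Question (d): compare proportions, not raw counts, because the groups differ in size.
this_group = outside_50_70 / total
original_group = 9 / 26

print(total, outside_50_70, inside_50_70, outside_40_80)
print(f"{this_group:.1%} vs {original_group:.1%}")
```

Comparing 17 with 9 directly would be misleading; dividing each by its group size puts the two groups on the same scale.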