Learning Math: Data Analysis, Statistics, and Probability
Bivariate Data and Analysis Part D: Fitting Lines to Data (60 minutes)
In This Part: Trend Lines
In Parts A and B, you confirmed that there is a strong positive association between height and arm span — short people tend to have short arms, and tall people tend to have long arms. In Part C, you investigated the nature of the relationship between height and arm span by graphing the line Height = Arm Span on a scatter plot of collected data. In Part D, using the same data you’ve been working with, you will investigate the use of other lines as potential models for describing the relationship between height and arm span, and you will explore various criteria for selecting the best line.
Again, here is the scatter plot of the 24 people’s data:
Describe the trend in the data points — in other words, how would you describe the general positioning of the points in the scatter plot? What does this trend tell you about the relationship between height and arm span?
Now let’s take another look at the scatter plot with the line Height = Arm Span graphed:
a. Does this line generally provide an accurate description of the trend in the scatter plot?
b. Do you think there might be a better line for describing this trend?
Let’s consider two other lines for describing the relationship between Height and Arm Span:
Height = Arm Span + 1
Height = Arm Span – 1
The following scatter plot includes the graphs of all three lines:
Based on a visual inspection, which of these three lines does the best job of describing the trend in the data points? Explain why you chose this line.
In This Part: Error
You should have decided in Problem D3 that two of the three lines are better candidates for describing the trend in the data points. The line Height = Arm Span has nine points that are above the line, three that are on the line, and 12 that are below the line. The line Height = Arm Span – 1 has 12 points that are above the line, four that are on the line, and eight that are below the line.
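A tally like the one above can be reproduced with a short script. The following is a minimal sketch in Python, not part of the original session: the function name `count_positions` and the six sample points are hypothetical, chosen only for illustration, and are not the 24 measurements from the study.

```python
def count_positions(points, intercept=0):
    """Count points above, on, and below the line Height = Arm Span + intercept."""
    above = on = below = 0
    for arm_span, height in points:
        predicted = arm_span + intercept  # height on the line for this arm span
        if height > predicted:
            above += 1
        elif height == predicted:
            on += 1
        else:
            below += 1
    return above, on, below


# Hypothetical (arm span, height) pairs in cm; NOT the study data.
sample = [(160, 158), (170, 167), (188, 189), (177, 176), (165, 165), (181, 179)]

print(count_positions(sample, intercept=0))   # line Height = Arm Span
print(count_positions(sample, intercept=-1))  # line Height = Arm Span - 1
```

A line that splits the points roughly evenly is a plausible candidate, but the counts alone ignore how far each point is from the line.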
So which of these lines is “better” at describing the relationship? While personal judgement is useful, statisticians prefer to use more objective methods. To develop criteria for identifying the “better” line, we’ll use a concept developed in Part C: the vertical distance from a point to a line.
Person 11, whose arm span is 173 cm and whose height is 185 cm, is represented by the point (173, 185) in the scatter plot. If you were to use the line Height = Arm Span to predict Person 11’s height from his or her arm span, the predicted value would be represented by the point (173, 173), which lies on that line. The scatter plot thus far looks like this:
The difference between the actual observed height (Y) and the corresponding hypothetical, predicted height (on the line) is called the error. If we use YL (Y on the line) to designate the Y coordinate that represents the predicted height, then we can calculate the error as follows:
Error = Y – YL
In other words, Error = Actual Observed Height – Predicted Height (on the line).
Finally, the vertical distance between an observed height and a predicted height can be expressed as:
Distance = |Y – YL| = |Error|
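As a quick check of these definitions, here is a minimal Python sketch applied to Person 11, the point (173, 185) described above. The function names and the `intercept` parameter are assumptions introduced for illustration, and the second point, (170, 165), is hypothetical.

```python
def predicted_height(arm_span, intercept=0):
    """YL: the height on the line Height = Arm Span + intercept."""
    return arm_span + intercept


def error(arm_span, height, intercept=0):
    """Error = Y - YL (positive when the point lies above the line)."""
    return height - predicted_height(arm_span, intercept)


def distance(arm_span, height, intercept=0):
    """Vertical distance = |Y - YL| = |Error|."""
    return abs(error(arm_span, height, intercept))


# Person 11 from the text: arm span 173 cm, height 185 cm.
print(error(173, 185))     # 12
print(distance(173, 185))  # 12

# Hypothetical point below the line: the error is negative, the distance is not.
print(error(170, 165))     # -5
print(distance(170, 165))  # 5
```

The sign of the error tells you whether the line underpredicts (positive) or overpredicts (negative) the height; the distance drops that sign.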
Let’s see how this works for the line Height = Arm Span (i.e., YL = X).
The following table shows the arm span (X), the actual observed height (Y), the predicted height based on the line Height = Arm Span (i.e., YL = X), the error, and the vertical distance between the person’s observed height (Y) and predicted height (YL) for Persons 1 through 6 in our study:
Fathom Software, used by the participants in the video segments, is helpful in creating graphical representations of data. If you try the problems in Part D using Fathom, you will be able to test various slopes and intercepts. For more information on Fathom, go to the Key Curriculum Press Web site at www.keypress.com/fathom/.
More advanced presentations of this topic use such ideas as the standard deviation around the regression line and the coefficient R-Squared. The data in this session has been structured so that using the sum of squares for comparison gives a reasonable result.
Overall, there is an upward trend; that is, the points generally go up and to the right. This corresponds to the positive association between height and arm span.
a. The line does a reasonably good job. Some points are above the line, some are below it, and some are on the line, but all are generally pretty close.
b. It looks like it may be possible for another line to be, overall, “closer” to these points.
Answers will vary. The lines Height = Arm Span and Height = Arm Span – 1 each seem to do a good job of dividing the points fairly evenly above and below the line, and of matching the overall trend of the data. It is difficult to distinguish between them without a more mathematical test. Each is clearly better than Height = Arm Span + 1, which lies above a majority of the points.
Here is the completed table:
The sum of squared errors (SSE) is 49 + 16 + … + 169 = 772. Since this is less than the sum of squared errors for the line Height = Arm Span (which was 784), the line Height = Arm Span – 1 is a slightly better fit.
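The same comparison can be sketched in Python. This is an illustration under stated assumptions, not the session's own computation: the function name `sse` and the six data points are hypothetical, so the totals differ from the 772 and 784 reported for the 24 people in the study.

```python
def sse(points, intercept=0, slope=1):
    """Sum of squared errors for the line YL = slope * X + intercept."""
    total = 0
    for x, y in points:
        predicted = slope * x + intercept
        total += (y - predicted) ** 2  # squared error for this point
    return total


# Hypothetical (arm span, height) pairs in cm; NOT the study data.
sample = [(160, 158), (170, 167), (188, 189), (177, 176), (165, 165), (181, 179)]

for b in (0, -1, 1):
    print(f"YL = X + ({b}): SSE = {sse(sample, intercept=b)}")
# Prints SSE = 19, then 11, then 39
```

For this made-up sample, the line YL = X – 1 has the smallest SSE of the three, and the line with the smaller SSE is judged the better fit.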
a. The best model is YL = X – .7, because it has the smallest SSE. The worst model is YL = X + 1, because it has the largest SSE.
b. All of these lines have the same slope. If we also changed the slope, we might be able to reduce the SSE further.
c. No, we cannot reduce the SSE to zero unless all the data points lie on a straight line, which these 24 points clearly do not do.
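Answers (b) and (c) can be illustrated with the standard least squares formulas, which choose both slope and intercept to minimize the SSE. The sketch below is an assumption-laden illustration rather than the session's method: the function names and the sample points are hypothetical, and the sample is not the study data.

```python
def sse(points, slope, intercept):
    """Sum of squared errors for the line YL = slope * X + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


def least_squares_line(points):
    """Return (slope, intercept) minimizing the SSE, using the closed-form formulas."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all X values are equal; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (arm span, height) pairs in cm; NOT the study data.
sample = [(160, 158), (170, 167), (188, 189), (177, 176), (165, 165), (181, 179)]

slope, intercept = least_squares_line(sample)
print(sse(sample, 1, -1))             # a slope-1 line
print(sse(sample, slope, intercept))  # no larger than any other line's SSE

# Points that lie exactly on a line give an SSE of zero (up to rounding).
collinear = [(1, 3), (2, 5), (3, 7)]
print(sse(collinear, *least_squares_line(collinear)))
```

Letting the slope vary can only lower the SSE or leave it unchanged relative to the best slope-1 line, and the SSE reaches zero only when every point lies on a single line.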