## Learning Math: Data Analysis, Statistics, and Probability

# Bivariate Data and Analysis Part D: Fitting Lines to Data (60 minutes)

**In This Part: Trend Lines**

In Parts A and B, you confirmed that there is a strong positive association between height and arm span: short people tend to have short arms, and tall people tend to have long arms. In Part C, you investigated the nature of the relationship between height and arm span by graphing the line Height = Arm Span on a scatter plot of collected data. In Part D, using the same data you’ve been working with, you will investigate the use of other lines as potential models for describing the relationship between height and arm span, and you will explore various criteria for selecting the best line.

Again, here is the scatter plot of the 24 people’s data:

**Problem D1**

Describe the trend in the data points. In other words, how would you describe the general positioning of the points in the scatter plot? What does this trend tell you about the relationship between height and arm span?

Now let’s take another look at the scatter plot with the line Height = Arm Span graphed:

**Problem D2**

**a.** Does this line generally provide an accurate description of the trend in the scatter plot?

**b.** Do you think there might be a better line for describing this trend?

Let’s consider two other lines for describing the relationship between Height and Arm Span:

Height = Arm Span + 1

Height = Arm Span – 1

The following scatter plot includes the graphs of all three lines:

**Problem D3**

Based on a visual inspection, which of these three lines does the best job of describing the trend in the data points? Explain why you chose this line.

**In This Part: Error**

You should have decided in Problem D3 that two of the three lines are better candidates for describing the trend in the data points. The line Height = Arm Span has nine points that are above the line, three that are on the line, and 12 that are below the line. The line Height = Arm Span – 1 has 12 points that are above the line, four that are on the line, and eight that are below the line.
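Counts like these come from comparing each observed height with the height the line predicts for that arm span. Here is a minimal Python sketch of that tally. The function name `count_positions` and the four sample points are hypothetical and are not the 24 people's data, so the counts it produces will not match the ones quoted above.

```python
def count_positions(points, intercept=0):
    """Count points above, on, and below the line Height = Arm Span + intercept."""
    above = on = below = 0
    for arm_span, height in points:
        predicted = arm_span + intercept  # height the line predicts
        if height > predicted:
            above += 1
        elif height == predicted:
            on += 1
        else:
            below += 1
    return above, on, below


# Hypothetical (arm span, height) pairs in cm, for illustration only.
sample = [(160, 162), (165, 165), (170, 168), (180, 179)]

print(count_positions(sample))                # line Height = Arm Span
print(count_positions(sample, intercept=-1))  # line Height = Arm Span - 1
```

A line that splits the points fairly evenly is a reasonable candidate, but as the two lines above show, counting alone may not separate two good candidates.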

So which of these lines is “better” at describing the relationship? While personal judgement is useful, statisticians prefer to use more objective methods. To develop criteria for identifying the “better” line, we’ll use a concept developed in Part C: the vertical distance from a point to a line.

Person 11, whose arm span is 173 cm and whose height is 185 cm, is represented by the point (173, 185) in the scatter plot. If you were to use the line Height = Arm Span to predict Person 11’s height based on his or her arm span, the predicted value would be represented by the point (173, 173), which lies on that line. The scatter plot thus far looks like this:

The difference between the actual observed height (Y) and the corresponding hypothetical, predicted height (on the line) is called the error. If we use YL (Y on the line) to designate the Y coordinate that represents the predicted height, then we can calculate the error as follows:

Error = Y – YL

In other words, Error = Actual Observed Height – Predicted Height (on the line).

Finally, the vertical distance between an observed height and a predicted height can be expressed as:

Distance = |Y – YL| = |Error|
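As a quick check of these two definitions, the short Python sketch below computes the error and the vertical distance for Person 11, using the values given above (arm span 173 cm, height 185 cm). The function names and the `intercept` parameter are illustrative assumptions; `intercept=0` corresponds to the line Height = Arm Span.

```python
def predicted_height(arm_span, intercept=0):
    """YL: the height predicted by the line Height = Arm Span + intercept."""
    return arm_span + intercept


def error(arm_span, height, intercept=0):
    """Error = Y - YL (actual observed height minus predicted height)."""
    return height - predicted_height(arm_span, intercept)


# Person 11 from the text: arm span 173 cm, height 185 cm.
x, y = 173, 185
err = error(x, y)   # 185 - 173
dist = abs(err)     # Distance = |Error|
print(err, dist)    # 12 12
```

The error keeps its sign (positive when the observed height is above the line, negative when it is below), while the distance is never negative.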

Let’s see how this works for the line Height = Arm Span (i.e., YL = X).

The following table shows the arm span (X), the actual observed height (Y), the predicted height based on the line Height = Arm Span (i.e., YL = X), the error, and the vertical distance between the person’s observed height (Y) and predicted height (YL) for Persons 1 through 6 in our study:

### Notes

**Note 3**

Fathom Software, used by the participants in the video segments, is helpful in creating graphical representations of data. If you try the problems in Part D using Fathom, you will be able to test various slopes and intercepts. For more information on Fathom, go to the Key Curriculum Press Web site at www.keypress.com/fathom/.

**Note 4**

More advanced presentations of this topic use such ideas as the standard deviation around the regression line and the coefficient R-Squared. The data in this session has been structured so that using the sum of squares for comparison gives a reasonable result.

### Solutions

**Problem D1**

Overall, there is an upward trend; that is, the points generally go up and to the right. This corresponds to the positive association between height and arm span.

**Problem D2**

**a.** The line does a reasonably good job. Some points are above the line, some are below it, and some are on the line, but all are generally pretty close.

**b.** It looks like it may be possible for another line to be, overall, “closer” to these points.

**Problem D3**

Answers will vary. The lines Height = Arm Span and Height = Arm Span – 1 each seem to do a good job of dividing the points fairly evenly above and below the line, and matching the overall trend of data. It is difficult to distinguish between them without a more mathematical test. Each is clearly better than Height = Arm Span + 1, which lies above a majority of the points.

**Problem D4**

**Problem D5**

Here is the completed table:

**Problem D6**

**Problem D7**

The sum of squared errors (SSE) is 49 + 16 + … + 169 = 772. Since this is less than the sum of squared errors for the line Height = Arm Span (which was 784), the line Height = Arm Span – 1 is a slightly better fit.
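The comparison above relies on squaring each error and adding the results. The Python sketch below shows that calculation for any line of the form YL = X + intercept. The function name `sse` and the sample points are hypothetical; reproducing the totals 772 and 784 would require the full 24-person table, which is not used here.

```python
def sse(points, intercept=0):
    """Sum of squared errors for the line YL = X + intercept."""
    total = 0
    for arm_span, height in points:
        err = height - (arm_span + intercept)  # Error = Y - YL
        total += err ** 2
    return total


# Hypothetical (arm span, height) pairs in cm, for illustration only.
sample = [(160, 162), (165, 165), (170, 168), (180, 179)]

print(sse(sample))                # SSE for Height = Arm Span
print(sse(sample, intercept=-1))  # SSE for Height = Arm Span - 1
```

Whichever line gives the smaller total is the better fit by this criterion. Squaring means a few large errors count for more than many small ones.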

**Problem D8**

**a.** The best model is YL = X – 0.7, because it has the smallest SSE. The worst model is YL = X + 1, because it has the largest SSE.

**b.** As all of these lines have the same slope, if we changed the slope, we might find ways to reduce the SSE.

**c.** No, we cannot reduce the SSE to zero unless all the data points lie on a straight line, which these 24 points clearly do not do.
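To explore what happens when the slope is allowed to change, the Python sketch below uses the standard least-squares formulas to find the slope and intercept that minimize the SSE for a set of points. This is one common approach and is offered as an assumption, not as the method used in this session. The function names and the sample points are hypothetical and are not the study's data.

```python
def sse(points, slope, intercept):
    """Sum of squared errors for the line YL = slope * X + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


def least_squares_line(points):
    """Slope and intercept that minimize the SSE (standard closed-form formulas)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (arm span, height) pairs in cm, for illustration only.
sample = [(160, 162), (165, 165), (170, 168), (180, 179)]

slope, intercept = least_squares_line(sample)
print(sse(sample, 1, 0))               # SSE for Height = Arm Span
print(sse(sample, slope, intercept))   # SSE for the fitted line (no larger)

# Points that lie exactly on one straight line give an SSE of zero.
collinear = [(1, 3), (2, 5), (3, 7)]
print(sse(collinear, *least_squares_line(collinear)))
```

For the sample points the fitted line has a smaller SSE than Height = Arm Span, but not zero, since the points do not all lie on one line.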