1. Write one assumption.

2. Write another assumption.

3. Write a third assumption.

4. Write a fourth assumption.

5. Write the final assumption.

6. State the null hypothesis for a one-way ANOVA test if there are four groups.

7. State the alternative hypothesis for a one-way ANOVA test if there are three groups.

8. When do you use an ANOVA test?

9. Three different traffic routes are tested for mean driving time. The entries in the table are the driving times in minutes on the three different routes. The one-way ANOVA results are shown in Table.

10. State *SS*_{between}, *SS*_{within}, and the *F* statistic.

Route 1 | Route 2 | Route 3 |
---|---|---|

30 | 27 | 16 |

32 | 29 | 41 |

27 | 28 | 22 |

35 | 36 | 31 |

11. Suppose a group is interested in determining whether teenagers obtain their driver's licenses at approximately the same average age across the country. Suppose that the following data are randomly collected from five teenagers in each region of the country. The numbers represent the age at which teenagers obtained their driver's licenses.

 | Northeast | South | West | Central | East |
---|---|---|---|---|---|

 | 16.3 | 16.9 | 16.4 | 16.2 | 17.1 |

 | 16.1 | 16.5 | 16.5 | 16.6 | 17.2 |

 | 16.4 | 16.4 | 16.6 | 16.5 | 16.6 |

 | 16.5 | 16.2 | 16.1 | 16.4 | 16.8 |

[latex]\overline{x}[/latex] = | ________ | ________ | ________ | ________ | ________ |

[latex]{s}^{2}[/latex] = | ________ | ________ | ________ | ________ | ________ |

12. State the hypotheses.

*H*_{0}: ____________

*Use the following information to answer the next eight exercises.* Groups of men from three different areas of the country are to be tested for mean weight. The entries in the table are the weights for the different groups. The one-way ANOVA results are shown in Table.

Group 1 | Group 2 | Group 3 |
---|---|---|

216 | 202 | 170 |

198 | 213 | 165 |

240 | 284 | 182 |

187 | 228 | 197 |

176 | 210 | 201 |

13. What is the Sum of Squares Factor?

14. What is the Sum of Squares Error?

15. What is the *df* for the numerator?

16. What is the *df* for the denominator?

17. What is the Mean Square Factor?

18. What is the Mean Square Error?

19. What is the *F* statistic?

Team 1 | Team 2 | Team 3 | Team 4 |
---|---|---|---|

1 | 2 | 0 | 3 |

2 | 3 | 1 | 4 |

0 | 2 | 1 | 4 |

3 | 4 | 0 | 3 |

2 | 4 | 0 | 2 |

20. What is *SS*_{between}?

21. What is the *df* for the numerator?

22. What is *MS*_{between}?

23. What is *SS*_{within}?

24. What is the *df* for the denominator?

25. What is *MS*_{within}?

26. What is the *F* statistic?

27. Judging by the *F* statistic, do you think it is likely or unlikely that you will reject the null hypothesis?

 | Northeast | South | West | Central | East |
---|---|---|---|---|---|

 | 16.3 | 16.9 | 16.4 | 16.2 | 17.1 |

 | 16.1 | 16.5 | 16.5 | 16.6 | 17.2 |

 | 16.4 | 16.4 | 16.6 | 16.5 | 16.6 |

 | 16.5 | 16.2 | 16.1 | 16.4 | 16.8 |

[latex]\overline{x}[/latex] = | ________ | ________ | ________ | ________ | ________ |

[latex]{s}^{2}[/latex] = | ________ | ________ | ________ | ________ | ________ |

29. *H*_{a}: At least two of the group means *µ*_{1}, *µ*_{2}, …, *µ*_{5} are not equal.

30. degrees of freedom – numerator: *df*(*num*) = _________

31. degrees of freedom – denominator: *df*(*denom*) = ________

32. *F* statistic = ________

33. An *F* statistic can have what values?

34. What happens to the curves as the degrees of freedom for the numerator and the denominator get larger?

Team 1 | Team 2 | Team 3 | Team 4 | Team 5 |
---|---|---|---|---|

36 | 32 | 48 | 38 | 41 |

42 | 35 | 50 | 44 | 39 |

51 | 38 | 39 | 46 | 40 |

35. What is the *df(num)*?

36. What is the *df(denom)*?

37. What are the Sum of Squares and Mean Squares Factors?

38. What are the Sum of Squares and Mean Squares Errors?

39. What is the *F* statistic?

40. What is the *p*-value?

41. At the 5% significance level, is there a difference in the mean jump heights among the teams?

Group A | Group B | Group C |
---|---|---|

101 | 151 | 101 |

108 | 149 | 109 |

98 | 160 | 198 |

107 | 112 | 186 |

111 | 126 | 160 |

42. What is the *df(num)*?

43. What is the *df(denom)*?

44. What are the *SS*_{between} and *MS*_{between}?

45. What are the *SS*_{within} and *MS*_{within}?

46. What is the *F* Statistic?

47. What is the *p*-value?

48. At the 10% significance level, are the scores among the different groups different?

 | Northeast | South | West | Central | East |
---|---|---|---|---|---|

 | 16.3 | 16.9 | 16.4 | 16.2 | 17.1 |

 | 16.1 | 16.5 | 16.5 | 16.6 | 17.2 |

 | 16.4 | 16.4 | 16.6 | 16.5 | 16.6 |

 | 16.5 | 16.2 | 16.1 | 16.4 | 16.8 |

[latex]\overline{x}[/latex] = | ________ | ________ | ________ | ________ | ________ |

[latex]{s}^{2}[/latex] = | ________ | ________ | ________ | ________ | ________ |

b. Conclusion: ____________________________

a. Decision: ____________________________

b. Conclusion: ____________________________

DIRECTIONS

Use a solution sheet to conduct the following hypothesis tests. The solution sheet can be found in Appendix E.

53. Three students, Linda, Tuan, and Javier, are given five laboratory rats each for a nutritional experiment. Each rat's weight is recorded in grams. Linda feeds her rats Formula A, Tuan feeds his rats Formula B, and Javier feeds his rats Formula C. At the end of a specified time period, each rat is weighed again, and the net gain in grams is recorded. Using a significance level of 10%, test the hypothesis that the three formulas produce the same mean weight gain.

Linda's rats | Tuan's rats | Javier's rats |
---|---|---|

43.5 | 47.0 | 51.2 |

39.4 | 40.5 | 40.9 |

41.3 | 38.9 | 37.9 |

46.0 | 46.3 | 45.0 |

38.2 | 44.2 | 48.6 |

54. A grassroots group opposed to a proposed increase in the gas tax claimed that the increase would hurt working-class people the most, since they commute the farthest to work. Suppose that the group randomly surveyed 24 individuals and asked them their daily one-way commuting mileage. The results are in Table. Using a 5% significance level, test the hypothesis that the three mean commuting mileages are the same.

working-class | professional (middle incomes) | professional (wealthy) |
---|---|---|

17.8 | 16.5 | 8.5 |

26.7 | 17.4 | 6.3 |

49.4 | 22.0 | 4.6 |

9.4 | 7.4 | 12.6 |

65.4 | 9.4 | 11.0 |

47.1 | 2.1 | 28.6 |

19.5 | 6.4 | 15.4 |

51.2 | 13.9 | 9.3 |

55. Examine the seven practice laps from Appendix C. Determine whether the mean lap time is statistically the same for the seven practice laps, or if there is at least one lap that has a different mean time from the others.

home decorating | news | health | computer |
---|---|---|---|

172 | 87 | 82 | 104 |

286 | 94 | 153 | 136 |

163 | 123 | 87 | 98 |

205 | 106 | 103 | 207 |

197 | 101 | 96 | 146 |

56. Using a significance level of 5%, test the hypothesis that the four magazine types have the same mean length.

57. Eliminate one magazine type that you now feel has a mean length different from the others. Redo the hypothesis test, testing that the remaining three means are statistically the same. Use a new solution sheet. Based on this test, are the mean lengths for the remaining three magazines statistically the same?

58. A researcher wants to know if the mean times (in minutes) that people watch their favorite news station are the same. Suppose that Table shows the results of a study.

CNN | FOX | Local |
---|---|---|

45 | 15 | 72 |

12 | 43 | 37 |

18 | 68 | 56 |

38 | 50 | 60 |

23 | 31 | 51 |

35 | 22 |

59. Assume that all distributions are normal, the three population standard deviations are approximately the same, and the data were collected independently and randomly. Use a level of significance of 0.05.

60. Are the means for the final exams the same for all statistics class delivery types? Table shows the scores on final exams from several randomly selected classes that used the different delivery types.

Online | Hybrid | Face-to-Face |
---|---|---|

72 | 83 | 80 |

84 | 73 | 78 |

77 | 84 | 84 |

80 | 81 | 81 |

81 | 86 | |

79 | ||

82 |

61. Assume that all distributions are normal, the three population standard deviations are approximately the same, and the data were collected independently and randomly. Use a level of significance of 0.05.

62. Is the mean number of times a month a person eats out the same for whites, blacks, Hispanics, and Asians? Suppose that Table shows the results of a study.

White | Black | Hispanic | Asian |
---|---|---|---|

6 | 4 | 7 | 8 |

8 | 1 | 3 | 3 |

2 | 5 | 5 | 5 |

4 | 2 | 4 | 1 |

6 | 6 | 7 |

63. Assume that all distributions are normal, the four population standard deviations are approximately the same, and the data were collected independently and randomly. Use a level of significance of 0.05.

64. Are the mean numbers of daily visitors to a ski resort the same for the three types of snow conditions? Suppose that Table shows the results of a study.

Powder | Machine Made | Hard Packed |
---|---|---|

1,210 | 2,107 | 2,846 |

1,080 | 1,149 | 1,638 |

1,537 | 862 | 2,019 |

941 | 1,870 | 1,178 |

1,528 | 2,233 | |

1,382 |

65. Assume that all distributions are normal, the three population standard deviations are approximately the same, and the data were collected independently and randomly. Use a level of significance of 0.05.

66. Sanjay made identical paper airplanes out of three different weights of paper, light, medium and heavy. He made four airplanes from each of the weights, and launched them himself across the room. Here are the distances (in meters) that his planes flew.

Paper Type/Trial | Trial 1 | Trial 2 | Trial 3 | Trial 4 |
---|---|---|---|---|

Heavy | 5.1 meters | 3.1 meters | 4.7 meters | 5.3 meters |

Medium | 4 meters | 3.5 meters | 4.5 meters | 6.1 meters |

Light | 3.1 meters | 3.3 meters | 2.1 meters | 1.9 meters |

- Take a look at the data in the graph. Look at the spread of data for each group (light, medium, heavy). Does it seem reasonable to assume a normal distribution with the same variance for each group? Yes or No.
- Why is this a balanced design?
- Calculate the sample mean and sample standard deviation for each group.
- Does the weight of the paper have an effect on how far the plane will travel? Use a 1% level of significance. Complete the test using the method shown in the bean plant example in Example.
- variance of the group means __________
- *MS*_{between} = ____________
- mean of the three sample variances ___________
- *MS*_{within} = ______________
- *F* statistic = ____________
- *df(num)* = __________, *df(denom)* = ___________
- number of groups _______
- number of observations _______
- *p*-value = __________ (*P*(*F* > _______) = __________)
- Graph the *p*-value.
- decision: _______________________
- conclusion: _______________________________________________________________

67. DDT is a pesticide that has been banned from use in the United States and most other areas of the world. It is quite effective, but persisted in the environment and over time became seen as harmful to higher-level organisms. Famously, egg shells of eagles and other raptors were believed to be thinner and prone to breakage in the nest because of ingestion of DDT in the food chain of the birds.

68. An experiment was conducted on the number of eggs (fecundity) laid by female fruit flies. There are three groups of flies. One group was bred to be resistant to DDT (the RS group). Another was bred to be especially susceptible to DDT (SS). Finally there was a control line of non-selected or typical fruitflies (NS). Here are the data:

RS | SS | NS | RS | SS | NS |
---|---|---|---|---|---|

12.8 | 38.4 | 35.4 | 22.4 | 23.1 | 22.6 |

21.6 | 32.9 | 27.4 | 27.5 | 29.4 | 40.4 |

14.8 | 48.5 | 19.3 | 20.3 | 16 | 34.4 |

23.1 | 20.9 | 41.8 | 38.7 | 20.1 | 30.4 |

34.6 | 11.6 | 20.3 | 26.4 | 23.3 | 14.9 |

19.7 | 22.3 | 37.6 | 23.7 | 22.9 | 51.8 |

22.6 | 30.2 | 36.9 | 26.1 | 22.5 | 33.8 |

29.6 | 33.4 | 37.3 | 29.5 | 15.1 | 37.9 |

16.4 | 26.7 | 28.2 | 38.6 | 31 | 29.5 |

20.3 | 39 | 23.4 | 44.4 | 16.9 | 42.4 |

29.3 | 12.8 | 33.7 | 23.2 | 16.1 | 36.6 |

14.9 | 14.6 | 29.2 | 23.6 | 10.8 | 47.4 |

27.3 | 12.2 | 41.7 |

69. The values are the average number of eggs laid daily for each of 75 flies (25 in each group) over the first 14 days of their lives. Using a 1% level of significance, are the mean rates of egg selection for the three strains of fruitfly different? If so, in what way? Specifically, the researchers were interested in whether or not the selectively bred strains were different from the nonselected line, and whether the two selected lines were different from each other.

Here is a chart of the three groups:

70. The data shown is the recorded body temperatures of 130 subjects as estimated from available histograms.

71. Traditionally we are taught that the normal human body temperature is 98.6 F. This is not quite correct for everyone. Are the mean temperatures among the four groups different?

72. Calculate 95% confidence intervals for the mean body temperature in each group and comment about the confidence intervals.

99.1 98.6 99.5 99.1 98.6 99.2 98.7 99.4 99.1 99.9 99.3 100 99.4 100.8

FL | FH | ML | MH | FL | FH | ML | MH |
---|---|---|---|---|---|---|---|

96.4 | 96.8 | 96.3 | 96.9 | 98.4 | 98.6 | 98.1 | 98.6 |

96.7 | 97.7 | 96.7 | 97 | 98.7 | 98.6 | 98.1 | 98.6 |

97.2 | 97.8 | 97.1 | 97.1 | 98.7 | 98.6 | 98.2 | 98.7 |

97.2 | 97.9 | 97.2 | 97.1 | 98.7 | 98.7 | 98.2 | 98.8 |

97.4 | 98 | 97.3 | 97.4 | 98.7 | 98.7 | 98.2 | 98.8 |

97.6 | 98 | 97.4 | 97.5 | 98.8 | 98.8 | 98.2 | 98.8 |

97.7 | 98 | 97.4 | 97.6 | 98.8 | 98.8 | 98.3 | 98.9 |

97.8 | 98 | 97.4 | 97.7 | 98.8 | 98.8 | 98.4 | 99 |

97.8 | 98.1 | 97.5 | 97.8 | 98.8 | 98.9 | 98.4 | 99 |

97.9 | 98.3 | 97.6 | 97.9 | 99.2 | 99 | 98.5 | 99 |

97.9 | 98.3 | 97.6 | 98 | 99.3 | 99 | 98.5 | 99.2 |

98 | 98.3 | 97.8 | 98 | ||||

98.2 | 98.4 | 97.8 | 98 | ||||

98.2 | 98.4 | 97.8 | 98.3 | ||||

98.2 | 98.4 | 97.9 | 98.4 | ||||

98.2 | 98.4 | 98 | 98.4 | ||||

98.2 | 98.5 | 98 | 98.6 | ||||

98.2 | 98.6 | 98 | 98.6 |

73. Name one assumption that must be true.

74. What is the other assumption that must be true?

75. State the null and alternative hypotheses.

76. What is *s*_{1} in this problem?

77. What is *s*_{2} in this problem?

78. What is *n*?

79. What is the *F* statistic?

80. What is the *p*-value?

81. Is the claim accurate?

82. State the null and alternative hypotheses.

83. What is the *F* Statistic?

84. What is the *p*-value?

85. At the 5% significance level, do we reject the null hypothesis?

86. State the null and alternative hypotheses.

87. What is the *F* Statistic?

88. At the 5% significance level, what can we say about the cyclists’ variances?

89. Three students, Linda, Tuan, and Javier, are given five laboratory rats each for a nutritional experiment. Each rat’s weight is recorded in grams. Linda feeds her rats Formula A, Tuan feeds his rats Formula B, and Javier feeds his rats Formula C. At the end of a specified time period, each rat is weighed again and the net gain in grams is recorded.

Linda's rats | Tuan's rats | Javier's rats |
---|---|---|

43.5 | 47.0 | 51.2 |

39.4 | 40.5 | 40.9 |

41.3 | 38.9 | 37.9 |

46.0 | 46.3 | 45.0 |

38.2 | 44.2 | 48.6 |

91. A grassroots group opposed to a proposed increase in the gas tax claimed that the increase would hurt working-class people the most, since they commute the farthest to work. Suppose that the group randomly surveyed 24 individuals and asked them their daily one-way commuting mileage. The results are as follows.

working-class | professional (middle incomes) | professional (wealthy) |
---|---|---|

17.8 | 16.5 | 8.5 |

26.7 | 17.4 | 6.3 |

49.4 | 22.0 | 4.6 |

9.4 | 7.4 | 12.6 |

65.4 | 9.4 | 11.0 |

47.1 | 2.1 | 28.6 |

19.5 | 6.4 | 15.4 |

51.2 | 13.9 | 9.3 |

92. Determine whether or not the variance in mileage driven is statistically the same among the working class and professional (middle income) groups. Use a 5% significance level.

**Refer to the data from Appendix C.**
93. Examine practice laps 3 and 4. Determine whether or not the variance in lap time is statistically the same for those practice laps.

*Use the following information to answer the next two exercises.* The following table lists the number of pages in four different types of magazines.

home decorating | news | health | computer |
---|---|---|---|

172 | 87 | 82 | 104 |

286 | 94 | 153 | 136 |

163 | 123 | 87 | 98 |

205 | 106 | 103 | 207 |

197 | 101 | 96 | 146 |

94. Which two magazine types do you think have the same variance in length?

95. Which two magazine types do you think have different variances in length?

96. Is the variance for the amount of money, in dollars, that shoppers spend on Saturdays at the mall the same as the variance for the amount of money that shoppers spend on Sundays at the mall? Suppose that the Table shows the results of a study.

Saturday | Sunday | Saturday | Sunday |
---|---|---|---|

75 | 44 | 62 | 137 |

18 | 58 | 0 | 82 |

150 | 61 | 124 | 39 |

94 | 19 | 50 | 127 |

62 | 99 | 31 | 141 |

73 | 60 | 118 | 73 |

89 |

97. Are the variances for incomes on the East Coast and the West Coast the same? Suppose that Table shows the results of a study. Income is shown in thousands of dollars. Assume that both distributions are normal. Use a level of significance of 0.05.

East | West |
---|---|

38 | 71 |

47 | 126 |

30 | 42 |

82 | 51 |

75 | 44 |

52 | 90 |

115 | 88 |

67 |

98. Thirty men in college were taught a method of finger tapping. They were randomly assigned to three groups of ten, with each receiving one of three doses of caffeine: 0 mg, 100 mg, 200 mg. This is approximately the amount in no, one, or two cups of coffee. Two hours after ingesting the caffeine, the men had the rate of finger tapping per minute recorded. The experiment was double blind, so neither the recorders nor the students knew which group they were in. Does caffeine affect the rate of tapping, and if so how?

Here are the data:

0 mg | 100 mg | 200 mg | 0 mg | 100 mg | 200 mg |
---|---|---|---|---|---|

242 | 248 | 246 | 245 | 246 | 248 |

244 | 245 | 250 | 248 | 247 | 252 |

247 | 248 | 248 | 248 | 250 | 250 |

242 | 247 | 246 | 244 | 246 | 248 |

246 | 243 | 245 | 242 | 244 | 250 |

99. King Manuel I, Komnenus ruled the Byzantine Empire from Constantinople (Istanbul) during the years 1145 to 1180 A.D. The empire was very powerful during his reign, but declined significantly afterwards. Coins minted during his era were found in Cyprus, an island in the eastern Mediterranean Sea. Nine coins were from his first coinage, seven from the second, four from the third, and seven from a fourth. These spanned most of his reign. We have data on the silver content of the coins:

First Coinage | Second Coinage | Third Coinage | Fourth Coinage |
---|---|---|---|

5.9 | 6.9 | 4.9 | 5.3 |

6.8 | 9.0 | 5.5 | 5.6 |

6.4 | 6.6 | 4.6 | 5.5 |

7.0 | 8.1 | 4.5 | 5.1 |

6.6 | 9.3 | | 6.2 |

7.7 | 9.2 | | 5.8 |

7.2 | 8.6 | | 5.8 |

6.9 | | | |

6.2 | | | |

100. Did the silver content of the coins change over the course of Manuel’s reign?

101. Here are the means and variances of each coinage. The data are unbalanced.

First | Second | Third | Fourth | |
---|---|---|---|---|

Mean | 6.7444 | 8.2429 | 4.875 | 5.6143 |

Variance | 0.2953 | 1.2095 | 0.2025 | 0.1314 |
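
The summary statistics above can be checked, and the one-way ANOVA *F* statistic computed, with a short script. This is only a sketch using Python's standard library. It assumes the loose values 6.2, 5.8, and 5.8 are the last three Fourth Coinage coins, which reproduces the stated mean of 5.6143. The function name `one_way_anova` is illustrative and not part of the exercise.

```python
from statistics import mean, variance

# Silver content by coinage. Assumption: 6.2, 5.8, 5.8 are the last three
# Fourth Coinage coins (this matches the stated mean and variance).
coinage = {
    "First": [5.9, 6.8, 6.4, 7.0, 6.6, 7.7, 7.2, 6.9, 6.2],
    "Second": [6.9, 9.0, 6.6, 8.1, 9.3, 9.2, 8.6],
    "Third": [4.9, 5.5, 4.6, 4.5],
    "Fourth": [5.3, 5.6, 5.5, 5.1, 6.2, 5.8, 5.8],
}


def one_way_anova(groups):
    """Return (F, df_num, df_denom) for a list of samples of any sizes."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares weights each group by its sample size,
    # which is what handles the unbalanced design.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares pools the sample variances.
    ss_within = sum((len(g) - 1) * variance(g) for g in groups)
    df_num, df_denom = k - 1, n - k
    f_stat = (ss_between / df_num) / (ss_within / df_denom)
    return f_stat, df_num, df_denom


for name, values in coinage.items():
    print(name, round(mean(values), 4), round(variance(values), 4))

f_stat, df_num, df_denom = one_way_anova(list(coinage.values()))
print(f"F = {f_stat:.2f} with df = ({df_num}, {df_denom})")
```

The *p*-value would then come from the *F* distribution with these degrees of freedom, using a calculator or statistical software.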

102. The American League and the National League of Major League Baseball are each divided into three divisions: East, Central, and West. Many years, fans talk about some divisions being stronger (having better teams) than other divisions. This may have consequences for the postseason. For instance, in 2012 Tampa Bay won 90 games and did not play in the postseason, while Detroit won only 88 and did play in the postseason. This may have been an oddity, but is there good evidence that in the 2012 season, the American League divisions were significantly different in overall records? Use the following data to test whether the mean number of wins per team in the three American League divisions were the same or not. Note that the data are not balanced, as two divisions had five teams, while one had only four.

Division | Team | Wins |
---|---|---|

East | NY Yankees | 95 |

East | Baltimore | 93 |

East | Tampa Bay | 90 |

East | Toronto | 73 |

East | Boston | 69 |

Division | Team | Wins |
---|---|---|

Central | Detroit | 88 |

Central | Chicago Sox | 85 |

Central | Kansas City | 72 |

Central | Cleveland | 68 |

Central | Minnesota | 66 |

Division | Team | Wins |
---|---|---|

West | Oakland | 94 |

West | Texas | 93 |

West | LA Angels | 89 |

West | Seattle | 75 |

Table 1. Four observations from the mario-kart data set.

 | price | cond_new | stock_photo | duration | wheels |
---|---|---|---|---|---|

1 | 51.55 | 1 | 1 | 3 | 1 |

2 | 37.04 | 0 | 1 | 7 | 1 |

. | . | . | . | . | . |

. | . | . | . | . | . |

. | . | . | . | . | . |

140 | 38.76 | 0 | 0 | 7 | 0 |

141 | 54.51 | 1 | 1 | 1 | 2 |

Table 2. Variables and their descriptions for the mario-kart data set.

Variable | Description |
---|---|

price | final auction price plus shipping costs, in US dollars |

cond_new | a coded two-level categorical variable, which takes value 1 when the game is new and 0 if the game is used |

stock_photo | a coded two-level categorical variable, which takes value 1 if the primary photo used in the auction was a stock photo and 0 if the photo was unique to that auction |

duration | the length of the auction, in days, taking values from 1 to 10 |

wheels | the number of Wii wheels included with the auction (a Wii wheel is a plastic racing wheel that holds the Wii controller and is an optional but helpful accessory for playing Mario Kart) |

[latex]\widehat{\text{price}}=42.87 + 10.90\times\text{cond\_new}[/latex]
Results of this model are shown in Table 3 and a scatterplot for price versus game condition.

Table 3. Summary of a linear model for predicting auction price based on game condition.

 | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|

(Intercept) | 42.8711 | 0.8140 | 52.67 | 0.0000 |

cond new | 10.8996 | 1.2583 | 8.66 | 0.0000 |

df = 139 |

[latex]\begin{array}{rl}\widehat{\text{price}}&={\beta}_{0}+{\beta}_{1}\times\text{cond\_new}+{\beta}_{2}\times\text{stock\_photo}+{\beta}_{3}\times\text{duration}+{\beta}_{4}\times\text{wheels}\\ \hat{y}&={\beta}_{0}+{\beta}_{1}{x}_{1}+{\beta}_{2}{x}_{2}+{\beta}_{3}{x}_{3}+{\beta}_{4}{x}_{4}\end{array}[/latex]
In this equation, *y* represents the total price, *x*_{1} indicates whether the game is new,* x*

[latex]\displaystyle\text{SSE}={e}_{1}^{2}+{e}_{2}^{2}+\dots+{e}_{141}^{2}=\sum_{i = 1}^{141}{\left({e}_{i}\right)}^{2}=\sum_{i = 1}^{141}{\left({y}_{i} - {\hat{y}}_{i}\right)}^{2}[/latex]
Here there are 141 residuals, one for each observation. We typically use a computer to minimize the SSE and compute point estimates, as shown in the sample output in the table below. Using this output, we identify the point estimates *b _{i}* of each

Table 4. Output for the regression model where price is the outcome and cond_new, stock_photo, duration, and wheels are the predictors.

 | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|

(Intercept) | 36.2110 | 1.5140 | 23.92 | 0.0000 |

cond new | 5.1306 | 1.0511 | 4.88 | 0.0000 |

stock photo | 1.0803 | 1.0568 | 1.02 | 0.3085 |

duration | −0.0268 | 0.1904 | −0.14 | 0.8882 |

wheels | 7.2852 | 0.5547 | 13.13 | 0.0000 |

df = 136 |
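
To see how the point estimates in Table 4 are used, the sketch below plugs them into the fitted equation and computes the predicted price and residual for the first auction listed in Table 1 (price 51.55, new game, stock photo, 3-day auction, 1 wheel). The names `COEFFICIENTS` and `predict_price` are illustrative only.

```python
# Point estimates copied from Table 4 (Estimate column).
COEFFICIENTS = {
    "intercept": 36.2110,
    "cond_new": 5.1306,
    "stock_photo": 1.0803,
    "duration": -0.0268,
    "wheels": 7.2852,
}


def predict_price(cond_new, stock_photo, duration, wheels):
    """Predicted auction price: b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4."""
    return (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["cond_new"] * cond_new
        + COEFFICIENTS["stock_photo"] * stock_photo
        + COEFFICIENTS["duration"] * duration
        + COEFFICIENTS["wheels"] * wheels
    )


# First observation in Table 1 of the mario-kart data.
observed_price = 51.55
predicted = predict_price(cond_new=1, stock_photo=1, duration=3, wheels=1)
residual = observed_price - predicted  # e_i = y_i - y_hat_i

print(f"predicted = {predicted:.2f}, residual = {residual:.2f}")
```

Squaring and summing such residuals over all 141 auctions gives the SSE that the software minimizes.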

[latex]\hat{y} ={\beta}_{0} +{\beta}_{1}{x}_{1}+{\beta}_{2}{x}_{2}+\dots+{\beta}_{k}{x}_{k}[/latex]
when there are *k* predictors. We often estimate the [latex]{\beta}_{i}[/latex] parameters using a computer.

Table 1. The fit for the full regression model, including the adjusted R^{2}.

 | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|

(Intercept) | 36.2110 | 1.5140 | 23.92 | 0.0000 |

cond_new | 5.1306 | 1.0511 | 4.88 | 0.0000 |

stock_photo | 1.0803 | 1.0568 | 1.02 | 0.3085 |

duration | –0.0268 | 0.1904 | –0.14 | 0.8882 |

wheels | 7.2852 | 0.5547 | 13.13 | 0.0000 |

R^{2}_{adj} = 0.7108 |

Table 2. The fit for the regression model for predictors cond_new, stock_photo, and wheels.

 | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|

(Intercept) | 36.0483 | 0.9745 | 36.99 | 0.0000 |

cond_new | 5.1763 | 0.9961 | 5.20 | 0.0000 |

stock_photo | 1.1177 | 1.0192 | 1.10 | 0.2747 |

wheels | 7.2984 | 0.5448 | 13.40 | 0.0000 |

R^{2}_{adj} = 0.7128 |

- the residuals of the model are nearly normal,
- the variability of the residuals is nearly constant,
- the residuals are independent, and
- each variable is linearly related to the outcome.

[latex]\displaystyle\widehat{\text{price}}=36.05+5.18\times\text{cond\_new}+1.12\times\text{stock\_photo}+7.30\times\text{wheels}[/latex]
**Normal probability plot.** A normal probability plot of the residuals is shown in Figure 1. While the plot exhibits some minor irregularities, there are no outliers that might be cause for concern. In a normal probability plot for residuals, we tend to be most worried about residuals that appear to be outliers, since these indicate long tails in the distribution of residuals.
[caption id="attachment_1462" align="aligncenter" width="458"] Figure 1. A normal probability plot of the residuals is helpful in identifying observations that might be outliers.[/caption]
**Absolute values of residuals against fitted values.** A plot of the absolute value of the residuals against their corresponding fitted values [latex]\left(\hat{y}_i\right)[/latex] is shown in Figure 2. This plot is helpful to check the condition that the variance of the residuals is approximately constant. We don't see any obvious deviations from constant variance in this example.
[caption id="attachment_1463" align="aligncenter" width="531"] Figure 2. Comparing the absolute value of the residuals against the fitted values [latex]\left(\hat{y}_i\right)[/latex] is helpful in identifying deviations from the constant variance assumption.[/caption]
**Residuals in order of their data collection.** A plot of the residuals in the order their corresponding auctions were observed is shown in Figure 3. Such a plot is helpful in identifying any connection between cases that are close to one another, e.g. we could look for declining prices over time or if there was a time of the day when auctions tended to fetch a higher price. Here we see no structure that indicates a problem.[footnote]An especially rigorous check would use time series methods. For instance, we could check whether consecutive residuals are correlated. Doing so with these residuals yields no statistically significant correlations.[/footnote]
[caption id="attachment_1464" align="aligncenter" width="550"] Figure 3. Plotting residuals in the order that their corresponding observations were collected helps identify connections between successive observations. If it seems that consecutive observations tend to be close to each other, this indicates the independence assumption of the observations would fail.[/caption]
**Residuals against each predictor variable.** We consider a plot of the residuals against the cond_new variable, the residuals against the stock_photo variable, and the residuals against the wheels variable. These plots are shown in Figure 4. For the two-level condition variable, we are guaranteed not to see any remaining trend, and instead we are checking that the variability doesn't fluctuate across groups, which it does not. However, looking at the stock_photo variable, we find that there is some difference in the variability of the residuals in the two groups. Additionally, when we consider the residuals against the wheels variable, we see some possible structure. There appears to be curvature in the residuals, indicating the relationship is probably not linear.
[caption id="attachment_1465" align="aligncenter" width="786"] Figure 4. For the condition and stock photo variables, we check for differences in the distribution shape or variability of the residuals. In the case of the stock photos variable, we see a little less variability in the unique photo group than the stock photo group. For numerical predictors, we also check for trends or other structure. We see some slight bowing in the residuals against the wheels variable in the bottom plot.[/caption]
It is necessary to summarize diagnostics for any model fit. If the diagnostics support the model assumptions, this would improve credibility in the findings. If the diagnostic assessment shows remaining underlying structure in the residuals, we should try to adjust the model to account for that structure. If we are unable to do so, we may still report the model but must also note its shortcomings. In the case of the auction data, we report that there appears to be non-constant variance in the stock_photo variable and that there may be a nonlinear relationship between the total price and the number of wheels included for an auction. This information would be important to buyers and sellers who may review the analysis, and omitting this information could be a setback to the very people whom the model might assist.

Table 1. Descriptions for 11 variables in the email data set. Notice that all of the variables are indicator variables, which take the value 1 if the specified characteristic is present and 0 otherwise.

Variable | Description |
---|---|

spam | Specifies whether the message was spam |

to_multiple | An indicator variable for if more than one person was listed in the To field of the email. |

cc | An indicator for if someone was CCed on the email |

attach | An indicator for if there was an attachment, such as a document or image |

dollar | An indicator for if the word “dollar” or dollar symbol ($) appeared in the email. |

winner | An indicator for if the word “winner” appeared in the email message |

inherit | An indicator for if the word “inherit” (or a variation, like “inheritance”) appeared in the email. |

password | An indicator for if the word “password” was present in the email. |

format | Indicates if the email contained special formatting, such as bolding, tables, or links |

re_subj | Indicates whether “Re:” was included at the start of the email subject. |

exclaim_subj | Indicates whether any exclamation point was included in the email subject |

[latex]\text{transformation}\left(p_i\right)=\beta_0+\beta_1x_{1,i}+\beta_2x_{2,i}+\dots+\beta_k{x}_{k,i}[/latex]
We want to choose a transformation in the equation above that makes practical and mathematical sense. For example, we want a transformation that makes the range of possibilities on the left hand side of the equation above equal to the range of possibilities for the right hand side; if there was no transformation for this equation, the left hand side could only take values between 0 and 1, but the right hand side could take values outside of this range. A common transformation for *p*_{i} is the

[latex]\displaystyle\text{logit}\left(p_i\right)=\log_{e}\left(\frac{p_i}{1-p_i}\right)[/latex]
The logit transformation is shown in Figure 1. Below, we rewrite the transformation equation using the logit transformation of *p*_{i}:

[latex]\displaystyle\log_e\left(\frac{p_i}{1-p_i}\right)=\beta_0+\beta_1x_{1,i}+\beta_2x_{2,i}+\dots+\beta_k{x}_{k,i}[/latex]
[caption id="attachment_1468" align="aligncenter" width="768"] Figure 1. Values of *p _{i}* against values of logit(

[latex]\displaystyle{p}_i=\frac{e^{\beta_0+\beta_1x_{1,i}+\beta_2x_{2,i}+\dots+\beta_k{x}_{k,i}}}{1 + e^{\beta_0+\beta_1x_{1,i}+\beta_2x_{2,i}+\dots+\beta_k{x}_{k,i}}}[/latex]
As with most applied data problems, we substitute the point estimates for the parameters (the *β*_{i}) so that we may make use of this formula. In Example 1, the probabilities were calculated as

[latex]\displaystyle\begin{array}{lll}\frac{e^{-2.12}}{1+e^{-2.12}}=0.11&\text{ }&\frac{e^{-2.12-1.81}}{1+e^{-2.12-1.81}}=0.02\end{array}[/latex]
While the information about whether the email is addressed to multiple people is a helpful start in classifying email as spam or not, the probabilities of 11% and 2% are not dramatically different, and neither provides very strong evidence about which particular email messages are spam. To get more precise estimates, we'll need to include many more variables in the model.
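
The two probabilities above can be reproduced with a few lines of code. The sketch below defines the logit transformation and its inverse using Python's standard library, then evaluates the inverse at the two linear-predictor values −2.12 and −2.12 − 1.81 that appear in the calculation. The function names are illustrative.

```python
import math


def logit(p):
    """Map a probability in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))


def inverse_logit(x):
    """Map any real value back to a probability in (0, 1)."""
    return math.exp(x) / (1 + math.exp(x))


# Linear-predictor values taken from the calculation in the text.
p_single = inverse_logit(-2.12)           # email not sent to multiple people
p_multiple = inverse_logit(-2.12 - 1.81)  # email sent to multiple people

print(round(p_single, 2), round(p_multiple, 2))
```

Because `inverse_logit` always returns a value strictly between 0 and 1, the right-hand side of the model can take any real value while the left-hand side remains a valid probability.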
We used statistical software to fit the logistic regression model with all ten predictors described in Table 1. Like multiple regression, the result may be presented in a summary table, which is shown in Table 2. The structure of this table is almost identical to that of multiple regression; the only notable difference is that the *p*-values are calculated using the normal distribution rather than the *t*-distribution.

Table 2. Summary table for the full logistic regression model for the spam filter example | ||||
---|---|---|---|---|

Estimate | Std. Error | z value | Pr ( > |z|) | |

(Intercept) | –0.8362 | 0.0962 | –8.69 | 0.0000 |

to_multiple | –2.8836 | 0.3121 | –9.24 | 0.0000 |

winner | 1.7038 | 0.3254 | 5.24 | 0.0000 |

format | –1.5902 | 0.1239 | –12.84 | 0.0000 |

re_subj | –2.9082 | 0.3708 | –7.84 | 0.0000 |

exclaim_subj | 0.1355 | 0.2268 | 0.60 | 0.5503 |

cc | –0.4863 | 0.3054 | –1.59 | 0.1113 |

attach | 0.9790 | 0.2170 | 4.51 | 0.0000 |

dollar | –0.0582 | 0.1589 | –0.37 | 0.7144 |

inherit | 0.2093 | 0.3197 | 0.65 | 0.5127 |

password | –1.4929 | 0.5295 | –2.82 | 0.0048 |

Table 3. Summary table for the logistic regression model for the spam filter, where variable selection has been performed | ||||
---|---|---|---|---|

Estimate | Std. Error | z value | Pr ( > |z|) | |

(Intercept) | –0.8595 | 0.0910 | –9.44 | 0.0000 |

to_multiple | –2.8372 | 0.3092 | –9.18 | 0.0000 |

winner | 1.7370 | 0.3218 | 5.40 | 0.0000 |

format | –1.5569 | 0.1207 | –12.90 | 0.0000 |

re_subj | –3.0482 | 0.3630 | –8.40 | 0.0000 |

attach | 0.8643 | 0.2042 | 4.23 | 0.0000 |

password | –1.4871 | 0.5290 | –2.81 | 0.0049 |

- The email characteristics generally indicate the email is not spam, and so the resulting probability that the email is spam is quite low, say, under 0.05.
- The characteristics generally indicate the email is spam, and so the resulting probability that the email is spam is quite large, say, over 0.95.
- The characteristics roughly balance each other out in terms of evidence for and against the message being classified as spam. Its probability falls in the remaining range, meaning the email cannot be adequately classified as spam or not spam.

- Each predictor *x*_{i} is linearly related to logit(*p*_{i}) if all other predictors are held constant.
- Each outcome *Y*_{i} is independent of the other outcomes.

[latex]\displaystyle{e}_i=Y_i-\hat{p}_i[/latex]

We could plot these residuals against a variety of variables or in their order of collection, as we did with the residuals in multiple regression. However, since the model will need to be revised to effectively classify spam and you have already seen similar residual plots, we won't investigate the residuals here.

[caption id="attachment_1471" align="aligncenter" width="774"] Figure 3: The solid black line provides the empirical estimate of the probability for observations based on their predicted probabilities (confidence bounds are also shown for this line), which is fit using natural splines. A small amount of noise was added to the observations in the plot to allow more observations to be seen.[/caption]

- An indicator variable could be used to represent whether there was prior two-way correspondence with a message's sender. For instance, if you sent a message to john@example.com and then John sent you an email, this variable would take value 1 for the email that John sent. If you had never sent John an email, then the variable would be set to 0.
- A second indicator variable could utilize an account's past spam flagging information. The variable could take value 1 if the sender of the message has previously sent messages flagged as spam.
- A third indicator variable could flag emails that contain links included in previous spam messages. If such a link is found, then set the variable to 1 for the email. Otherwise, set it to 0.

Table 4. A contingency table for spam and a new variable that represents whether there had been correspondence with the sender in the preceding 30 days | |||
---|---|---|---|

prior correspondence | |||

no | yes | Total | |

spam | 367 | 0 | 367 |

not spam | 2464 | 1090 | 3554 |

Total | 2831 | 1090 | 3921 |

Estimate | Std. Error | t-value | Pr(> |t|) | |

(Intercept) | 123.05 | 0.65 | 189.60 | 0.0000 |

smoke | –8.94 | 1.03 | –8.65 | 0.0000 |

- Write the equation of the regression line.
- Interpret the slope in this context, and calculate the predicted birth weight of babies born to smoker and non-smoker mothers.
- Is there a statistically significant relationship between the average birth weight and smoking?

Estimate | Std. Error | t-value | Pr(> |t|) | |

(Intercept) | 120.07 | 0.60 | 199.94 | 0.0000 |

parity | –1.93 | 1.19 | –1.62 | 0.1052 |

- Write the equation of the regression line.
- Interpret the slope in this context, and calculate the predicted birth weight of first borns and others.
- Is there a statistically significant relationship between the average birth weight and parity?

bwt | gestation | parity | age | height | weight | smoke | |

1 | 120 | 284 | 0 | 27 | 62 | 100 | 0 |

2 | 113 | 282 | 0 | 33 | 64 | 135 | 0 |

[latex]\vdots[/latex] | [latex]\vdots[/latex] | [latex]\vdots[/latex] | [latex]\vdots[/latex] | [latex]\vdots[/latex] | [latex]\vdots[/latex] | [latex]\vdots[/latex] | [latex]\vdots[/latex] |

1236 | 117 | 297 | 0 | 38 | 65 | 129 | 0 |

Estimate | Std. Error | t-value | Pr(>|t|) | |

(Intercept) | –80.41 | 14.35 | –5.60 | 0.0000 |

gestation | 0.44 | 0.03 | 15.26 | 0.0000 |

parity | –3.33 | 1.13 | –2.95 | 0.0033 |

age | –0.01 | 0.09 | –0.10 | 0.9170 |

height | 1.15 | 0.21 | 5.63 | 0.0000 |

weight | 0.05 | 0.03 | 1.99 | 0.0471 |

smoke | –8.40 | 0.95 | –8.81 | 0.0000 |

- Write the equation of the regression line that includes all of the variables.
- Interpret the slopes of gestation and age in this context.
- The coefficient for parity is different than in the linear model shown in Exercise 2. Why might there be a difference?
- Calculate the residual for the first observation in the data set.
- The variance of the residuals is 249.28, and the variance of the birth weights of all babies in the data set is 332.57. Calculate the *R*^{2} and the adjusted *R*^{2}. Note that there are 1,236 observations in the data set.

eth | sex | lrn | days | |

1 | 0 | 1 | 1 | 2 |

2 | 0 | 1 | 1 | 11 |

[latex]\vdots[/latex] | [latex]\vdots[/latex] | [latex]\vdots[/latex] | [latex]\vdots[/latex] | [latex]\vdots[/latex] |

146 | 1 | 0 | 0 | 37 |

Estimate | Std. Error | t-value | Pr(>|t|) | |

(Intercept) | 18.93 | 2.57 | 7.37 | 0.0000 |

eth | –9.11 | 2.60 | –3.51 | 0.0000 |

sex | 3.10 | 2.64 | 1.18 | 0.2411 |

lrn | 2.15 | 2.65 | 0.81 | 0.4177 |

- Write the equation of the regression line.
- Interpret each one of the slopes in this context.
- Calculate the residual for the first observation in the data set: a student who is aboriginal, male, a slow learner, and missed 2 days of school.
- The variance of the residuals is 240.57, and the variance of the number of absent days for all students in the data set is 264.17. Calculate the *R*^{2} and the adjusted *R*^{2}. Note that there are 146 observations in the data set.

Estimate | Std. Error | t-value | Pr(>|t|) | |

(Intercept) | 3.45 | 0.35 | 9.85 | 0.00 |

studyweek | 0.00 | 0.00 | 0.27 | 0.79 |

sleepnight | 0.01 | 0.05 | 0.11 | 0.91 |

outnight | 0.05 | 0.05 | 1.01 | 0.32 |

gender | –0.08 | 0.12 | –0.68 | 0.50 |

- Calculate a 95% confidence interval for the coefficient of gender in the model, and interpret it in the context of the data.
- Would you expect a 95% confidence interval for the slope of each of the remaining variables to include 0? Explain.

Estimate | Std. Error | t-value | Pr(> |t|) | |

(Intercept) | –57.99 | 8.64 | –6.71 | 0.00 |

height | 0.34 | 0.13 | 2.61 | 0.01 |

diameter | 4.71 | 0.26 | 17.82 | 0.00 |

- Calculate a 95% confidence interval for the coefficient of height, and interpret it in the context of the data.
- One tree in this sample is 79 feet tall, has a diameter of 11.3 inches, and is 24.2 cubic feet in volume. Determine if the model overestimates or underestimates the volume of this tree, and by how much.

Model | Adjusted R^{2} | |

1 | Full model | 0.2541 |

2 | No gestation | 0.1031 |

3 | No parity | 0.2492 |

4 | No age | 0.2547 |

5 | No height | 0.2311 |

6 | No weight | 0.2536 |

7 | No smoking status | 0.2072 |

Model | Adjusted R^{2} | |

1 | Full model | 0.0701 |

2 | No ethnicity | –0.0033 |

3 | No sex | 0.0676 |

4 | No learner status | 0.0723 |

variable | gestation | parity | age | height | weight | smoke |

p-value | 2.2 × 10^{−16} | 0.1052 | 0.2375 | 2.97 × 10^{−12} | 8.2 × 10^{−8} | 2.2 × 10^{−16} |

R^{2}_{adj} | 0.4657 | 0.0013 | 0.0003 | 0.0386 | 0.0229 | 0.0569 |

variable | ethnicity | sex | learner status |

p-value | 0.007 | 0.3142 | 0.5870 |

R^{2}_{adj} | 0.0714 | 0.0001 | 0 |

Full Model | Reduced Model | |||||||
---|---|---|---|---|---|---|---|---|

Estimate | SE | Z | Pr(>|Z|) | Estimate | SE | Z | Pr(>|Z|) | |

(Intercept) | 39.2349 | 11.5368 | 3.40 | 0.0007 | 33.5095 | 9.9053 | 3.38 | 0.0007 |

sex_male | −1.2376 | 0.6662 | −1.86 | 0.0632 | −1.4207 | 0.6457 | −2.20 | 0.0278 |

head_length | −0.1601 | 0.1386 | −1.16 | 0.2480 | ||||

skull_width | −0.2012 | 0.1327 | −1.52 | 0.1294 | −0.2787 | 0.1226 | −2.27 | 0.0231 |

total_length | 0.6488 | 0.1531 | 4.24 | 0.0000 | 0.5687 | 0.1322 | 4.30 | 0.0000 |

tail_length | −1.8708 | 0.3741 | −5.00 | 0.0000 | −1.8057 | 0.3599 | −5.02 | 0.0000 |

- Examine each of the predictors. Are there any outliers that are likely to have a very large influence on the logistic regression model?
- The summary table for the full model indicates that at least one variable should be eliminated when using the p-value approach for variable selection: head length. The second component of the table summarizes the reduced model following variable selection. Explain why the remaining estimates change between the two models.

Shuttle Mission | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|

Temperature | 53 | 57 | 58 | 63 | 66 | 67 | 67 | 67 | 68 | 69 | 70 | 70 |

Damaged | 5 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |

Undamaged | 1 | 5 | 5 | 5 | 6 | 6 | 6 | 6 | 6 | 6 | 5 | 6 |

Shuttle Mission | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
---|---|---|---|---|---|---|---|---|---|---|---|

Temperature | 70 | 70 | 71 | 73 | 75 | 75 | 76 | 76 | 78 | 79 | 81 |

Damaged | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

Undamaged | 5 | 6 | 6 | 6 | 6 | 5 | 6 | 6 | 6 | 6 | 6 |

- Each column of the table above represents a different shuttle mission. Examine these data and describe what you observe with respect to the relationship between temperatures and damaged O-rings.
- Failures have been coded as 1 for a damaged O-ring and 0 for an undamaged O-ring, and a logistic regression model was fit to these data. A summary of this model is given below. Describe the key components of this summary table in words.

Estimate | Std. Error | z-value | Pr(>|z|) | |
---|---|---|---|---|

(Intercept) | 11.6630 | 3.2963 | 3.54 | 0.0004 |

Temperature | −0.2162 | 0.0532 | −4.07 | 0.0000 |

- Write out the logistic model using the point estimates of the model parameters.
- Based on the model, do you think concerns regarding O-rings are justified? Explain.

Estimate | SE | Z | Pr(>|Z|) | |
---|---|---|---|---|

(Intercept) | 33.5095 | 9.9053 | 3.38 | 0.0007 |

sex_male | −1.4207 | 0.6457 | −2.20 | 0.0278 |

skull_width | −0.2787 | 0.1226 | −2.27 | 0.0231 |

total_length | 0.5687 | 0.1322 | 4.30 | 0.0000 |

tail_length | −1.8057 | 0.3599 | −5.02 | 0.0000 |

- Write out the form of the model. Also identify which of the variables are positively associated when controlling for other variables.
- Suppose we see a brushtail possum at a zoo in the US, and a sign says the possum had been captured in the wild in Australia, but it doesn't say which part of Australia. However, the sign does indicate that the possum is male, its skull is about 63 mm wide, its tail is 37 cm long, and its total length is 83 cm. What is the reduced model's computed probability that this possum is from Victoria? How confident are you in the model's accuracy of this probability calculation?

- The data provided in the previous exercise are shown in the plot. The logistic model fit to these data may be written as [latex]\displaystyle\log\left(\frac{\hat{p}}{1-\hat{p}}\right)=11.6630-0.2162\times\text{Temperature}[/latex] where [latex]\hat{p}[/latex] is the model-estimated probability that an O-ring will become damaged. Use the model to calculate the probability that an O-ring will become damaged at each of the following ambient temperatures: 51, 53, and 55 degrees Fahrenheit. The model-estimated probabilities for several additional ambient temperatures are provided below, where subscripts indicate the temperature: [latex]\displaystyle\begin{array}{llll}\hat{p}_{57}=0.341&\hat{p}_{59}=0.251&\hat{p}_{61}=0.179&\hat{p}_{63}=0.124\\\hat{p}_{65}=0.084&\hat{p}_{67}=0.056&\hat{p}_{69}=0.037&\hat{p}_{71}=0.024\end{array}[/latex]
- Add the model-estimated probabilities from part 1 on the plot, then connect these dots using a smooth curve to represent the model-estimated probabilities.
- Describe any concerns you may have regarding applying logistic regression in this application, and note any assumptions that are required to accept the model's validity.
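The probabilities listed in the exercise can be reproduced from the fitted model. The following is a minimal sketch, assuming the listed temperatures run from 57 to 71 degrees in steps of two; the function name `damage_probability` is hypothetical, and the coefficients 11.6630 and −0.2162 are the point estimates from the summary table.

```python
import math

def damage_probability(temperature):
    """Model-estimated probability of O-ring damage at a temperature (deg F)."""
    log_odds = 11.6630 - 0.2162 * temperature
    return 1.0 / (1.0 + math.exp(-log_odds))

# Probabilities listed in the exercise, keyed by temperature (assumed 57..71).
listed = {57: 0.341, 59: 0.251, 61: 0.179, 63: 0.124,
          65: 0.084, 67: 0.056, 69: 0.037, 71: 0.024}

for temp, p_listed in listed.items():
    # Print the computed value next to the listed one for comparison.
    print(temp, round(damage_probability(temp), 3), p_listed)
```

Calling `damage_probability` with 51, 53, or 55 gives the values requested in the first part of the exercise.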

- Classify hypothesis tests by type.
- Conduct and interpret hypothesis tests for two population means, population standard deviations known.
- Conduct and interpret hypothesis tests for two population means, population standard deviations unknown.
- Conduct and interpret hypothesis tests for two population proportions.
- Conduct and interpret hypothesis tests for matched or paired samples.

NOTE

Independent groups (samples are independent)

- Test of two population means.
- Test of two population proportions.

Matched or paired samples (samples are dependent)

- Test of the two population proportions by testing one population mean of differences.


- *O* = observed values
- *E* = expected values
- *i* = the number of rows in the table
- *j* = the number of columns in the table

Type of Volunteer | 1–3 Hours | 4–6 Hours | 7–9 Hours | Row Total |
---|---|---|---|---|

Community College Students | 111 | 96 | 48 | 255 |

Four-Year College Students | 96 | 133 | 61 | 290 |

Nonstudents | 91 | 150 | 53 | 294 |

Column Total | 298 | 379 | 162 | 839 |

Number of Hours Worked Per Week by Volunteer Type (Expected)
The table contains expected (E) values (data) | |||
---|---|---|---|

Type of Volunteer | 1–3 Hours | 4–6 Hours | 7–9 Hours |

Community College Students | 90.57 | 115.19 | 49.24 |

Four-Year College Students | 103.00 | 131.00 | 56.00 |

Nonstudents | 104.42 | 132.81 | 56.77 |

**Probability statement:** *p*-value = *P*(*χ ^{2}* > 12.99) = 0.0113

**Compare *α* and the *p*-value:** Since no *α* is given, assume *α* = 0.05. *p*-value = 0.0113. *α* > *p*-value.

**Make a decision:** Since *α* > *p*-value, reject *H _{0}*. This means that the factors are not independent.

Press the `MATRX` key and arrow over to `EDIT`. Press `1:[A]`. Press `3 ENTER 3 ENTER`. Enter the table values by row from the table. Press `ENTER` after each. Press `2nd QUIT`. Press `STAT` and arrow over to `TESTS`. Arrow down to `C:χ2-TEST`. Press `ENTER`. You should see `Observed:[A] and Expected:[B]`. Arrow down to `Calculate`. Press `ENTER`. The test statistic is 12.9909 and the *p*-value = 0.0113. Do the procedure a second time, but arrow down to `Draw` instead of `Calculate`.
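The calculator result can be cross-checked in software. The sketch below, in plain Python, computes the expected counts, the chi-square statistic, and the degrees of freedom for the observed volunteer table. The function names are illustrative, and the right-tail helper uses a closed form that holds only when the degrees of freedom are even (here *df* = 4); for odd *df* a statistics library would be needed.

```python
import math

def chi_square_test(observed):
    """Return (statistic, df, expected) for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count = (row total)(column total) / (total surveyed).
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

def chi_square_sf_even_df(x, df):
    """Right-tail area P(chi-square > x); closed form valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

# Observed hours worked per week by volunteer type (rows) and hours (columns).
volunteers = [[111, 96, 48], [96, 133, 61], [91, 150, 53]]
stat, df, expected = chi_square_test(volunteers)
p_value = chi_square_sf_even_df(stat, df)
print(round(stat, 4), df, round(p_value, 4))
```

The printed statistic and *p*-value should agree with the calculator output of 12.9909 and 0.0113 up to rounding.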

The Bureau of Labor Statistics gathers data about employment in the United States. A sample is taken to calculate the number of U.S. citizens working in one of several industry sectors over time. The table below shows the results:

Industry Sector | 2000 | 2010 | 2020 | Total |
---|---|---|---|---|

Nonagriculture wage and salary | 13,243 | 13,044 | 15,018 | 41,305 |

Goods-producing, excluding agriculture | 2,457 | 1,771 | 1,950 | 6,178 |

Services-providing | 10,786 | 11,273 | 13,068 | 35,127 |

Agriculture, forestry, fishing, and hunting | 240 | 214 | 201 | 655 |

Nonagriculture self-employed and unpaid family worker | 931 | 894 | 972 | 2,797 |

Secondary wage and salary jobs in agriculture and private household industries | 14 | 11 | 11 | 36 |

Secondary jobs as a self-employed or unpaid family worker | 196 | 144 | 152 | 492 |

Total | 27,867 | 27,351 | 31,372 | 86,590 |

We want to know if the change in the number of jobs is independent of the change in years. State the null and alternative hypotheses and the degrees of freedom.

Press the `MATRX` key and arrow over to `EDIT`. Press `1:[A]`. Press `3 ENTER 3 ENTER`. Enter the table values by row. Press `ENTER` after each. Press `2nd QUIT`. Press `STAT` and arrow over to `TESTS`. Arrow down to `C:χ2-TEST`. Press `ENTER`. You should see `Observed:[A] and Expected:[B]`. Arrow down to `Calculate`. Press `ENTER`. The test statistic is 227.73 and the *p*-value = 5.90E-42 ≈ 0. Do the procedure a second time but arrow down to `Draw` instead of `Calculate`.

Need to Succeed in School vs. Anxiety Level | ||||||
---|---|---|---|---|---|---|

Need to Succeed in School | High Anxiety | Med-high Anxiety | Medium Anxiety | Med-low Anxiety | Low Anxiety | Row Total |

High Need | 35 | 42 | 53 | 15 | 10 | 155 |

Medium Need | 18 | 48 | 63 | 33 | 31 | 193 |

Low Need | 4 | 5 | 11 | 15 | 17 | 52 |

Column Total | 57 | 95 | 127 | 63 | 58 | 400 |

- How many high anxiety level students are expected to have a high need to succeed in school?
- If the two variables are independent, how many students do you expect to have a low need to succeed in school and a med-low level of anxiety?
- [latex]\displaystyle{E}=\frac{(\text{row total})(\text{column total})}{\text{total surveyed}}[/latex] = ________
- The expected number of students who have a med-low anxiety level and a low need to succeed in school is about ________.

Solution:

- The column total for a high anxiety level is 57. The row total for high need to succeed in school is 155. The sample size or total surveyed is 400. [latex]\displaystyle{E}=\frac{(\text{row total})(\text{column total})}{\text{total surveyed}}=\frac{155\cdot57}{400}=22.09[/latex] The expected number of students who have a high anxiety level and a high need to succeed in school is about 22.
- The column total for a med-low anxiety level is 63. The row total for a low need to succeed in school is 52. The sample size or total surveyed is 400.
- [latex]\displaystyle{E}=\frac{(\text{row total})(\text{column total})}{\text{total surveyed}} = 8.19[/latex]
- 8
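The expected-count formula used in this solution is easy to check numerically. This is a small sketch with a hypothetical helper name, `expected_count`, using the row totals, column totals, and sample size from the anxiety table.

```python
def expected_count(row_total, column_total, total_surveyed):
    """Expected cell count under independence: (row)(column) / total."""
    return row_total * column_total / total_surveyed

# High need (row total 155) and high anxiety (column total 57), n = 400.
high_high = expected_count(155, 57, 400)

# Low need (row total 52) and med-low anxiety (column total 63), n = 400.
low_medlow = expected_count(52, 63, 400)

print(high_high, low_medlow)
```

Rounding the two results to whole students gives the counts of about 22 and about 8 stated in the solution.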

Refer back to the information in the Try It about the Bureau of Labor Statistics. How many service providing jobs are there expected to be in 2020? How many nonagriculture wage and salary jobs are there expected to be in 2020?

12,727, 14,965

Distribution of Living Arrangements for College Males and College Females | ||||
---|---|---|---|---|

Dormitory | Apartment | With Parents | Other | |

Males | 72 | 84 | 49 | 45 |

Females | 91 | 86 | 88 | 35 |

Press the `MATRX` key and arrow over to `EDIT`. Press `1:[A]`. Press `2 ENTER 4 ENTER`. Enter the table values by row. Press `ENTER` after each. Press `2nd QUIT`. Press `STAT` and arrow over to `TESTS`. Arrow down to `C:χ2-TEST`. Press `ENTER`. You should see `Observed:[A] and Expected:[B]`. Arrow down to `Calculate`. Press `ENTER`. The test statistic is 10.1287 and the *p*-value = 0.0175. Do the procedure a second time but arrow down to `Draw` instead of `Calculate`.

**Compare *α* and the *p*-value:** Since no *α* is given, assume *α* = 0.05. *p*-value = 0.0175. *α* > *p*-value.
**Make a decision:** Since *α* > *p*-value, reject *H*_{0}. This means that the distributions are not the same.
**Conclusion:** At a 5% level of significance, from the data, there is sufficient evidence to conclude that the distributions of living arrangements for male and female college students are not the same.
Notice that the conclusion is only that the distributions are not the same. We cannot use the test for homogeneity to draw any conclusions about how they differ.

Do families and singles have the same distribution of cars? Suppose that 100 randomly selected families and 200 randomly selected singles were asked what type of car they drove: sport, sedan, hatchback, truck, van/SUV. The results are shown in the table. Test at a level of significance of 0.05.

Sport | Sedan | Hatchback | Truck | Van/SUV | |
---|---|---|---|---|---|

Family | 5 | 15 | 35 | 17 | 28 |

Single | 45 | 65 | 37 | 46 | 7 |

Both before and after a recent earthquake, surveys were conducted asking voters which of the three candidates they planned on voting for in the upcoming city council election. The table below shows the results of the survey. Has there been a change in the distribution of voter preferences since the earthquake? Use a level of significance of 0.05.

Perez | Chung | Stevens | |

Before | 167 | 128 | 135 |

After | 214 | 197 | 225 |

Solution:

*H _{0}*: The distribution of voter preferences was the same before and after the earthquake.

Press the `MATRX` key and arrow over to `EDIT`. Press `1:[A]`. Press `2 ENTER 3 ENTER`. Enter the table values by row. Press `ENTER` after each. Press `2nd QUIT`. Press `STAT` and arrow over to `TESTS`. Arrow down to `C:χ2-TEST`. Press `ENTER`. You should see `Observed:[A] and Expected:[B]`. Arrow down to `Calculate`. Press `ENTER`. The test statistic is 3.2603 and the *p*-value = 0.1959. Do the procedure a second time but arrow down to `Draw` instead of `Calculate`.

**Compare *α* and the *p*-value:** *α* = 0.05 and the *p*-value = 0.1959. *α* < *p*-value.

**Make a decision:** Since *α* < *p*-value, do not reject *H _{0}*.

**Conclusion:** At a 5% level of significance, from the data, there is insufficient evidence to conclude that the distribution of voter preferences was not the same before and after the earthquake.

Ivy League schools receive many applications, but only some can be accepted. At the schools listed in the table, two types of applications are accepted: regular and early decision.

Application Type Accepted | Brown | Columbia | Cornell | Dartmouth | Penn | Yale |
---|---|---|---|---|---|---|

Regular | 2,115 | 1,792 | 5,306 | 1,734 | 2,685 | 1,245 |

Early Decision | 577 | 627 | 1,228 | 444 | 1,195 | 761 |

*H _{0}*: The distribution of regular applications accepted is the same as the distribution of early applications accepted.

*H _{a}*: The distribution of regular applications accepted is not the same as the distribution of early applications accepted.

Press the `MATRX` key and arrow over to `EDIT`. Press `1:[A]`. Press `2 ENTER 6 ENTER`. Enter the table values by row. Press `ENTER` after each. Press `2nd QUIT`. Press `STAT` and arrow over to `TESTS`. Arrow down to `C:χ2-TEST`. Press `ENTER`. You should see `Observed:[A] and Expected:[B]`. Arrow down to `Calculate`. Press `ENTER`. The test statistic is 430.06 and the *p*-value = 9.80E-91. Do the procedure a second time but arrow down to `Draw` instead of `Calculate`.

Data from the Insurance Institute for Highway Safety, 2013. Available online at www.iihs.org/iihs/ratings (accessed May 24, 2013).

“Energy use (kg of oil equivalent per capita).” The World Bank, 2013. Available online at http://data.worldbank.org/indicator/EG.USE.PCAP.KG.OE/countries (accessed May 24, 2013).

“Parent and Family Involvement Survey of 2007 National Household Education Survey Program (NHES),” U.S. Department of Education, National Center for Education Statistics. Available online at http://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2009030 (accessed May 24, 2013).

“Parent and Family Involvement Survey of 2007 National Household Education Survey Program (NHES),” U.S. Department of Education, National Center for Education Statistics. Available online at http://nces.ed.gov/pubs2009/2009030_sup.pdf (accessed May 24, 2013).

**Goodness-of-Fit:** Use the goodness-of-fit test to decide whether a population with an unknown distribution "fits" a known distribution. In this case there will be a single qualitative survey question or a single outcome of an experiment from a single population. Goodness-of-Fit is typically used to see if the population is uniform (all outcomes occur with equal frequency), the population is normal, or the population is the same as another population with a known distribution. The null and alternative hypotheses are:

- *H _{0}*: The population fits the given distribution.
- *H _{a}*: The population does not fit the given distribution.

**Independence:** Use the test for independence to decide whether two variables (factors) are independent or dependent. In this case there will be two qualitative survey questions or experiments and a contingency table will be constructed. The goal is to see if the two variables are unrelated (independent) or related (dependent). The null and alternative hypotheses are:

- *H _{0}*: The two variables (factors) are independent.
- *H _{a}*: The two variables (factors) are dependent.

**Homogeneity:** Use the test for homogeneity to decide if two populations with unknown distributions have the same distribution as each other. In this case there will be a single qualitative survey question or experiment given to two different populations. The null and alternative hypotheses are:

- *H _{0}*: The two populations follow the same distribution.
- *H _{a}*: The two populations have different distributions.

where:

- *n* = the total number of data
- *s*^{2} = sample variance
- *σ*^{2} = population variance

Math instructors are not only interested in how their students do on exams, on average, but how the exam scores vary. To many instructors, the variance (or standard deviation) may be more important than the average.
Suppose a math instructor believes that the standard deviation for his final exam is five points. One of his best students thinks otherwise. The student claims that the standard deviation is more than five points. If the student were to conduct a hypothesis test, what would the null and alternative hypotheses be?

A scuba instructor wants to record the collective depths each of his students dives during their checkout. He is interested in how the depths vary, even though everyone should have been at the same depth. He believes the standard deviation is three feet. His assistant thinks the standard deviation is less than three feet. If the instructor were to conduct a test, what would the null and alternative hypotheses be?

With individual lines at its various windows, a post office finds that the standard deviation for normally distributed waiting times for customers on Friday afternoon is 7.2 minutes. The post office experiments with a single, main waiting line and finds that for a random sample of 25 customers, the waiting times for customers have a standard deviation of 3.5 minutes.

With a significance level of 5%, test the claim that **a single line causes lower variation among waiting times (shorter waiting times) for customers**.
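The test statistic for this claim can be computed directly from the numbers given. The sketch below assumes the single-variance statistic (*n* − 1)·*s*²/*σ*² with *n* − 1 degrees of freedom, as summarized in this chapter; the function name `single_variance_statistic` is an illustrative choice.

```python
def single_variance_statistic(n, sample_sd, hypothesized_sd):
    """Chi-square statistic (n - 1) * s^2 / sigma^2 and its degrees of freedom."""
    df = n - 1
    statistic = df * sample_sd ** 2 / hypothesized_sd ** 2
    return statistic, df

# Post office example: n = 25 customers, s = 3.5 minutes, sigma = 7.2 minutes.
stat, df = single_variance_statistic(n=25, sample_sd=3.5, hypothesized_sd=7.2)
print(df, round(stat, 2))
```

Because the claim is that the single line produces lower variation, the test is left-tailed: the *p*-value is the area to the left of this statistic under the chi-square curve with 24 degrees of freedom.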

The FCC conducts broadband speed tests to measure how much data per second passes between a consumer’s computer and the internet. As of August of 2012, the standard deviation of Internet speeds across Internet Service Providers (ISPs) was 12.2 percent. Suppose a sample of 15 ISPs is taken, and the standard deviation is 13.2. An analyst claims that the standard deviation of speeds is more than what was reported. State the null and alternative hypotheses, compute the degrees of freedom, the test statistic, sketch the graph of the *p*-value, and draw a conclusion. Test at the 1% significance level.

“AppleInsider Price Guides.” Apple Insider, 2013. Available online at http://appleinsider.com/mac_price_guide (accessed May 14, 2013).

Data from the World Bank, June 5, 2012.

[latex]\displaystyle\chi^2=\frac{(n-1)\cdot{s}^2}{\sigma^2}[/latex] Test of a single variance statistic

Test of a Single Variance

- Use the test to determine variation.
- The degrees of freedom is the sample size minus 1 (*df* = *n* – 1).
- The test statistic is [latex]\frac{(n-1)\cdot{s}^2}{\sigma^2}[/latex], where *n* = the total number of data, *s*^{2} = sample variance, and *σ*^{2} = population variance.
- The test may be left-, right-, or two-tailed.

1. If the number of degrees of freedom for a chi-square distribution is 25, what is the population mean and standard deviation?

2. If *df* > 90, the distribution is _____________. If *df* = 15, the distribution is ________________.

3. When does the chi-square curve approximate a normal distribution?

4. Where is *μ* located on a chi-square curve?

5. Is it more likely the *df* is 90, 20, or two in the graph?

*Decide whether the following statements are true or false.*

6. As the number of degrees of freedom increases, the graph of the chi-square distribution looks more and more symmetrical.

7. The standard deviation of the chi-square distribution is twice the mean.

8. The mean and the median of the chi-square distribution are the same if *df* = 24.

9. An archeologist is calculating the distribution of the frequency of the number of artifacts she finds in a dig site. Based on previous digs, the archeologist creates an expected distribution broken down by grid sections in the dig site. Once the site has been fully excavated, she compares the actual number of artifacts found in each grid section to see if her expectation was accurate.

10. An economist is deriving a model to predict outcomes on the stock market. He creates a list of expected points on the stock market index for the next two weeks. At the close of each day’s trading, he records the actual points on the index. He wants to see how well his model matched what actually happened.

11. A personal trainer is putting together a weight-lifting program for her clients. For a 90-day program, she expects each client to lift a specific maximum weight each week. As she goes along, she records the actual maximum weights her clients lifted. She wants to know how well her expectations met with what was observed.

Grade | Proportion |
---|---|

A | 0.25 |

B | 0.30 |

C | 0.35 |

D | 0.10 |

The actual distribution for a class of 20 is in the table below.

Grade | Frequency |
---|---|

A | 7 |

B | 7 |

C | 5 |

D | 1 |

12. *df* = ______

13. State the null and alternative hypotheses.

16. At the 5% significance level, what can you conclude?

Ethnicity | Number of Cases |
---|---|

White | 2,229 |

Hispanic | 1,157 |

Black/African-American | 457 |

Asian, Pacific Islander | 232 |

Total = 4,075 |

Ethnicity | Percentage of total county population | Number expected (round to two decimal places) |
---|---|---|

White | 42.9% | 1748.18 |

Hispanic | 26.7% | |

Black/African-American | 2.6% | |

Asian, Pacific Islander | 27.8% | |

Total = 100% |

17. If the ethnicities of AIDS victims followed the ethnicities of the total county population, fill in the expected number of cases per ethnic group.

*Perform a goodness-of-fit test to determine whether the occurrence of AIDS cases follows the ethnicities of the general population of Santa Clara County.*

20. Is this a right-tailed, left-tailed, or two-tailed test?

21. degrees of freedom = _______

22. *χ ^{2}* test statistic = _______

23. *p*-value = _______

24. Graph the situation. Label and scale the horizontal axis. Mark the mean and test statistic. Shade in the region corresponding to the *p*-value.

Let *α* = 0.05

- Decision: ________________
- Reason for the Decision: ________________
- Conclusion (write out in complete sentences): ________________

25. Does it appear that the pattern of AIDS cases in Santa Clara County corresponds to the distribution of ethnic groups in this county? Why or why not?

*For each problem, use a solution sheet to solve the hypothesis test problem. Go to Appendix E for the chi-square solution sheet. Round expected frequency to two decimal places.*

26. A six-sided die is rolled 120 times. Fill in the expected frequency column. Then, conduct a hypothesis test to determine if the die is fair. The data in the table below are the result of the 120 rolls.

Face Value | Frequency | Expected Frequency |
---|---|---|

1 | 15 | |

2 | 29 | |

3 | 16 | |

4 | 15 | |

5 | 30 | |

6 | 15 | |
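One possible way to set up the computation for the die data is sketched below. It assumes the standard goodness-of-fit statistic, the sum of (observed − expected)² / expected over all categories; the function name `chi_square_gof` is hypothetical and chosen only for this example.

```python
def chi_square_gof(observed, expected):
    """Goodness-of-fit statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [15, 29, 16, 15, 30, 15]               # frequencies from the table
n = sum(observed)                                  # total number of rolls
expected = [n / len(observed)] * len(observed)     # a fair die gives each face an equal share
statistic = chi_square_gof(observed, expected)
df = len(observed) - 1                             # number of categories minus one

print(expected, statistic, df)
```

The statistic is then compared with a chi-square distribution with `df` degrees of freedom, using a table or a calculator.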

27. The marital status distribution of the U.S. male population, ages 15 and older, is as shown in the table below.

Marital Status | Percent | Expected Frequency |
---|---|---|

never married | 31.3 | |

married | 56.1 | |

widowed | 2.5 | |

divorced/separated | 10.1 | |

Suppose that a random sample of 400 U.S. young adult males, 18 to 24 years old, yielded the following frequency distribution. We are interested in whether this age group of males fits the distribution of the U.S. adult population. Calculate the frequency one would expect when surveying 400 people. Fill in the table, rounding to two decimal places.

Marital Status | Frequency |
---|---|

never married | 140 |

married | 238 |

widowed | 2 |

divorced/separated | 20 |

*Use the following information to answer the next two exercises:* The columns in the table below contain the Race/Ethnicity of U.S. Public Schools for a recent year, the percentages for the Advanced Placement Examinee Population for that class, and the Overall Student Population. Suppose the right column contains the result of a survey of 1,000 local students from that year who took an AP Exam.

Race/Ethnicity | AP Examinee Population | Overall Student Population | Survey Frequency |
---|---|---|---|

Asian, Asian American, or Pacific Islander | 10.2% | 5.4% | 113 |

Black or African-American | 8.2% | 14.5% | 94 |

Hispanic or Latino | 15.5% | 15.9% | 136 |

American Indian or Alaska Native | 0.6% | 1.2% | 10 |

White | 59.4% | 61.6% | 604 |

Not reported/other | 6.1% | 1.4% | 43 |

28. Perform a goodness-of-fit test to determine whether the local results follow the distribution of the U.S. overall student population based on ethnicity.

29. Perform a goodness-of-fit test to determine whether the local results follow the distribution of U.S. AP examinee population, based on ethnicity.

30. The City of South Lake Tahoe, CA, has an Asian population of 1,419 people, out of a total population of 23,609. Suppose that a survey of 1,419 self-reported Asians in the Manhattan, NY, area yielded the data in the table below. Conduct a goodness-of-fit test to determine if the self-reported sub-groups of Asians in the Manhattan area fit that of the Lake Tahoe area.

Race | Lake Tahoe Frequency | Manhattan Frequency |
---|---|---|

Asian Indian | 131 | 174 |

Chinese | 118 | 557 |

Filipino | 1,045 | 518 |

Japanese | 80 | 54 |

Korean | 12 | 29 |

Vietnamese | 9 | 21 |

Other | 24 | 66 |

31. Conduct a goodness-of-fit test to determine if the actual college majors of graduating females fit the distribution of their expected majors.

Major | Women - Expected Major | Women - Actual Major |
---|---|---|

Arts & Humanities | 14.0% | 670 |

Biological Sciences | 8.4% | 410 |

Business | 13.1% | 685 |

Education | 13.0% | 650 |

Engineering | 2.6% | 145 |

Physical Sciences | 2.6% | 125 |

Professional | 18.9% | 975 |

Social Sciences | 13.0% | 605 |

Technical | 0.4% | 15 |

Other | 5.8% | 300 |

Undecided | 8.0% | 420 |

32. Conduct a goodness-of-fit test to determine if the actual college majors of graduating males fit the distribution of their expected majors.

Major | Men - Expected Major | Men - Actual Major |
---|---|---|

Arts & Humanities | 11.0% | 600 |

Biological Sciences | 6.7% | 330 |

Business | 22.7% | 1130 |

Education | 5.8% | 305 |

Engineering | 15.6% | 800 |

Physical Sciences | 3.6% | 175 |

Professional | 9.3% | 460 |

Social Sciences | 7.6% | 370 |

Technical | 1.8% | 90 |

Other | 8.2% | 400 |

Undecided | 6.6% | 340 |

*Read the statement and decide whether it is true or false.*

33. In a goodness-of-fit test, the expected values are the values we would expect if the null hypothesis were true.

34. In general, if the observed values and expected values of a goodness-of-fit test are not close together, then the test statistic can get very large and on a graph will be way out in the right tail.

35. Use a goodness-of-fit test to determine if high school principals believe that students are absent equally during the week or not.

36. The test to use to determine if a six-sided die is fair is a goodness-of-fit test.

37. In a goodness-of-fit test, if the *p*-value is 0.0113, in general, do not reject the null hypothesis.

38. A sample of 212 commercial businesses was surveyed for recycling one commodity; a commodity here means any one type of recyclable material such as plastic or aluminum. The table below shows the business categories in the survey, the sample size of each category, and the number of businesses in each category that recycle one commodity. Based on the study, on average half of the businesses were expected to be recycling one commodity. As a result, the last column shows the expected number of businesses in each category that recycle one commodity. At the 5% significance level, perform a hypothesis test to determine if the observed number of businesses that recycle one commodity follows the uniform distribution of the expected values.

Business Type | Number in class | Observed Number that recycle one commodity | Expected number that recycle one commodity |
---|---|---|---|

Office | 35 | 19 | 17.5 |

Retail/Wholesale | 48 | 27 | 24 |

Food/Restaurants | 53 | 35 | 26.5 |

Manufacturing/Medical | 52 | 21 | 26 |

Hotel/Mixed | 24 | 9 | 12 |

39. The table below contains information from a survey among 499 participants classified according to their age groups. The second column shows the percentage of obese people per age class among the study participants. The last column comes from a different study at the national level that shows the corresponding percentages of obese people in the same age classes in the USA. Perform a hypothesis test at the 5% significance level to determine whether the survey participants are a representative sample of the USA obese population.

Age Class (Years) | Obese (Percentage) | Expected USA average (Percentage) |
---|---|---|

20–30 | 75.0 | 32.6 |

31–40 | 26.5 | 32.6 |

41–50 | 13.6 | 36.6 |

51–60 | 21.9 | 36.6 |

61–70 | 21.0 | 39.7 |

*Determine the appropriate test to be used in the next three exercises.*

40. A pharmaceutical company is interested in the relationship between age and presentation of symptoms for a common viral infection. A random sample is taken of 500 people with the infection across different age groups.

41. The owner of a baseball team is interested in the relationship between player salaries and team winning percentage. He takes a random sample of 100 players from different organizations.

42. A marathon runner is interested in the relationship between the brand of shoes runners wear and their run times. She takes a random sample of 50 runners and records their run times as well as the brand of shoes they were wearing.

Traveling Distance | Third class | Second class | First class | Total |
---|---|---|---|---|

1–100 miles | 21 | 14 | 6 | 41 |

101–200 miles | 18 | 16 | 8 | 42 |

201–300 miles | 16 | 17 | 15 | 48 |

301–400 miles | 12 | 14 | 21 | 47 |

401–500 miles | 6 | 6 | 10 | 22 |

Total | 73 | 67 | 60 | 200 |

43. State the hypotheses.

44. *H*_{0}: _______

45. *H*_{a}: _______

47. How many passengers are expected to travel between 201 and 300 miles and purchase second-class tickets?

48. How many passengers are expected to travel between 401 and 500 miles and purchase first-class tickets?

49. What is the test statistic?

50. What is the *p*-value?

51. What can you conclude at the 5% level of significance?

52. Complete the table.

Smoking Level Per Day | African American | Native Hawaiian | Latino | Japanese Americans | White | TOTALS |
---|---|---|---|---|---|---|

1-10 | ||||||

11-20 | ||||||

21-30 | ||||||

31+ | ||||||

TOTALS |

53. State the hypotheses.

56. Enter expected values in the table. Round to two decimal places.

Calculate the following values:

58. *χ*^{2} test statistic = ______

59. *p*-value = ______

60. Is this a right-tailed, left-tailed, or two-tailed test? Explain why.

61. Graph the situation. Label and scale the horizontal axis. Mark the mean and test statistic. Shade in the region corresponding to the *p*-value.

*α* = 0.05

- Decision: ___________________
- Reason for the decision: ___________________
- Conclusion (write out in a complete sentence): ___________________


*For each problem, use a solution sheet to solve the hypothesis test problem. Go to Appendix E for the chi-square solution sheet. Round expected frequency to two decimal places.*

64. A recent debate about where in the United States skiers believe the skiing is best prompted the following survey. Test to see if the best ski area is independent of the level of the skier.

U.S. Ski Area | Beginner | Intermediate | Advanced |
---|---|---|---|

Tahoe | 20 | 30 | 40 |

Utah | 10 | 30 | 60 |

Colorado | 10 | 40 | 50 |

65. Car manufacturers are interested in whether there is a relationship between the size of car an individual drives and the number of people in the driver’s family (that is, whether car size and family size are independent). To test this, suppose that 800 car owners were randomly surveyed with the results in the table. Conduct a test of independence.

Family Size | Sub & Compact | Mid-size | Full-size | Van & Truck |
---|---|---|---|---|

1 | 20 | 35 | 40 | 35 |

2 | 20 | 50 | 70 | 80 |

3–4 | 20 | 50 | 100 | 90 |

5+ | 20 | 30 | 70 | 70 |

66. College students may be interested in whether or not their majors have any effect on starting salaries after graduation. Suppose that 300 recent graduates were surveyed as to their majors in college and their starting salaries after graduation. The table below shows the data. Conduct a test of independence.

Major | < $50,000 | $50,000 – $68,999 | $69,000 + |
---|---|---|---|

English | 5 | 20 | 5 |

Engineering | 10 | 30 | 60 |

Nursing | 10 | 15 | 15 |

Business | 10 | 20 | 30 |

Psychology | 20 | 30 | 20 |

67. Some travel agents claim that honeymoon hot spots vary according to age of the bride. Suppose that 280 recent brides were interviewed as to where they spent their honeymoons. The information is given in the table below. Conduct a test of independence.

Location | 20–29 | 30–39 | 40–49 | 50 and over |
---|---|---|---|---|

Niagara Falls | 15 | 25 | 25 | 20 |

Poconos | 15 | 25 | 25 | 10 |

Europe | 10 | 25 | 15 | 5 |

Virgin Islands | 20 | 25 | 15 | 5 |

68. A manager of a sports club keeps information concerning the main sport in which members participate and their ages. To test whether there is a relationship between the age of a member and his or her choice of sport, 643 members of the sports club are randomly selected. Conduct a test of independence.

Sport | 18 - 25 | 26 - 30 | 31 - 40 | 41 and over |
---|---|---|---|---|

racquetball | 42 | 58 | 30 | 46 |

tennis | 58 | 76 | 38 | 65 |

swimming | 72 | 60 | 65 | 33 |

69. A major food manufacturer is concerned that the sales for its skinny french fries have been decreasing. As a part of a feasibility study, the company conducts research into the types of fries sold across the country to determine if the type of fries sold is independent of the area of the country. The results of the study are shown in the table. Conduct a test of independence.

Type of Fries | Northeast | South | Central | West |
---|---|---|---|---|

skinny fries | 70 | 50 | 20 | 25 |

curly fries | 100 | 60 | 15 | 30 |

steak fries | 20 | 40 | 10 | 10 |

70. According to Dan Lenard, an independent insurance agent in the Buffalo, N.Y. area, the following is a breakdown of the amount of life insurance purchased by males in the following age groups. He is interested in whether the age of the male and the amount of life insurance purchased are independent events. Conduct a test for independence.

Age of Males | None | < $200,000 | $200,000–$400,000 | $401,001–$1,000,000 | $1,000,001+ |
---|---|---|---|---|---|

20–29 | 40 | 15 | 40 | 0 | 5 |

30–39 | 35 | 5 | 20 | 20 | 10 |

40–49 | 20 | 0 | 30 | 0 | 30 |

50+ | 40 | 30 | 15 | 15 | 10 |

71. Suppose that 600 thirty-year-olds were surveyed to determine whether or not there is a relationship between the level of education an individual has and salary. Conduct a test of independence.

Annual Salary | Not a high school graduate | High school graduate | College graduate | Masters or doctorate |
---|---|---|---|---|

< $30,000 | 15 | 25 | 10 | 5 |

$30,000–$40,000 | 20 | 40 | 70 | 30 |

$40,000–$50,000 | 10 | 20 | 40 | 55 |

$50,000–$60,000 | 5 | 10 | 20 | 60 |

$60,000+ | 0 | 5 | 10 | 150 |

*Read the statement and decide whether it is true or false.*

72. The number of degrees of freedom for a test of independence is equal to the sample size minus one.

73. The test for independence uses tables of observed and expected data values.

74. The test to use when determining if the college or university a student chooses to attend is related to his or her socioeconomic status is a test for independence.

75. In a test of independence, the expected number is equal to the row total multiplied by the column total divided by the total surveyed.
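The expected-count rule described in the statement above can be tried out with a short sketch. The code below is only an illustration, applied to the traveling-distance table from earlier in this exercise set (columns: third, second, and first class); the function names `expected_table` and `independence_statistic` are assumptions introduced for this example, and the degrees of freedom use the usual (rows − 1)(columns − 1) rule.

```python
def expected_table(observed):
    """Expected count for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def independence_statistic(observed):
    """Chi-square statistic and degrees of freedom for a two-way table."""
    expected = expected_table(observed)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df

# Rows: 1-100, 101-200, 201-300, 301-400, 401-500 miles
travel = [
    [21, 14, 6],
    [18, 16, 8],
    [16, 17, 15],
    [12, 14, 21],
    [6, 6, 10],
]
expected = expected_table(travel)
stat, df = independence_statistic(travel)
print(expected[2][1], expected[4][2], df)
```

Note that the degrees of freedom depend on the table's dimensions, not on the number of people surveyed.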

76. An ice cream maker performs a nationwide survey about favorite flavors of ice cream in different geographic areas of the U.S. Based on the table, do the numbers suggest that geographic location is independent of favorite ice cream flavors? Test at the 5% significance level.

U.S. region/Flavor | Strawberry | Chocolate | Vanilla | Rocky Road | Mint Chocolate Chip | Pistachio | Row total |
---|---|---|---|---|---|---|---|

West | 12 | 21 | 22 | 19 | 15 | 8 | 97 |

Midwest | 10 | 32 | 22 | 11 | 15 | 6 | 96 |

East | 8 | 31 | 27 | 8 | 15 | 7 | 96 |

South | 15 | 28 | 30 | 8 | 15 | 6 | 102 |

Column Total | 45 | 112 | 101 | 46 | 60 | 27 | 391 |

77. The table provides a recent survey of the youngest online entrepreneurs whose net worth is estimated at one million dollars or more. Their ages range from 17 to 30. Each cell in the table illustrates the number of entrepreneurs who correspond to the specific age group and their net worth. Are the ages and net worth independent? Perform a test of independence at the 5% significance level.

Age Group Net Worth Value (in millions of US dollars) | 1–5 | 6–24 | ≥25 | Row Total |
---|---|---|---|---|

17–25 | 8 | 7 | 5 | 20 |

26–30 | 6 | 5 | 9 | 20 |

Column Total | 14 | 12 | 14 | 40 |

78. A 2013 poll in California surveyed people about taxing sugar-sweetened beverages. The results are presented in the table, and are classified by ethnic group and response type. Are the poll responses independent of the participants’ ethnic group? Conduct a test of independence at the 5% significance level.

Opinion/Ethnicity | Asian-American | White/Non-Hispanic | African-American | Latino | Row Total |
---|---|---|---|---|---|

Against tax | 48 | 433 | 41 | 160 | 628 |

In Favor of tax | 54 | 234 | 24 | 147 | 459 |

No opinion | 16 | 43 | 16 | 19 | 84 |

Column Total | 118 | 710 | 71 | 272 | 1171 |

79. A math teacher wants to see if two of her classes have the same distribution of test scores. What test should she use?

80. What are the null and alternative hypotheses for exercise 79?

81. A market researcher wants to see if two different stores have the same distribution of sales throughout the year. What type of test should he use?

82. A meteorologist wants to know if East and West Australia have the same distribution of storms. What type of test should she use?

83. What condition must be met to use the test for homogeneity?

*Use the following information to answer the next five exercises:* Do private practice doctors and hospital doctors have the same distribution of working hours? Suppose that a sample of 100 private practice doctors and 150 hospital doctors are selected at random and asked about the number of hours a week they work. The results are shown in the table.

 | 20–30 | 30–40 | 40–50 | 50–60 |
---|---|---|---|---|

Private Practice | 16 | 40 | 38 | 6 |

Hospital | 8 | 44 | 59 | 39 |

84. State the null and alternative hypotheses.

85. *df* = _______

86. What is the test statistic?

87. What is the *p*-value?

88. What can you conclude at the 5% significance level?

*For each word problem, use a solution sheet to solve the hypothesis test problem. Go to Appendix E for the chi-square solution sheet. Round expected frequency to two decimal places.*

89. A psychologist is interested in testing whether there is a difference in the distribution of personality types for business majors and social science majors. The results of the study are shown in the table. Conduct a test of homogeneity. Test at a 5% level of significance.

 | Open | Conscientious | Extrovert | Agreeable | Neurotic |
---|---|---|---|---|---|

Business | 41 | 52 | 46 | 61 | 58 |

Social Science | 72 | 75 | 63 | 80 | 65 |

90. Do men and women select different breakfasts? The breakfasts ordered by randomly selected men and women at a popular breakfast place is shown in the table. Conduct a test for homogeneity at a 5% level of significance.

 | French Toast | Pancakes | Waffles | Omelettes |
---|---|---|---|---|

Men | 47 | 35 | 28 | 53 |

Women | 65 | 59 | 55 | 60 |

91. A fisherman is interested in whether the distribution of fish caught in Green Valley Lake is the same as the distribution of fish caught in Echo Lake. Of the 191 randomly selected fish caught in Green Valley Lake, 105 were rainbow trout, 27 were other trout, 35 were bass, and 24 were catfish. Of the 293 randomly selected fish caught in Echo Lake, 115 were rainbow trout, 58 were other trout, 67 were bass, and 53 were catfish. Perform a test for homogeneity at a 5% level of significance.

92. In 2007, the United States had 1.5 million homeschooled students, according to the U.S. National Center for Education Statistics. In the table below you can see that parents decide to homeschool their children for different reasons, and some reasons are ranked by parents as more important than others. According to the survey results shown in the table, is the distribution of applicable reasons the same as the distribution of the most important reason? Provide your assessment at the 5% significance level. Did you expect the result you obtained?

Reasons for Homeschooling | Applicable Reason (in thousands of respondents) | Most Important Reason (in thousands of respondents) | Row Total |
---|---|---|---|

Concern about the environment of other schools | 1,321 | 309 | 1,630 |

Dissatisfaction with academic instruction at other schools | 1,096 | 258 | 1,354 |

To provide religious or moral instruction | 1,257 | 540 | 1,797 |

Child has special needs, other than physical or mental | 315 | 55 | 370 |

Nontraditional approach to child’s education | 984 | 99 | 1,083 |

Other reasons (e.g., finances, travel, family time, etc.) | 485 | 216 | 701 |

Column Total | 5,458 | 1,477 | 6,935 |

93. When looking at energy consumption, we are often interested in detecting trends over time and how they correlate among different countries. The information in the table shows the average energy use (in units of kg of oil equivalent per capita) in the USA and the joint European Union countries (EU) for the six-year period 2005 to 2010. Do the energy use values in these two areas come from the same distribution? Perform the analysis at the 5% significance level.

Year | European Union | United States | Row Total |
---|---|---|---|

2010 | 3,413 | 7,164 | 10,577 |

2009 | 3,302 | 7,057 | 10,359 |

2008 | 3,505 | 7,488 | 10,993 |

2007 | 3,537 | 7,758 | 11,295 |

2006 | 3,595 | 7,697 | 11,292 |

2005 | 3,613 | 7,847 | 11,460 |

Column Total | 20,965 | 45,011 | 65,976 |

94. The Insurance Institute for Highway Safety collects safety information about all types of cars every year, and publishes a report of Top Safety Picks among all cars, makes, and models. The table below presents the number of Top Safety Picks in six car categories for the two years 2009 and 2013. Analyze the table data to conclude whether the distribution of cars that earned the Top Safety Picks safety award has remained the same between 2009 and 2013. Derive your results at the 5% significance level.

Year Car Type | Small | Mid-Size | Large | Small SUV | Mid-Size SUV | Large SUV | Row Total |
---|---|---|---|---|---|---|---|

2009 | 12 | 22 | 10 | 10 | 27 | 6 | 87 |

2013 | 31 | 30 | 19 | 11 | 29 | 4 | 124 |

Column Total | 43 | 52 | 29 | 21 | 56 | 10 | 211 |

95. Which test do you use to decide whether an observed distribution is the same as an expected distribution?

96. What is the null hypothesis for the type of test from exercise 95?

97. Which test would you use to decide whether two factors have a relationship?

98. Which test would you use to decide if two populations have the same distribution?

99. How are tests of independence similar to tests for homogeneity?

100. How are tests of independence different from tests for homogeneity?

*For each word problem, use a solution sheet to solve the hypothesis test problem. Go to Appendix E for the chi-square solution sheet. Round expected frequency to two decimal places.*

101. Is there a difference between the distribution of community college statistics students and the distribution of university statistics students in what technology they use on their homework? Of some randomly selected community college students, 43 used a computer, 102 used a calculator with built in statistics functions, and 65 used a table from the textbook. Of some randomly selected university students, 28 used a computer, 33 used a calculator with built in statistics functions, and 40 used a table from the textbook. Conduct an appropriate hypothesis test using a 0.05 level of significance.

Read the statement and decide whether it is true or false.

102. If *df* = 2, the chi-square distribution has a shape that reminds us of the exponential.
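To explore the statement above numerically, the following sketch evaluates the chi-square density next to an exponential density. It assumes the standard chi-square density formula and an exponential rate of 1/2; the function names are hypothetical and exist only for this example.

```python
import math

def chi_square_pdf(x, df):
    """Chi-square density with df degrees of freedom, for x > 0."""
    k = df / 2
    return x ** (k - 1) * math.exp(-x / 2) / (2 ** k * math.gamma(k))

def exponential_pdf(x, rate):
    """Exponential density with the given rate, for x >= 0."""
    return rate * math.exp(-rate * x)

# Compare the two densities at a few points
for x in (0.5, 1.0, 2.0, 5.0):
    print(x, chi_square_pdf(x, 2), exponential_pdf(x, 0.5))
```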
103.

- Explain why a goodness-of-fit test and a test of independence are generally right-tailed tests.
- If you did a left-tailed test, what would you be testing?

105. State the null and alternative hypotheses.

106. Is this a right-tailed, left-tailed, or two-tailed test?

107. What type of test should be used?

108. State the null and alternative hypotheses.

109. *df* = ________

110. What type of test should be used?

111. What is the test statistic?

112. What is the *p*-value?

113. What can you conclude at the 5% significance level?

114. Is the traveler disputing the claim about the average or about the variance?

115. A sample standard deviation of 15 minutes is the same as a sample variance of __________ minutes.

116. Is this a right-tailed, left-tailed, or two-tailed test?

117. *H*_{0}: __________

118. *df* = ________

119. chi-square test statistic = ________

*120. p*-value = ________

121. Graph the situation. Label and scale the horizontal axis. Mark the mean and test statistic. Shade the *p*-value.

122. Let *α* = 0.05

- Decision: ________
- Conclusion (write out in a complete sentence.): ________

123. How did you know to test the variance instead of the mean?

124. If an additional test were done on the claim of the average delay, which distribution would you use?

125. If an additional test were done on the claim of the average delay, but 45 flights were surveyed, which distribution would you use?

126. A plant manager is concerned her equipment may need recalibrating. It seems that the actual weight of the 15 oz. cereal boxes it fills has been fluctuating. The standard deviation should be at most 0.5 oz. In order to determine if the machine needs to be recalibrated, 84 randomly selected boxes of cereal from the next day’s production were weighed. The standard deviation of the 84 boxes was 0.54. Does the machine need to be recalibrated?
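As a rough guide to setting up a test of a single variance, the sketch below computes the usual chi-square statistic, (*n* − 1)*s*^{2} / *σ*_{0}^{2}, for the cereal-box figures. The function name `variance_test_statistic` is an assumption made for this example, and the decision step is left to the solution sheet.

```python
def variance_test_statistic(n, sample_sd, claimed_sd):
    """Chi-square statistic (n - 1) * s^2 / sigma_0^2, with n - 1 degrees of freedom."""
    return (n - 1) * sample_sd ** 2 / claimed_sd ** 2

n = 84                 # boxes weighed
statistic = variance_test_statistic(n, sample_sd=0.54, claimed_sd=0.5)
df = n - 1

print(statistic, df)
```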

127. Consumers may be interested in whether the cost of a particular calculator varies from store to store. Based on surveying 43 stores, which yielded a sample mean of $84 and a sample standard deviation of $12, test the claim that the standard deviation is greater than $15.

128. Isabella, an accomplished **Bay to Breakers** runner, claims that the standard deviation for her time to run the 7.5 mile race is at most three minutes. To test her claim, Rupinder looks up five of her race times. They are 55 minutes, 61 minutes, 58 minutes, 63 minutes, and 57 minutes.

129. Airline companies are interested in the consistency of the number of babies on each flight, so that they have adequate safety equipment. They are also interested in the variation of the number of babies. Suppose that an airline executive believes the average number of babies on flights is six with a variance of nine at most. The airline conducts a survey. The results of the 18 flights surveyed give a sample average of 6.4 with a sample standard deviation of 3.9. Conduct a hypothesis test of the airline executive’s belief.

130. The number of births per woman in China is 1.6, down from 5.91 in 1966. This fertility rate has been attributed to the law passed in 1979 restricting births to one per woman. Suppose that a group of students studied whether or not the standard deviation of births per woman was greater than 0.75. They asked 50 women across China the number of births they had had. The results are shown in the table below. Does the students’ survey indicate that the standard deviation is greater than 0.75?

# of births | Frequency |
---|---|

0 | 5 |

1 | 30 |

2 | 10 |

3 | 5 |
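Because the data are given as a frequency table, the sample standard deviation has to be computed with each value weighted by its frequency. The sketch below shows one way to do this; the function name `sample_sd_from_frequencies` is hypothetical, and the divisor *n* − 1 assumes the usual sample standard deviation.

```python
import math

def sample_sd_from_frequencies(values, freqs):
    """Sample standard deviation when each value occurs freqs[i] times."""
    n = sum(freqs)
    mean = sum(v * f for v, f in zip(values, freqs)) / n
    sum_sq_dev = sum(f * (v - mean) ** 2 for v, f in zip(values, freqs))
    return math.sqrt(sum_sq_dev / (n - 1))

births = [0, 1, 2, 3]          # number of births
frequency = [5, 30, 10, 5]     # women reporting each value
s = sample_sd_from_frequencies(births, frequency)

print(round(s, 4))
```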

131. According to an avid aquarist, the average number of fish in a 20-gallon tank is 10, with a standard deviation of two. His friend, also an aquarist, does not believe that the standard deviation is two. She counts the number of fish in 15 other 20-gallon tanks. Based on the results that follow, do you think that the standard deviation is different from two? Data: 11; 10; 9; 10; 10; 11; 11; 10; 12; 9; 7; 9; 11; 10; 11

132. The manager of "Frenchies" is concerned that patrons are not consistently receiving the same amount of French fries with each order. The chef claims that the standard deviation for a ten-ounce order of fries is at most 1.5 oz., but the manager thinks that it may be higher. He randomly weighs 49 orders of fries, which yields a mean of 11 oz. and a standard deviation of two oz.

133. You want to buy a specific computer. A sales representative of the manufacturer claims that retail stores sell this computer at an average price of $1,249 with a very narrow standard deviation of $25. You find a website that has a price comparison for the same computer at a series of stores as follows: $1,299; $1,229.99; $1,193.08; $1,279; $1,224.95; $1,229.99; $1,269.95; $1,249. Can you argue that pricing has a larger standard deviation than claimed by the manufacturer? Use the 5% significance level. As a potential buyer, what would be the practical conclusion from your analysis?

134. A company packages apples by weight. One of the weight grades is Class A apples. Class A apples have a mean weight of 150 g, and there is a maximum allowed weight tolerance of 5% above or below the mean for apples in the same consumer package. A batch of apples is selected to be included in a Class A apple package. Given the following apple weights of the batch, does the fruit comply with the Class A grade weight tolerance requirements? Conduct an appropriate hypothesis test.

(a) at the 5% significance level

(b) at the 1% significance level

Weights in selected apple batch (in grams): 158; 167; 149; 169; 164; 139; 154; 150; 157; 171; 152; 161; 141; 166; 172

- Discuss basic ideas of linear regression and correlation.
- Create and interpret a line of best fit.
- Calculate and interpret the correlation coefficient.
- Calculate and interpret outliers.

In this chapter, you will be studying the simplest form of regression, "linear regression" with one independent variable (*x*). This involves data that fits a line in two dimensions. You will also study correlation, which measures how strong the relationship is.

*y* = *a* + *bx*

where *a* and *b* are constant numbers. For example:

*y* = 3 + 2*x*

*y* = –0.01 + 1.2*x*

Is the following an example of a linear equation?

*y* = –0.125 – 3.5*x*

The graph of a linear equation of the form *y* = *a* + *bx* is a **straight line**. Any line that is not vertical can be described by this equation.

Graph the equation *y* = –1 + 2*x*.

- Is the following an example of a linear equation? Why or why not?

Find the equation that expresses the **total cost** in terms of the **number of hours** required to complete the job.

Emma’s Extreme Sports hires hang-gliding instructors and pays them a fee of $50 per class as well as $20 per student in the class. The total cost Emma pays depends on the number of students in a class. Find the equation that expresses the total cost in terms of the number of students in a class.

For the linear equation *y* = *a* + *bx*, *b* = slope and *a* = *y*-intercept. From algebra recall that the slope is a number that describes the steepness of a line, and the *y*-intercept is the *y* coordinate of the point (0, *a*) where the line crosses the *y*-axis.

What are the independent and dependent variables? What is the *y*-intercept and what is the slope? Interpret them using complete sentences.

Ethan repairs household appliances like dishwashers and refrigerators. For each visit, he charges $25 plus $20 per hour of work. A linear equation that expresses the total amount of money Ethan earns per visit is *y* = 25 + 20*x*.

What are the independent and dependent variables? What is the *y*-intercept and what is the slope? Interpret them using complete sentences.
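A short sketch can make the equation *y* = 25 + 20*x* concrete by evaluating it for a few values of *x*. The function name `total_earnings` and its parameter names are assumptions introduced for this example only.

```python
def total_earnings(hours, intercept=25, slope=20):
    """Evaluate the linear equation y = a + b*x with a = 25 and b = 20."""
    return intercept + slope * hours

# Tabulate y for a few values of x
for hours in (0, 1, 2, 3):
    print(hours, total_earnings(hours))
```

Each additional unit of *x* changes *y* by the same amount, which is what makes the relationship linear.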

Data from the Centers for Disease Control and Prevention.

Data from the National Center for HIV, STD, and TB Prevention.

The most basic type of association is a linear association. This type of relationship can be defined algebraically by the equations used, numerically with actual or predicted data values, or graphically from a plotted curve. (Lines are classified as straight curves.) Algebraically, a linear equation typically takes the form **y = mx + b**, where

(year) | (# of users) |
---|---|

2000 | 0.5 |

2002 | 20.0 |

2003 | 33.0 |

2004 | 47.0 |

- Enter your X data into list L1 and your Y data into list L2.
- Press 2nd STATPLOT ENTER to use Plot 1. On the input screen for PLOT 1, highlight On and press ENTER. (Make sure the other plots are OFF.)
- For TYPE: highlight the very first icon, which is the scatter plot, and press ENTER.
- For Xlist:, enter L1 ENTER and for Ylist: L2 ENTER.
- For Mark: it does not matter which symbol you highlight, but the square is the easiest to see. Press ENTER.
- Make sure there are no other equations that could be plotted. Press Y = and clear any equations out.
- Press the ZOOM key and then the number 9 (for menu item "ZoomStat") ; the calculator will fit the window to the data. You can press WINDOW to see the scaling of the axes.

X (hours practicing jump shot) | Y (points scored in a game) |
---|---|

5 | 15 |

7 | 22 |

9 | 28 |

10 | 31 |

11 | 33 |

12 | 36 |

A scatter plot shows the

x (third exam score) | y (final exam score) |
---|---|

65 | 175 |

67 | 133 |

71 | 185 |

71 | 163 |

66 | 126 |

75 | 198 |

67 | 153 |

70 | 163 |

71 | 159 |

69 | 151 |

69 | 159 |

X (depth in feet) | Y (maximum dive time) |
---|---|

50 | 80 |

60 | 55 |

70 | 45 |

80 | 35 |

90 | 25 |

100 | 22 |

The third exam score,

**Remember,** it is always important to plot a scatter diagram first. If the scatter plot indicates that there is a linear relationship between the variables, then it is reasonable to use a best fit line to make predictions for *y* given *x* within the domain of *x*-values in the sample data, **but not necessarily for x-values outside that domain**. You could use the line to predict the final exam score for a student who earned a grade of 73 on the third exam. You should NOT use the line to predict the final exam score for a student who earned a grade of 50 on the third exam, because 50 is not within the domain of the

- In the STAT list editor, enter the X data in list L1 and the Y data in list L2, paired so that the corresponding (*x*, *y*) values are next to each other in the lists. (If a particular pair of values is repeated, enter it as many times as it appears in the data.)
- On the STAT TESTS menu, scroll down with the cursor to select the LinRegTTest. (Be careful to select LinRegTTest, as some calculators may also have a different item called LinRegTInt.)
- On the LinRegTTest input screen enter: Xlist: L1 ; Ylist: L2 ; Freq: 1
- On the next line, at the prompt *β* or *ρ*, highlight "≠ 0" and press ENTER
- Leave the line for "RegEq:" blank
- Highlight Calculate and press ENTER.

- We are assuming your X data is already entered in list L1 and your Y data is in list L2
- Press 2nd STATPLOT ENTER to use Plot 1
- On the input screen for PLOT 1, highlight On, and press ENTER
- For TYPE: highlight the very first icon which is the scatterplot and press ENTER
- Indicate Xlist: L1 and Ylist: L2
- For Mark: it does not matter which symbol you highlight.
- Press the ZOOM key and then the number 9 (for menu item "ZoomStat") ; the calculator will fit the window to the data
- To graph the best-fit line, press the "Y=" key and type the equation –173.5 + 4.83X into equation Y1. (The X key is immediately left of the STAT key). Press ZOOM 9 again to graph it.
- Optional: If you want to change the viewing window, press the WINDOW key. Enter your desired window using Xmin, Xmax, Ymin, Ymax
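The line typed into Y1 in the steps above comes from a least-squares fit. As a cross-check outside the calculator, the following is a minimal Python sketch that recomputes the intercept and slope from the third-exam and final-exam table in this section. The function name `least_squares_line` is a hypothetical helper chosen for illustration, not part of any calculator or library.

```python
# Third-exam (x) and final-exam (y) scores from the table in this section.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]


def least_squares_line(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x that minimizes squared error."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of cross-deviations and squared x-deviations about the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b


a, b = least_squares_line(x, y)
print(f"y-hat = {a:.2f} + {b:.2f}x")
```

Rounded to two decimal places, the result should agree with the coefficients used in the calculator steps.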

(a) A scatter plot showing data with a positive correlation. 0 <

- *r*^{2}, when expressed as a percent, represents the percent of variation in the dependent (predicted) variable *y* that can be explained by variation in the independent (explanatory) variable *x* using the regression (best-fit) line.
- 1 – *r*^{2}, when expressed as a percentage, represents the percent of variation in *y* that is NOT explained by variation in *x* using the regression line. This can be seen as the scattering of the observed data points about the regression line.
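To make the two percentages concrete, here is a small Python sketch that computes *r* and *r*^{2} for the third-exam and final-exam table in this section. The function name `correlation` is a hypothetical helper; the arithmetic is the standard sample correlation formula.

```python
import math

# Third-exam (x) and final-exam (y) scores from the table in this section.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]


def correlation(xs, ys):
    """Sample correlation coefficient r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    s_yy = sum((yi - y_bar) ** 2 for yi in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(x, y)
r_squared = r ** 2
print(f"r = {r:.4f}")
print(f"explained by x:     {100 * r_squared:.1f}% of the variation in y")
print(f"not explained by x: {100 * (1 - r_squared):.1f}% of the variation in y")
```

For these data, a little under half of the variation in final exam scores is explained by the third exam score; the rest shows up as scatter about the line.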

- The symbol for the population correlation coefficient is *ρ*, the Greek letter "rho."
- *ρ* = population correlation coefficient (unknown)
- *r* = sample correlation coefficient (known; calculated from sample data)

- If *r* is significant and the scatter plot shows a linear trend, the line can be used to predict the value of *y* for values of *x* that are within the domain of observed *x* values.
- If *r* is not significant OR if the scatter plot does not show a linear trend, the line should not be used for prediction.
- If *r* is significant and if the scatter plot shows a linear trend, the line may NOT be appropriate or reliable for prediction OUTSIDE the domain of observed *x* values in the data.

- Null Hypothesis: *H*_{0}: *ρ* = 0
- Alternate Hypothesis: *H*_{a}: *ρ* ≠ 0

- Null Hypothesis *H*_{0}: The population correlation coefficient IS NOT significantly different from zero. There IS NOT a significant linear relationship (correlation) between *x* and *y* in the population.
- Alternate Hypothesis *H*_{a}: The population correlation coefficient IS significantly DIFFERENT FROM zero. There IS A SIGNIFICANT LINEAR RELATIONSHIP (correlation) between *x* and *y* in the population.

- Method 1: Using the *p*-value
- Method 2: Using a table of critical values

- On the LinRegTTest input screen, on the line prompt for *β* or *ρ*, highlight "≠ 0"
- The output screen shows the *p*-value on the line that reads "p =".
- (Most computer statistical software can calculate the *p*-value.)

If the *p*-value is less than the significance level *α*:

- Decision: Reject the null hypothesis.
- Conclusion: "There is sufficient evidence to conclude that there is a significant linear relationship between *x* and *y* because the correlation coefficient is significantly different from zero."

If the *p*-value is NOT less than the significance level *α*:

- Decision: DO NOT REJECT the null hypothesis.
- Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship between *x* and *y* because the correlation coefficient is NOT significantly different from zero."
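The two outcomes above come from a single comparison of the *p*-value with the significance level. The sketch below writes that rule as a small Python function. The name `decide` and the default of 0.05 for `alpha` are assumptions for illustration; 0.05 is the level used in this section's exercises.

```python
def decide(p_value, alpha=0.05):
    """Apply the p-value method for testing H0: rho = 0 against Ha: rho != 0."""
    if p_value < alpha:
        return ("Reject H0: there is sufficient evidence of a "
                "significant linear relationship between x and y.")
    return ("Do not reject H0: there is insufficient evidence of a "
            "significant linear relationship between x and y.")


print(decide(0.04))  # p-value below alpha
print(decide(0.06))  # p-value above alpha
```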

- You will use technology to calculate the *p*-value. The following describes the calculations to compute the test statistics and the *p*-value:
- The *p*-value is calculated using a *t*-distribution with *n* – 2 degrees of freedom.
- The formula for the test statistic is [latex]\displaystyle{t}=\frac{{{r}\sqrt{{{n}-{2}}}}}{\sqrt{{{1}-{r}^{{2}}}}}[/latex]. The value of the test statistic, *t*, is shown in the computer or calculator output along with the *p*-value. The test statistic *t* has the same sign as the correlation coefficient *r*.
- The *p*-value is the combined area in both tails.
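The calculation described above can be sketched in Python with only the standard library. The values *r* = 0.6631 and *n* = 11 are the ones obtained from the third-exam and final-exam table in this section. The helper names (`t_statistic`, `t_pdf`, `two_tailed_p`) are hypothetical, and the *p*-value is approximated by numerically integrating the *t* density with Simpson's rule, which stands in for what a calculator or statistics package does internally.

```python
import math


def t_statistic(r, n):
    """Test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2); undefined when r = +/-1."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def t_pdf(u, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + u * u / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Combined area in both tails beyond |t|, via Simpson's rule (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total  # area between -|t| and +|t|
    return max(0.0, 1 - central_area)


r, n = 0.6631, 11
t = t_statistic(r, n)
p = two_tailed_p(t, df=n - 2)
print(f"t = {t:.3f}, df = {n - 2}, p-value = {p:.3f}")
```

Because *t* carries the sign of *r*, the function integrates out to |*t*| and doubles the tail, which matches the "combined area in both tails" description.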

- *r* = –0.567 and the sample size, *n*, is 19. The *df* = *n* – 2 = 17. The critical value is –0.456. –0.567 < –0.456 so *r* is significant.
- *r* = 0.708 and the sample size, *n*, is nine. The *df* = *n* – 2 = 7. The critical value is 0.666. 0.708 > 0.666 so *r* is significant.
- *r* = 0.134 and the sample size, *n*, is 14. The *df* = 14 – 2 = 12. The critical value is 0.532. 0.134 is between –0.532 and 0.532 so *r* is not significant.
- *r* = 0 and the sample size, *n*, is five. No matter what the dfs are, *r* = 0 is between the two critical values so *r* is not significant.
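The critical-value comparisons above all reduce to one check: *r* is significant when it falls outside the interval between the negative and positive critical values. A minimal Python sketch follows, using the critical values quoted in the examples; the function name `is_significant` is a hypothetical helper, and the critical values still have to be looked up in a table for the appropriate *df*.

```python
def is_significant(r, critical_value):
    """r is significant when it lies outside [-critical_value, +critical_value]."""
    return abs(r) > abs(critical_value)


# (r, n, critical value) taken from the worked examples in this section.
examples = [
    (-0.567, 19, 0.456),
    (0.708, 9, 0.666),
    (0.134, 14, 0.532),
]

for r, n, cv in examples:
    verdict = "significant" if is_significant(r, cv) else "not significant"
    print(f"r = {r}, n = {n}, df = {n - 2}, critical value = ±{cv}: {verdict}")
```

Taking absolute values handles the negative case, where *r* = –0.567 is compared with –0.456, with the same code path as the positive cases.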

1. There is a linear relationship in the population that models the average value of *y* for varying values of *x*. In other words, the expected value of *y* for each particular value lies on a straight line in the population. (We do not know the equation for the line for the population. Our regression line from the sample is our best estimate of this line in the population.)
2. The *y* values for any particular *x* value are normally distributed about the line. This implies that there are more *y* values scattered closer to the line than are scattered farther away. Assumption (1) implies that these normal distributions are centered on the line: the means of these normal distributions of *y* values lie on the line.
3. The standard deviations of the population *y* values about the line are equal for each value of *x*. In other words, each of these normal distributions of *y* values has the same shape and spread about the line.
4. The residual errors are mutually independent (no pattern).
5. The data are produced from a well-designed, random sample or randomized experiment.

- **Linear:** In the population, there is a linear relationship that models the average value of *y* for different values of *x*.
- **Independent:** The residuals are assumed to be independent.
- **Normal:** The *y* values are distributed normally for any value of *x*.
- **Equal variance:** The standard deviation of the *y* values is equal for each *x* value.
- **Random:** The data are produced from a well-designed random sample or randomized experiment.

x (third exam score) | y (final exam score) |
---|---|

65 | 175 |

67 | 133 |

71 | 185 |

71 | 163 |

66 | 126 |

75 | 198 |

67 | 153 |

70 | 163 |

71 | 159 |

69 | 151 |

69 | 159 |

Table showing the scores on the final exam based on scores from the third exam.

Scatter plot showing the scores on the final exam based on scores from the third exam.
We examined the scatterplot and showed that the correlation coefficient is significant. We found the equation of the best-fit line for the final exam grade as a function of the grade on the third-exam. We can now use the least-squares regression line for prediction.
Suppose you want to estimate, or predict, the mean final exam score of statistics students who received 73 on the third exam. The exam scores **(*x*-values)** range from 65 to 75.

- What would you predict the final exam score to be for a student who scored a 66 on the third exam?
- What would you predict the final exam score to be for a student who scored a 90 on the third exam?

- 145.27
- The *x* values in the data are between 65 and 75. Ninety is outside of the domain of the observed *x* values in the data (independent variable), so you cannot reliably predict the final exam score for this student. (Even though it is possible to enter 90 into the equation for *x* and calculate a corresponding *y* value, the *y* value that you get will not be reliable.)

  To see how unreliable the prediction can be outside of the *x* values observed in the data, substitute *x* = 90 into the equation:

  [latex]\displaystyle\hat{{y}}=-{173.51}+{4.83}{({90})}={261.19}[/latex]

  The final-exam score is predicted to be 261.19. The largest the final-exam score can be is 200.
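The two answers above can be reproduced with a short Python sketch that evaluates the best-fit line and also reports whether the input lies inside the observed range of third-exam scores. The function name `predict_final` and the constant names are hypothetical; the coefficients and the 65 to 75 range come from this section.

```python
# Best-fit line from this section: y-hat = -173.51 + 4.83x
A, B = -173.51, 4.83
# Observed range of third-exam scores in the sample data.
X_MIN, X_MAX = 65, 75


def predict_final(third_exam):
    """Return (predicted final score, True if x is inside the observed domain)."""
    y_hat = A + B * third_exam
    reliable = X_MIN <= third_exam <= X_MAX
    return y_hat, reliable


for score in (66, 73, 90):
    y_hat, reliable = predict_final(score)
    note = "within observed x values" if reliable else "extrapolation, not reliable"
    print(f"x = {score}: y-hat = {y_hat:.2f} ({note})")
```

Returning the flag alongside the number keeps the arithmetic available while making it hard to overlook that a score of 90 is an extrapolation.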

The *IQR* can help to determine potential outliers. A value is suspected to be a potential outlier if it is more than (1.5)(*IQR*) below the first quartile or more than (1.5)(*IQR*) above the third quartile. Potential outliers always require further investigation.

### Note

A potential outlier is a data point that is significantly different from the other data points. These special data points may be errors or some kind of abnormality or they may be a key to understanding the data.
### Example

For the following 13 real estate prices, calculate the *IQR* and determine if any prices are potential outliers. Prices are in dollars.
389,950; 230,500; 158,000; 479,000; 639,000; 114,950; 5,500,000; 387,000; 659,000; 529,000; 575,000; 488,800; 1,095,000
Solution:
Order the data from smallest to largest.
114,950; 158,000; 230,500; 387,000; 389,950; 479,000; 488,800; 529,000; 575,000; 639,000; 659,000; 1,095,000; 5,500,000
*M* = 488,800
*Q*1 = [latex]\displaystyle\frac{{230{,}500+387{,}000}}{{2}}[/latex] = 308,750
*Q*3 = [latex]\displaystyle\frac{{639{,}000+659{,}000}}{{2}}[/latex] = 649,000
*IQR* = 649,000 – 308,750 = 340,250
(1.5)(*IQR*) = (1.5)(340,250) = 510,375
*Q*1 – (1.5)(*IQR*) = 308,750 – 510,375 = –201,625
*Q*3 + (1.5)(*IQR*) = 649,000 + 510,375 = 1,159,375
No house price is less than –201,625. However, 5,500,000 is more than 1,159,375. Therefore, 5,500,000 is a potential outlier.
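The worked solution above can be checked with a few lines of Python. This sketch assumes the same quartile convention as the solution: with an odd number of values, the overall median is left out and *Q*1 and *Q*3 are the medians of the lower and upper halves. The function names `quartiles` and `potential_outliers` are hypothetical helpers; other software may use a different quartile convention and give slightly different fences.

```python
from statistics import median

# The 13 real estate prices from the example, in dollars.
prices = [389950, 230500, 158000, 479000, 639000, 114950, 5500000,
          387000, 659000, 529000, 575000, 488800, 1095000]


def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data."""
    s = sorted(data)
    half = len(s) // 2
    lower = s[:half]
    # Skip the middle value when the count is odd.
    upper = s[half + 1:] if len(s) % 2 else s[half:]
    return median(lower), median(upper)


def potential_outliers(data):
    """Return (values outside the fences, (lower fence, upper fence))."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    flagged = [v for v in data if v < low or v > high]
    return flagged, (low, high)


q1, q3 = quartiles(prices)
flagged, (low, high) = potential_outliers(prices)
print(f"Q1 = {q1:,.0f}, Q3 = {q3:,.0f}, IQR = {q3 - q1:,.0f}")
print(f"fences: {low:,.0f} to {high:,.0f}")
print(f"potential outliers: {flagged}")
```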

1. What are the dependent and independent variables?

2. Find the equation that expresses the total fee in terms of the number of hours the equipment is rented.

3. Graph the equation from 2.

4. Find the equation that expresses the total fee in terms of the number of days the payment is late.

5. Graph the equation from 4.
6. Is the equation *y* = 10 + 5*x* – 3*x*^{2} linear? Why or why not?

7. Which of the following equations are linear?

a. *y* = 6*x* + 8

b. *y* + 7 = 3*x*

c. *y* – *x* = 8*x*^{2}

d. 4*y* = 8

8. Does the graph show a linear equation? Why or why not?

Year | # AIDS cases diagnosed | # AIDS deaths |

Pre-1981 | 91 | 29 |

1981 | 319 | 121 |

1982 | 1,170 | 453 |

1983 | 3,076 | 1,482 |

1984 | 6,240 | 3,466 |

1985 | 11,776 | 6,878 |

1986 | 19,032 | 11,987 |

1987 | 28,564 | 16,162 |

1988 | 35,447 | 20,868 |

1989 | 42,674 | 27,591 |

1990 | 48,634 | 31,335 |

1991 | 59,660 | 36,560 |

1992 | 78,530 | 41,055 |

1993 | 78,834 | 44,730 |

1994 | 71,874 | 49,095 |

1995 | 68,505 | 49,456 |

1996 | 59,347 | 38,510 |

1997 | 47,149 | 20,736 |

1998 | 38,393 | 19,005 |

1999 | 25,174 | 18,454 |

2000 | 25,522 | 17,347 |

2001 | 25,643 | 17,402 |

2002 | 26,464 | 16,371 |

Total | 802,118 | 489,093 |

9. Use the columns “year” and “# AIDS cases diagnosed.” Why is “year” the independent variable and “# AIDS cases diagnosed” the dependent variable (instead of the reverse)?

10. What are the independent and dependent variables?

11. What is the *y*-intercept and what is the slope? Interpret them using complete sentences.

12. What are the independent and dependent variables?

13. How many pounds of soil does the shoreline lose in a year?

14. What is the *y*-intercept? Interpret its meaning.

15. What are the slope and *y*-intercept? Interpret their meaning.

16. If you owned this stock, would you want a positive or negative slope? Why?

17. For each of the following situations, state the independent variable and the dependent variable.

- A study is done to determine if elderly drivers are involved in more motor vehicle fatalities than other drivers. The number of fatalities per 100,000 drivers is compared to the age of drivers.
- A study is done to determine if the weekly grocery bill changes based on the number of family members.
- Insurance companies base life insurance premiums partially on the age of the applicant.
- Utility bills vary according to power consumption.
- A study is done to determine if a higher education reduces the crime rate in a population.

18. Piece-rate systems are widely debated incentive payment plans. In a recent study of loan officer effectiveness, the following piece-rate system was examined:

% of goal reached | < 80 | 80 | 100 | 120 |

Incentive | n/a | $4,000 with an additional $125 added per percentage point from 81–99% | $6,500 with an additional $125 added per percentage point from 101–119% | $9,500 with an additional $125 added per percentage point starting at 121% |

19. If a loan officer makes 95% of his or her goal, write the linear function that applies based on the incentive plan table. In context, explain the *y*-intercept and slope.

20. Does the scatter plot appear linear? Strong or weak? Positive or negative?

21. Does the scatter plot appear linear? Strong or weak? Positive or negative?

22. Does the scatter plot appear linear? Strong or weak? Positive or negative?

23. The Gross Domestic Product Purchasing Power Parity is an indication of a country’s currency value compared to another country. The table below shows the GDP PPP of Cuba as compared to US dollars. Construct a scatter plot of the data.

Year | Cuba’s PPP | Year | Cuba’s PPP |
---|---|---|---|

1999 | 1,700 | 2006 | 4,000 |

2000 | 1,700 | 2007 | 11,000 |

2002 | 2,300 | 2008 | 9,500 |

2003 | 2,900 | 2009 | 9,700 |

2004 | 3,000 | 2010 | 9,900 |

2005 | 3,500 |

24. The following table shows the poverty rates and cell phone usage in the United States. Construct a scatter plot of the data

Year | Poverty Rate | Cellular Usage per Capita |
---|---|---|

2003 | 12.7 | 54.67 |

2005 | 12.6 | 74.19 |

2007 | 12 | 84.86 |

2009 | 12 | 90.82 |

25. Does the higher cost of tuition translate into higher-paying jobs? The table lists the top ten colleges based on mid-career salary and the associated yearly tuition costs. Construct a scatter plot of the data.

School | Mid-Career Salary (in thousands) | Yearly Tuition |
---|---|---|

Princeton | 137 | 28,540 |

Harvey Mudd | 135 | 40,133 |

CalTech | 127 | 39,900 |

US Naval Academy | 122 | 0 |

West Point | 120 | 0 |

MIT | 118 | 42,050 |

Lehigh University | 118 | 43,220 |

NYU-Poly | 117 | 39,565 |

Babson College | 117 | 40,400 |

Stanford | 114 | 54,506 |

26. If the level of significance is 0.05 and the *p*-value is 0.06, what conclusion can you draw?

27. If there are 15 data points in a set of data, what is the number of degree of freedom?

x | y | x | y |
---|---|---|---|

0 | 2 | 5 | 12 |

3 | 8 | 4 | 9 |

2 | 7 | 3 | 9 |

1 | 3 | 0 | 3 |

5 | 13 | 4 | 10 |

28. Draw a scatter plot of the data.

29. Use regression to find the equation for the line of best fit.

30. Draw the line of best fit on the scatter plot.

31. What is the slope of the line of best fit? What does it represent?

32. What is the *y*-intercept of the line of best fit? What does it represent?

33. What does an *r* value of zero mean?

34. When *n* = 2 and *r* = 1, are the data significant? Explain.

35. When *n* = 100 and *r* = -0.89, is there a significant correlation? Explain.
36. What is the process through which we can calculate a line that goes through a scatter plot with a linear pattern?

37. Explain what it means when a correlation has an *r*^{2} of 0.72.

38. Can a coefficient of determination be negative? Why or why not?

39. When testing the significance of the correlation coefficient, what is the null hypothesis?

40. When testing the significance of the correlation coefficient, what is the alternative hypothesis?

41. If the level of significance is 0.05 and the *p*-value is 0.04, what conclusion can you draw?

42. If the level of significance is 0.05 and the *p*-value is 0.06, what conclusion can you draw?

43. If there are 15 data points in a set of data, what is the number of degree of freedom?

*Use the following information to answer the next two exercises*. An electronics retailer used regression to find a simple model to predict sales growth in the first quarter of the new year (January through March). The model is good for 90 days, where *x* is the day. The model can be written as follows:
*ŷ* = 101.32 + 2.48*x* where *ŷ* is in thousands of dollars.

44. What would you predict the sales to be on day 60?

45. What would you predict the sales to be on day 90?

46. How many acres will be left to mow after 20 hours of work?

47. How many acres will be left to mow after 100 hours of work?

48. How many hours will it take to mow all of the lawns? (When is *ŷ* = 0?)

Year | # AIDS cases diagnosed | # AIDS deaths |

Pre-1981 | 91 | 29 |

1981 | 319 | 121 |

1982 | 1,170 | 453 |

1983 | 3,076 | 1,482 |

1984 | 6,240 | 3,466 |

1985 | 11,776 | 6,878 |

1986 | 19,032 | 11,987 |

1987 | 28,564 | 16,162 |

1988 | 35,447 | 20,868 |

1989 | 42,674 | 27,591 |

1990 | 48,634 | 31,335 |

1991 | 59,660 | 36,560 |

1992 | 78,530 | 41,055 |

1993 | 78,834 | 44,730 |

1994 | 71,874 | 49,095 |

1995 | 68,505 | 49,456 |

1996 | 59,347 | 38,510 |

1997 | 47,149 | 20,736 |

1998 | 38,393 | 19,005 |

1999 | 25,174 | 18,454 |

2000 | 25,522 | 17,347 |

2001 | 25,643 | 17,402 |

2002 | 26,464 | 16,371 |

Total | 802,118 | 489,093 |

49. Graph “year” versus “# AIDS cases diagnosed” (plot the scatter plot). Do not include pre-1981 data.

50. Perform linear regression. What is the linear equation? Round to the nearest whole number.

51. Write the equations:

Linear equation: __________

52. Solve.

When *x* = 1985, *ŷ* = _____

When *x* = 1990, *ŷ* =_____

When *x* = 1970, *ŷ* =______ Why doesn’t this answer make sense?

53. Does the line seem to fit the data? Why or why not?

54. What does the correlation imply about the relationship between time (years) and the number of diagnosed AIDS cases reported in the U.S.?

55. Plot the two given points on the following graph. Then, connect the two points to form the regression line.

Obtain the graph on your calculator or computer.

56. Write the equation: *ŷ*= ____________

57. Hand draw a smooth curve on the graph that shows the flow of the data.

58. Does the line seem to fit the data? Why or why not?

59. Do you think a linear fit is best? Why or why not?

60. What does the correlation imply about the relationship between time (years) and the number of diagnosed AIDS cases reported in the U.S.?

61. Graph “year” vs. “# AIDS cases diagnosed.” Do not include pre-1981. Label both axes with words. Scale both axes.

62. Enter your data into your calculator or computer. The pre-1981 data should not be included. Why is that so?
63. Write the linear equation, rounding to four decimal places:

64. Calculate the following:

- *a* = _____
- *b* = _____
- correlation = _____
- *n* = _____

65. Recently, the annual number of driver deaths per 100,000 for the selected age groups was as follows:

Age | Number of Driver Deaths per 100,000 |
---|---|

17.5 | 38 |

22 | 36 |

29.5 | 24 |

44.5 | 20 |

64.5 | 18 |

80 | 28 |

- For each age group, pick the midpoint of the interval for the *x* value. (For the 75+ group, use 80.)
- Using “ages” as the independent variable and “Number of driver deaths per 100,000” as the dependent variable, make a scatter plot of the data.
- Calculate the least squares (best-fit) line. Put the equation in the form of: *ŷ* = *a* + *bx*
- Find the correlation coefficient. Is it significant?
- Predict the number of deaths for ages 40 and 60.
- Based on the given data, is there a linear relationship between age of a driver and driver fatality rate?
- What is the slope of the least squares (best-fit) line? Interpret the slope.

66. The table below shows the life expectancy for an individual born in the United States in certain years.

Year of Birth | Life Expectancy |
---|---|

1930 | 59.7 |

1940 | 62.9 |

1950 | 70.2 |

1965 | 69.7 |

1973 | 71.4 |

1982 | 74.5 |

1987 | 75 |

1992 | 75.7 |

2010 | 78.7 |

- Decide which variable should be the independent variable and which should be the dependent variable.
- Draw a scatter plot of the ordered pairs.
- Calculate the least squares line. Put the equation in the form of: *ŷ* = *a* + *bx*
- Find the correlation coefficient. Is it significant?
- Find the estimated life expectancy for an individual born in 1950 and for one born in 1982.
- Why aren’t the answers to part e the same as the values in Table that correspond to those years?
- Use the two points in part e to plot the least squares line on your graph from part b.
- Based on the data, is there a linear relationship between the year of birth and life expectancy?
- Are there any outliers in the data?
- Using the least squares line, find the estimated life expectancy for an individual born in 1850. Does the least squares line give an accurate estimate for that year? Explain why or why not.
- What is the slope of the least-squares (best-fit) line? Interpret the slope.

67. The maximum discount value of the Entertainment® card for the “Fine Dining” section, Edition ten, for various pages is given in the table below.

Page number | Maximum value ($) |
---|---|

4 | 16 |

14 | 19 |

25 | 15 |

32 | 17 |

43 | 19 |

57 | 15 |

72 | 16 |

85 | 15 |

90 | 17 |

- Decide which variable should be the independent variable and which should be the dependent variable.
- Draw a scatter plot of the ordered pairs.
- Calculate the least-squares line. Put the equation in the form of: *ŷ* = *a* + *bx*
- Find the correlation coefficient. Is it significant?
- Find the estimated maximum values for the restaurants on page ten and on page 70.
- Does it appear that the restaurants giving the maximum value are placed in the beginning of the “Fine Dining” section? How did you arrive at your answer?
- Suppose that there were 200 pages of restaurants. What do you estimate to be the maximum value for a restaurant listed on page 200?
- Is the least squares line valid for page 200? Why or why not?
- What is the slope of the least-squares (best-fit) line? Interpret the slope.

68. The table below gives the gold medal times for every other Summer Olympics for the women’s 100-meter freestyle (swimming).

Year | Time (seconds) |
---|---|

1912 | 82.2 |

1924 | 72.4 |

1932 | 66.8 |

1952 | 66.8 |

1960 | 61.2 |

1968 | 60.0 |

1976 | 55.65 |

1984 | 55.92 |

1992 | 54.64 |

2000 | 53.8 |

2008 | 53.1 |

- Decide which variable should be the independent variable and which should be the dependent variable.
- Draw a scatter plot of the data.
- Does it appear from inspection that there is a relationship between the variables? Why or why not?
- Calculate the least squares line. Put the equation in the form of: *ŷ* = *a* + *bx*.
- Find the correlation coefficient. Is the decrease in times significant?
- Find the estimated gold medal time for 1932. Find the estimated time for 1984.
- Why are the answers from part f different from the chart values?
- Does it appear that a line is the best way to fit the data? Why or why not?
- Use the least-squares line to estimate the gold medal time for the next Summer Olympics. Do you think that your answer is reasonable? Why or why not?

State | # letters in name | Year entered the Union | Rank for entering the Union | Area (square miles) |
---|---|---|---|---|

Alabama | 7 | 1819 | 22 | 52,423 |

Colorado | 8 | 1876 | 38 | 104,100 |

Hawaii | 6 | 1959 | 50 | 10,932 |

Iowa | 4 | 1846 | 29 | 56,276 |

Maryland | 8 | 1788 | 7 | 12,407 |

Missouri | 8 | 1821 | 24 | 69,709 |

New Jersey | 9 | 1787 | 3 | 8,722 |

Ohio | 4 | 1803 | 17 | 44,828 |

South Carolina | 13 | 1788 | 8 | 32,008 |

Utah | 4 | 1896 | 45 | 84,904 |

Wisconsin | 9 | 1848 | 30 | 65,499 |

- Decide which variable should be the independent variable and which should be the dependent variable.
- Draw a scatter plot of the data.
- Does it appear from inspection that there is a relationship between the variables? Why or why not?
- Calculate the least-squares line. Put the equation in the form of: *ŷ* = *a* + *bx*.
- Find the correlation coefficient. What does it imply about the significance of the relationship?
- Find the estimated number of letters (to the nearest integer) a state would have if it entered the Union in 1900. Find the estimated number of letters a state would have if it entered the Union in 1940.
- Does it appear that a line is the best way to fit the data? Why or why not?
- Use the least-squares line to estimate the number of letters a new state that enters the Union this year would have. Can the least squares line be used to predict it? Why or why not?

*Use the following information to answer the next four exercises.* The scatter plot shows the relationship between hours spent studying and exam scores. The line shown is the calculated line of best fit. The correlation coefficient is 0.69.

70. Do there appear to be any outliers?

71. A point is removed, and the line of best fit is recalculated. The new correlation coefficient is 0.98. Does the point appear to have been an outlier? Why?

72. What effect did the potential outlier have on the line of best fit?

73. Are you more or less confident in the predictive ability of the new line of best fit?

74. The Sum of Squared Errors for a data set of 18 numbers is 49. What is the standard deviation?

75. The Standard Deviation for the Sum of Squared Errors for a data set is 9.8. What is the cutoff for the vertical distance that a point can be from the line of best fit to be considered an outlier?

76. The height (sidewalk to roof) of notable tall buildings in America is compared to the number of stories of the building (beginning at street level).

Height (in feet) | Stories |
---|---|

1,050 | 57 |

428 | 28 |

362 | 26 |

529 | 40 |

790 | 60 |

401 | 22 |

380 | 38 |

1,454 | 110 |

1,127 | 100 |

700 | 46 |

- Using “stories” as the independent variable and “height” as the dependent variable, make a scatter plot of the data.
- Does it appear from inspection that there is a relationship between the variables?
- Calculate the least squares line. Put the equation in the form of: *ŷ* = *a* + *bx*
- Find the correlation coefficient. Is it significant?
- Find the estimated heights for 32 stories and for 94 stories.
- Based on the data in Table, is there a linear relationship between the number of stories in tall buildings and the height of the buildings?
- Are there any outliers in the data? If so, which point(s)?
- What is the estimated height of a building with six stories? Does the least squares line give an accurate estimate of height? Explain why or why not.
- Based on the least squares line, adding an extra story is predicted to add about how many feet to a building?
- What is the slope of the least squares (best-fit) line? Interpret the slope.

77. Ornithologists, scientists who study birds, tag sparrow hawks in 13 different colonies to study their population. They gather data for the percent of new sparrow hawks in each colony and the percent of those that have returned from migration.

**Percent return:** 74; 66; 81; 52; 73; 62; 52; 45; 62; 46; 60; 46; 38

- Enter the data into your calculator and make a scatter plot.
- Use your calculator’s regression function to find the equation of the least-squares regression line. Add this to your scatter plot from part a.
- Explain in words what the slope and y-intercept of the regression line tell us.
- How well does the regression line fit the data? Explain your response.
- Which point has the largest residual? Explain what the residual means in context. Is this point an outlier? An influential point? Explain.
- An ecologist wants to predict how many birds will join another colony of sparrow hawks to which 70% of the adults from the previous year have returned. What is the prediction?

78. The following table shows data on average per capita wine consumption and heart disease rate in a random sample of 10 countries.

Yearly wine consumption in liters | 2.5 | 3.9 | 2.9 | 2.4 | 2.9 | 0.8 | 9.1 | 2.7 | 0.8 | 0.7 |

Death from heart diseases | 221 | 167 | 131 | 191 | 220 | 297 | 71 | 172 | 211 | 300 |

- Enter the data into your calculator and make a scatter plot.
- Use your calculator’s regression function to find the equation of the least-squares regression line. Add this to your scatter plot from part a.
- Explain in words what the slope and y-intercept of the regression line tell us.
- How well does the regression line fit the data? Explain your response.
- Which point has the largest residual? Explain what the residual means in context. Is this point an outlier? An influential point? Explain.
- Do the data provide convincing evidence that there is a linear relationship between the amount of alcohol consumed and the heart disease death rate? Carry out an appropriate test at a significance level of 0.05 to help answer this question.

79. The following table consists of one student athlete’s time (in minutes) to swim 2000 yards and the student’s heart rate (beats per minute) after swimming on a random sample of 10 days:

Swim Time | Heart Rate |
---|---|

34.12 | 144 |

35.72 | 152 |

34.72 | 124 |

34.05 | 140 |

34.13 | 152 |

35.73 | 146 |

36.17 | 128 |

35.57 | 136 |

35.37 | 144 |

35.57 | 148 |

- Enter the data into your calculator and make a scatter plot.
- Use your calculator’s regression function to find the equation of the least-squares regression line. Add this to your scatter plot from part a.
- Explain in words what the slope and y-intercept of the regression line tell us.
- How well does the regression line fit the data? Explain your response.
- Which point has the largest residual? Explain what the residual means in context. Is this point an outlier? An influential point? Explain.

80. A researcher is investigating whether non-white minorities commit a disproportionate number of homicides. He uses demographic data from Detroit, MI to compare homicide rates and the number of the population that are white males.

White Males | Homicide rate per 100,000 people |
---|---|

558,724 | 8.6 |

538,584 | 8.9 |

519,171 | 8.52 |

500,457 | 8.89 |

482,418 | 13.07 |

465,029 | 14.57 |

448,267 | 21.36 |

432,109 | 28.03 |

416,533 | 31.49 |

401,518 | 37.39 |

387,046 | 46.26 |

373,095 | 47.24 |

359,647 | 52.33 |

- Use your calculator to construct a scatter plot of the data. What should the independent variable be? Why?
- Use your calculator’s regression function to find the equation of the least-squares regression line. Add this to your scatter plot.
- Discuss what the following mean in context.
  - The slope of the regression equation
  - The *y*-intercept of the regression equation
  - The correlation *r*
  - The coefficient of determination *r*^{2}

- Do the data provide convincing evidence that there is a linear relationship between the number of white males in the population and the homicide rate? Carry out an appropriate test at a significance level of 0.05 to help answer this question.

School | Mid-Career Salary (in thousands) | Yearly Tuition |
---|---|---|

Princeton | 137 | 28,540 |

Harvey Mudd | 135 | 40,133 |

CalTech | 127 | 39,900 |

US Naval Academy | 122 | 0 |

West Point | 120 | 0 |

MIT | 118 | 42,050 |

Lehigh University | 118 | 43,220 |

NYU-Poly | 117 | 39,565 |

Babson College | 117 | 40,400 |

Stanford | 114 | 54,506 |

81. Use the data to determine the linear-regression line equation with the outliers removed. Is there a linear correlation for the data set with outliers removed? Justify your answer.

82. The average number of people in a family that received welfare for various years is given in Table.

Year | Welfare family size |
---|---|

1969 | 4.0 |

1973 | 3.6 |

1975 | 3.2 |

1979 | 3.0 |

1983 | 3.0 |

1988 | 3.0 |

1991 | 2.9 |

- Using “year” as the independent variable and “welfare family size” as the dependent variable, draw a scatter plot of the data.
- Calculate the least-squares line. Put the equation in the form of: *ŷ* = *a* + *bx*
- Find the correlation coefficient. Is it significant?
- Pick two years between 1969 and 1991 and find the estimated welfare family sizes.
- Based on the data in Table, is there a linear relationship between the year and the average number of people in a welfare family?
- Using the least-squares line, estimate the welfare family sizes for 1960 and 1995. Does the least-squares line give an accurate estimate for those years? Explain why or why not.
- Are there any outliers in the data?
- What is the estimated average welfare family size for 1986? Does the least squares line give an accurate estimate for that year? Explain why or why not.
- What is the slope of the least squares (best-fit) line? Interpret the slope.
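
The difference between estimating inside the observed years and estimating outside them can be sketched in a few lines of Python. This is an illustrative check, not a prescribed method. The helper names `predict` and `in_data_range` are assumptions introduced here.

```python
# Data copied from the table above.
years = [1969, 1973, 1975, 1979, 1983, 1988, 1991]
family_size = [4.0, 3.6, 3.2, 3.0, 3.0, 3.0, 2.9]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(family_size) / n
sxx = sum((x - mean_x) ** 2 for x in years)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, family_size))

slope = sxy / sxx
intercept = mean_y - slope * mean_x


def predict(year):
    """Estimated welfare family size from the least-squares line."""
    return intercept + slope * year


def in_data_range(year):
    """True when the year lies between the smallest and largest observed x."""
    return min(years) <= year <= max(years)


for year in (1960, 1986, 1995):
    label = "interpolation" if in_data_range(year) else "extrapolation"
    print(f"{year}: estimated family size {predict(year):.2f} ({label})")
```

Estimates labeled as extrapolation fall outside the years in the table, so the line gives no assurance that they are accurate.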

83. The percent of female wage and salary workers who are paid hourly rates is given in Table for the years 1979 to 1992.

Year | Percent of workers paid hourly rates |
---|---|

1979 | 61.2 |

1980 | 60.7 |

1981 | 61.3 |

1982 | 61.3 |

1983 | 61.8 |

1984 | 61.7 |

1985 | 61.8 |

1986 | 62.0 |

1987 | 62.7 |

1990 | 62.8 |

1992 | 62.9 |

- Using “year” as the independent variable and “percent” as the dependent variable, draw a scatter plot of the data.
- Does it appear from inspection that there is a relationship between the variables? Why or why not?
- Calculate the least-squares line. Put the equation in the form *ŷ* = *a* + *bx*.
- Find the correlation coefficient. Is it significant?
- Find the estimated percents for 1991 and 1988.
- Based on the data, is there a linear relationship between the year and the percent of female wage and salary earners who are paid hourly rates?
- Are there any outliers in the data?
- What is the estimated percent for the year 2050? Does the least-squares line give an accurate estimate for that year? Explain why or why not.
- What is the slope of the least-squares (best-fit) line? Interpret the slope.
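
One common rule of thumb for the outlier question is to flag any point whose residual is more than two standard deviations of the residuals away from the line, with *s* = √(SSE / (*n* − 2)). The sketch below applies that rule to the data above. The two-standard-deviation cutoff is an assumption here, since the exercise itself does not name a rule.

```python
import math

# Data copied from the table above.
years = [1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1990, 1992]
percent_hourly = [61.2, 60.7, 61.3, 61.3, 61.8, 61.7, 61.8, 62.0,
                  62.7, 62.8, 62.9]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(percent_hourly) / n
sxx = sum((x - mean_x) ** 2 for x in years)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, percent_hourly))
slope = sxy / sxx

# Residuals y - y-hat, written in centered form to limit rounding error.
residuals = [(y - mean_y) - slope * (x - mean_x)
             for x, y in zip(years, percent_hourly)]

# Standard deviation of the residuals: s = sqrt(SSE / (n - 2)).
sse = sum(e ** 2 for e in residuals)
s = math.sqrt(sse / (n - 2))

# Assumed rule: a point is a potential outlier if |residual| > 2s.
flagged = [(x, y) for x, y, e in zip(years, percent_hourly, residuals)
           if abs(e) > 2 * s]

print(f"slope = {slope:.4f}, s = {s:.4f}")
print("Potential outliers:", flagged)
```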

Size (ounces) | Cost ($) | Cost per ounce |
---|---|---|

16 | 3.99 | |

32 | 4.99 | |

64 | 5.99 | |

200 | 10.99 | |

84.

- Using “size” as the independent variable and “cost” as the dependent variable, draw a scatter plot.
- Does it appear from inspection that there is a relationship between the variables? Why or why not?
- Calculate the least-squares line. Put the equation in the form *ŷ* = *a* + *bx*.
- Find the correlation coefficient. Is it significant?
- If the laundry detergent were sold in a 40-ounce size, find the estimated cost.
- If the laundry detergent were sold in a 90-ounce size, find the estimated cost.
- Does it appear that a line is the best way to fit the data? Why or why not?
- Are there any outliers in the given data?
- Is the least-squares line valid for predicting what a 300-ounce size of the laundry detergent would cost? Why or why not?
- What is the slope of the least-squares (best-fit) line? Interpret the slope.

- Complete Table for the cost per ounce of the different sizes.
- Using “size” as the independent variable and “cost per ounce” as the dependent variable, draw a scatter plot of the data.
- Does it appear from inspection that there is a relationship between the variables? Why or why not?
- Calculate the least-squares line. Put the equation in the form *ŷ* = *a* + *bx*.
- Find the correlation coefficient. Is it significant?
- If the laundry detergent were sold in a 40-ounce size, find the estimated cost per ounce.
- If the laundry detergent were sold in a 90-ounce size, find the estimated cost per ounce.
- Does it appear that a line is the best way to fit the data? Why or why not?
- Are there any outliers in the data?
- Is the least-squares line valid for predicting what a 300-ounce size of the laundry detergent would cost per ounce? Why or why not?
- What is the slope of the least-squares (best-fit) line? Interpret the slope.
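
The cost-per-ounce column is each cost divided by its size. As a hedged illustration, the Python sketch below fills in that column and fits a least-squares line to it. The variable names are chosen here for readability and are not part of the exercise.

```python
# Data copied from the detergent table above.
sizes = [16, 32, 64, 200]            # ounces
costs = [3.99, 4.99, 5.99, 10.99]    # dollars

# Cost per ounce = cost / size for each row of the table.
cost_per_ounce = [c / s for c, s in zip(costs, sizes)]

for s, c, cpo in zip(sizes, costs, cost_per_ounce):
    print(f"{s:>4} oz  ${c:>5.2f}  ${cpo:.4f} per ounce")

# Least-squares line with size as x and cost per ounce as y.
n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(cost_per_ounce) / n
sxx = sum((x - mean_x) ** 2 for x in sizes)
sxy = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(sizes, cost_per_ounce))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"y-hat = {intercept:.4f} + ({slope:.6f})x")
```

With only four points, one of them far from the rest, a scatter plot is still needed to judge whether a line is a sensible model.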