Statistics (Math 040-05)

Rooms vary depending on day of the week: M STM G40; TW REI 283; F REI 262

You can reach me by phone (or voice-mail) at 7-2703 or by e-mail kainen at georgetown.edu or drop by Reiss 258 during my office hours.

****      Last updated     ****


         May 8, 2006 


     **********************

Other general information, including information for other courses, is on the index (classroom) page. The following link takes you to course background information.

Office Hours: Thurs. May 11, 3 to 6 pm; Fri. May 12, 1 to 3 pm.

Location: Reiss 258 (or outdoors in front of Reiss on a nice day!)

The Math Assistance Center (Reiss 256, Sun. - Thurs. evenings) will open soon. It usually covers only courses through 036, but I will enquire whether help can be made available for Math 040 as well.


Please remember to always use the "reload" feature to be sure you have the current version of this page! Homework, tests, etc., which are not picked up in class will be in the box outside my door (Reiss 258).

The old stats homework is located here. Remember to use the odd-numbered problems for review. There will be a review session on May 10 from 9 to 10:30 pm in Reiss 283. The final exam is listed on the registrar's site as follows (note that it is in White-Gravenor!):

MATH-040-05 WGR 208 Kainen P Sat, May 13 9:00-11:00AM 

In response to a question from Daniel Grasso:


If you look at the histogram on p. 458, it allows you to reconstruct
the original dataset: 20, 22.5, 25,25,27.5,27.5,27.5, etc., ..., 37.5
(a total of n = 1 + 1 + 2 + 5 + 5 + 5 + 3 + 1 = 23 in the sample), so,
calculating y-bar, you should get 31, etc., as they claim.  Does that
help?

Your question was good and I'm going to post it on our website to help
others.


> professor, 
> 
> I am drawing a blank as to how to find (s) for a one-sample t-
> interval.  On page 459 there is a problem about car speeds and it 
> states that the s=4.25 mph.  I am having difficulty understanding 
> how they found that number.  If you have time i would really 
> appreciate it if you can re-explain this via a quick email. 
> 
> Thanks,
> 
> Daniel
> 
> 

Here is an example of a chi-square problem:

Suppose you have seven different car types
and three different groups (executives,
middle management, and technical staff).
If you count the numbers of each type of 
car owned by each of the three groups you
get a table with 7 rows and 3 columns.  To
get the expected number of car-type-1 owned
by executives exp(1,exec), use the following
equation

  exp(1,exec) = frac(exec) * tot # of car-type-1

where frac(exec) = (# of execs)/N,

N = # of execs + # of midmgt + # of tech staff,

tot # of car-type-1 = # of car-type-1 observed
                        by the three groups

So if the execs constitute 1/3 of N and 45 cars
of type 1 are counted, one expects that 15 of
them will be owned by execs.  If actually 20
are owned by execs, then the corresponding
residual for that cell of the table will have
the value 5.  To calculate chi-square, you take
the residual for each cell, square it, and then
divide by the expected number.  For the cell
corresponding to execs and car-type-1, you would
get 25/15 = 1.67.  Now add these numbers over 
all the 21 cells (7 * 3) and that's the 
chi-square value.  
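If you'd like to check this arithmetic on a computer, here is a short Python sketch of the single-cell computation described above. The numbers (execs are 1/3 of N, 45 cars of type 1, 20 observed) are the ones from the example; the function and variable names are my own.

```python
# One cell's contribution to chi-square: (obs - exp)^2 / exp.
def cell_contribution(observed, expected):
    residual = observed - expected
    return residual ** 2 / expected

# Expected count for (car-type-1, exec): frac(exec) * tot # of car-type-1
frac_exec = 1 / 3
total_type1 = 45
expected = frac_exec * total_type1   # 15 cars expected

observed = 20                        # 20 actually owned by execs
print(cell_contribution(observed, expected))   # about 1.67, i.e. 25/15
```

Summing such contributions over all 21 cells gives the chi-square value.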

Now suppose that the value you get is 30.9.
Should you reject the null hypothesis that all
three groups have homogeneous choice of cars,
in favor of the alternative hypoth: the three
groups don't have homogeneous choice of cars?
Suppose you need to keep alpha at most .05.

Try it first before you look!

























Spoiler below ...




























First, to decide if the chi-square test is
even appropriate, let's assume that the usual
blather about randomness is ok.  There is one
serious condition: the expected counts should
all be at least 5.  Let's assume that's ok, too.

Now all you need to answer this is to find the
number of degrees of freedom (d.f.).  For a
table with r rows and c columns, d.f. is (r-1)
times (c-1) so the d.f. is 6*2 = 12.  Since
alpha = .05, using the table in our text, the
critical chi-square value is 21.0. Hence, if 
the value you calculate from the table is more
than 21.0, then you reject the null hypoth.

Since 30.9 > 21.0, you reject H_0.

Thus, according to our chi-square test, the 
type of car is not indep. of the employment 
type (exec, midmgt, tech).

Note that for a chi-square goodness-of-fit test, 
the d.f. is just n-1 where n is # of cells.  

For such a hypoth-testing problem, like the others we have studied, you need to know:

(1) What is the test statistic? (i.e., z, t, or chi-square)

(2) How do I calculate the test statistic? E.g., for chi-square, you take the sum of (obs-exp)^2/exp over all the cells, while for a t-test of a hypoth. about the mean, you take X-bar - mu_0 and divide by SE(X-bar).

(3) How do I calculate the critical value of the test stat so that rejecting H_0 based on being above (or below) the critical value has at most prob. alpha of rejecting H_0 if, in fact, H_0 is true? E.g., if X-bar involves a sample of size n, then look up the t-value with n-1 d.f. corresponding to prob. alpha, using 1-tail or 2-tail depending on whether the hypoth. test is 1-tail or 2-tail (for d.f. at least 50, there is very little difference between the z and t distributions and I'll accept either one). Also note that you can find SE(X-bar) by taking either s or sigma and dividing by sqrt(n).

(4) Do large or small values provide evidence to reject H_0? For instance, a small value of X-bar can't give evidence in favor of a 1-tail hypothesis where H_A asserts that the mean is larger than some given number mu_0, since the null hypothesis H_0: mu = mu_0 means in this case mu at most mu_0.

These questions (1) through (4) would be appropriate for our midterm. Note that they require very little computation but they may use a table. Don't worry if you've lost your tables - I'll give everyone copies to use. It is the concepts and procedures that I want to make sure you understand.

As announced in class on Friday, the midterm is postponed to Mon. May 1. We will also cover part of chap. 27 - but only what is given in the three problems below. Note the chart at the end of the chapter, which compares the tests for the slope of the regression line with tests for the mean.

For Wed., in Chap. 27 #2,4,8,18,20. We will go over some of these problems in class - I'll call on some of you to present them. For #2, assuming that the conditions have been met, the sampling distribution of the regression slope can be modeled by a t-model with 11-2 = 9 d.f. Your calculation should give t = 7.85 which has a P-value of .0001 (by program) or using the table in the back, the P-value is (much) less than .01.

If H_0: mu = 90 and H_A: mu < 90  and if alpha = .01,
decide whether or not to reject H_0 in favor of H_A 
based on the following data:

Sample of size 16 has X-bar = 78 and s = 20.

Assume the population is approximately normal.

Since the sample average is in the direction favored by the one-sided alternate hypothesis, it is plausible that it could give sufficient evidence to reject H_0 in favor of H_A - but only if it is far enough below what one could reasonably expect by chance if H_0 were true. Far enough means that the calculated t-statistic given by


   X-bar - mu_0      78 - 90        - 12
  -------------  = ------------ = ------- = -2.4
    SE(X-bar)       20/sqrt(16)      5

is below the cut-off value of


    -t_(.01) = -2.602

as given by the t-table, 2nd col from the right (for alpha = .01
in a one-tail situation, with 15 degrees of freedom).

Since -2.4 is _not_ less than -2.602, the evidence is not sufficient to reject H_0 with at most a one-percent chance of making a type I error.
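Here is the same one-tail arithmetic as a quick Python check. The critical value -2.602 is copied from the t-table (15 d.f., one-tail alpha = .01); everything else is computed.

```python
import math

# One-sample t-statistic for H_0: mu = 90 vs. H_A: mu < 90.
x_bar, mu_0, s, n = 78, 90, 20, 16
se = s / math.sqrt(n)             # 20/4 = 5
t_stat = (x_bar - mu_0) / se      # -12/5 = -2.4

t_crit = -2.602                   # from the t-table, not computed here
reject = t_stat < t_crit
print(t_stat, reject)             # -2.4 False: not enough evidence
```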



If H_0 is the same, H_A: mu not equal to 90, alpha
is still .01, decide whether or not to reject H_0
in favor of this new H_A given following data:

Sample of size 16 has X-bar = 78 and s = 14.

Assume the population is approximately normal.

Now we have a two-tail hypothesis to test, and sample averages that are sufficiently large as well as sufficiently small can both provide enough evidence to reject H_0 in favor of the new H_A. Since the sample is small and we only know the sample standard deviation s (not sigma), we use the t-test again. We first calculate the test statistic:


... just as above but instead of 20/sqrt(16) = 5, we now have

           14/4 = 3.5

in the denominator, so the t-value is -12/3.5 = -3.43, while the
critical t-value in the two-tail case with alpha = .01 is -2.947
so the statistic is small enough to reject the null hypothesis.
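And the corresponding check for the two-tail version, again taking the critical value 2.947 from the t-table (15 d.f., two-tail alpha = .01):

```python
import math

# Two-tail t-test: reject H_0 when |t| exceeds the critical value.
x_bar, mu_0, s, n = 78, 90, 14, 16
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))   # -12/3.5, about -3.43

t_crit = 2.947                # t-table, 15 d.f., two-tail alpha = .01
reject = abs(t_stat) > t_crit
print(round(t_stat, 2), reject)   # -3.43 True: reject H_0
```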



Determine a 99.7 percent confidence interval for mu
given that the s.d. sigma is known to be 14, basing
the confidence interval on a sample of size 49 with
sample average X-bar = 33.18.

Here you first calculate the margin of error

      ME = s.d.(X-bar) * z_.0015

where s.d.(X-bar) = sigma/sqrt(n) = 14/sqrt(49) = 2,
and z_.0015 = 3 (by the 68-95-99.7 rule).  Hence, the ME = 6, so the interval,
centered at 33.18, of

    [27.18, 39.18]

is the 99.7-percent confidence interval.  Note that
it would also be ok to use the t-distribution here,
with 48 d.f., but the values will be quite close.
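The same interval, computed step by step in Python (z = 3 as in the text):

```python
import math

# 99.7-percent confidence interval for mu with sigma known.
sigma, n, x_bar, z = 14, 49, 33.18, 3

sd_xbar = sigma / math.sqrt(n)    # 14/7 = 2
me = z * sd_xbar                  # margin of error = 6
ci = (x_bar - me, x_bar + me)
print(ci)                         # about (27.18, 39.18)
```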

Ok here's a simple "homogeneity" problem which you can compute by hand. I'll include the answer below so you can check it. Then you can try a computer method and see if it gives the same answers!


Suppose on a cruise ship you know that the following 
counts occurred for the types of breakfast consumed by
passengers and crew.  Do you believe that there is no
evidence of difference between the choices made by crew
and passengers (homogeneity)?  Or should you reject the
null hypothesis?  Explain.


                    passengers               crew


cold cereal            30                     10

hot cereal             20                     10

eggs and ham           40                     60

continental            10                     20














spoiler below ...



























The null hypothesis H_0: Passengers and crew have the
same distribution of breakfast preferences

vs. H_A: Passengers and crew have different preferences.



                    passengers     crew        total


cold cereal            30 (20)      10 (20)      40

hot cereal             20 (15)      10 (15)      30

eggs and ham           40 (50)      60 (50)     100

continental            10 (15)      20 (15)      30

total                 100          100          200


Since 100/200 = 1/2,  expected counts are as shown 
(in parens), so squaring the residuals (obs-exp) and
dividing by the expected counts, you should get the
table of chi-squared entries (i.e., values to be summed)
E.g., for the upper left cell, 30-20 = 10 and squaring
10 and dividing by 20 gives 100/20 = 5, etc.

                    passengers               crew


cold cereal            5                      5

hot cereal             25/15                 25/15

eggs and ham           2                      2

continental            25/15                 25/15

So the chi-squared value is 14 + 100/15 = 20.66...
Since r = 4 and c = 2, df = 3 and from the table in
the back of the book, the P-value is less than .005
(in fact, very much less). 

Hence, you should reject the null hypothesis.
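If you want to try a computer method, here is one way to redo the whole table by hand in Python (scipy's chi2_contingency would give the same chi-square value, if you have it available). The counts are the ones from the table above.

```python
# Chi-square test of homogeneity for the breakfast table.
observed = {
    "cold cereal":  (30, 10),   # (passengers, crew)
    "hot cereal":   (20, 10),
    "eggs and ham": (40, 60),
    "continental":  (10, 20),
}

row_totals = {k: sum(v) for k, v in observed.items()}
col_totals = [sum(v[i] for v in observed.values()) for i in range(2)]  # 100, 100
n = sum(col_totals)                                                    # 200

chi_sq = 0.0
for k, counts in observed.items():
    for i, obs in enumerate(counts):
        exp = row_totals[k] * col_totals[i] / n   # row total * col total / N
        chi_sq += (obs - exp) ** 2 / exp

print(round(chi_sq, 2))   # 20.67, matching 14 + 100/15 above
```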

The homework (given out in class last Friday) was collected on Tues. Apr. 11, but the quiz was deferred to Wed. Apr. 19.

For Wed., April 19, we will go over the homework which I gave you over the break (e.g., for #1, the first CI is (.55,.65) - or slightly smaller if you calculate exactly). I asked you to do #2 and #12 of Chap. 26 also.

For the quiz, also on 4/19, I'll ask you:

  1. to convert a written description into a formal statement of two hypotheses,
  2. to give a 95-percent confidence interval for a proportion using the sample proportion and sample size,
  3. to say how much larger the sample would need to be in order to decrease the length of the CI by 9-fold
  4. to provide a suitable conclusion regarding whether or not to reject H_0 (you'll be given H_0 and H_A) from sample data in 2 situations involving the mean of a r.v.
Here are partial answers for #1 through #5 
just below on the page:

1.  CI is [.55,.65] approx.

2a  Don't reject H_0  (the data has wrong
direction in terms of the alternative)

2b  Don't reject H_0:

You calculate a statistic which comes out
to be 1.27 but that isn't large enough 
(it would have to exceed z_.025 = 1.96).



3   Reject H_0  (since 2 > z_.05 = 1.645)

4   Retain H_0 

the test statistic is (42 - 50)/(11.9/sqrt(12))
which is about -2.32 but this is not less than
-t_.01 = -2.72  

don't worry about the P-value 

5.  Reject H_0  (the stat is about 3 which
is bigger than 1.96)

For #5, the stat is 

(p_1-hat - p_2-hat)/SE(p_1-hat - p_2-hat)

= (approximately) .12/.04

since the null hypothesis is that p_1 = p_2
and hence the mean of (p_1-hat - p_2-hat)
(as a sampling distribution) is zero.

One gets SE(p_1-hat - p_2-hat) using  


sqrt(p_1-hat*q_1-hat/n_1 + p_2-hat*q_2-hat/n_2)

and p_1-hat itself is 120/200 = .60, 
p_2-hat is 240/500 = .48
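As a check, here is the #5 computation in Python, using the unpooled SE formula given above. The text's .12/.04 is a rounded version; the exact statistic is about 2.91, still well above 1.96.

```python
import math

# Two-proportion z-statistic for urban vs. rural support.
p1, n1 = 120 / 200, 200    # .60
p2, n2 = 240 / 500, 500    # .48

se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)   # about .041
z = (p1 - p2) / se                                        # about 2.91
print(round(se, 3), round(z, 2))
```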


Here are some practice problems for Tues. 4/18. I won't collect them but I will call on some of you to put the problems on the board.

1. A random sample of 400 voters provides a sample proportion of .60
in favor of some proposition.  Give a 95-percent confidence interval
for the true proportion of the population which favors the proposition.

2.(a) If the null hypothesis is that 3 percent of new tires are defective
vs. the alternative hypothesis that more than 3 percent are defective,
what can you conclude from a sample of 625 tires with a sample percentage
of 2 percent defective?  Assume the sampling is suitably random.

  (b) If your sample percentage is 4 percent defective and you have a
significance level (alpha) of .025, what do you do?

3. A random sample of 100 recorded deaths in the US during the past year
showed an average life span of 71.8 years with s.d. of 8.9 years. Does
this seem to indicate that the average life span today is greater than
70 years?  Use a 0.05-level of significance.

4. The average length of time for a student to register for classes at
some college is 50 minutes.  Suppose a new procedure is tried on a random
sample of 12 students and the average time is 42 minutes with a standard
deviation of 11.9 minutes.  Test the hypothesis that the new procedure is
faster than the old procedure, using a significance of .01.  What is the
p-value?  Assume that the distribution of times is approximately normal.

5. If in a sample of 200 urban voters, 120 favor a proposition, while
in a sample of 500 rural voters, 240 favor the proposition, should we
reject the null hypothesis that these two groups have the same degree
of support for the proposition in favor of the alternative that the
urban support is greater than the rural support?  Use a .025 level of
significance.

Please do #10 and #16 of CH. 20. For #16, ALSO do the problem with a different alternative hypothesis H_A: p not equal to p_0 (i.e., the 2-tail version). These will be collected for homework on Friday Apr. 7. Also read chapter 21 up to p. 420.

Homework due Tues. April 4: CH. 19: #4, 12a,b, 14, 20, 22, 24.

For hw due 3/28: pp. 349--351: #4,16; pp. 370--372: #4,10,16,28* (for the last problem you may use a calculator)

For Friday (3/24), try pp. 349--351: #4,16,22. 
Read Ch. 18, and then try on pp. 370--372: #4,10,16,22,28.
Some of these will be selected for homework due Tues. 3/28.

For homework due Tues. Mar. 21 for collection. I did the problems below in class today with the values given in the text (.02 and 4/5, resp.). The homework below asks you to do the same problems but with the changed probabilities of .01 and 2/3, resp. If you missed class, you should check with someone who was there to see the notes.

Ch. 17, pp.343--344: #8 and 10 but change P(chip fails) to .01;
#14a,b,d and 16a,b - but with P(hit the target) to 2/3.

Chap. 16 : Homework for Tues. 3/14: pp. 325--326 #2b,8 10b,22,24 - total of 5 problems to be collected.

Also for Wed. 3/15 try pp. 327--328 #32,38 and read Ch. 17 - try: #8,10,14,16,18,22,26,28,36 pp. 343--345.

#1 (a) What is the probability of drawing 2 hearts 
if you draw 2 cards from a standard deck?  

(b) What is the probability of drawing 2 cards of
the same suit if you draw 2 cards from a standard deck?

(c) What is the expected number of hearts if you draw
2 cards from a standard deck?


... spoiler below (try it first and then look)



























Answers: We write C(n,k) for the binomial coeff.
n choose k = n!/(k!(n-k)!).  Use cancellation when
you can to make the calculations easier.

1a C(13,2)/C(52,2) = (13*12)/(52*51)
                   = 1/17 = .06 approx

note that P(1 H and 1 nonH) = 13*39/C(52,2) 
                   = 13/34 = .38 approx

so P(2 nonH) is 1 - (.06 + .38) = .56 approx,
and this agrees with the direct calculation

P(2nonH) = C(39,2)/C(52,2) = (39*38)/(52*51) 
                           = 19/34 = .5588 ...

1b By symmetry the four suits are equally likely, so 4 times 1a: 4/17 = .24 approx

1c E(# of hearts) = E(X) = Sum_x P(X=x)*x 
 = 2 * .06 + 1 * (.38) = .50 and in fact
this is exact: 2/17 + 13/34 = 17/34 = 1/2.

Note that E(# of hearts in drawing one card)
is 1/4 so E(# of hearts in drawing two) = 1/2,
and E(# of hearts in drawing 12 cards) = 12/4 = 3.
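The card calculations above can be done exactly with Python's built-in binomial coefficients and fractions:

```python
from fractions import Fraction
from math import comb   # comb(n, k) is C(n, k) = n choose k

p_2_hearts = Fraction(comb(13, 2), comb(52, 2))   # 1/17
p_1_heart  = Fraction(13 * 39, comb(52, 2))       # 13/34
p_0_hearts = Fraction(comb(39, 2), comb(52, 2))   # 19/34

# The three cases exhaust all possibilities.
assert p_2_hearts + p_1_heart + p_0_hearts == 1

expected_hearts = 2 * p_2_hearts + 1 * p_1_heart  # 17/34 = 1/2 exactly
print(p_2_hearts, expected_hearts)                # 1/17 1/2
```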

Some answers: Ch. 14 #10c - legitimate, 
d - not legit since probabilities can't be negative

#18a .027, b .125, c .001, d .729, e .784

#24 .469

Ch. 15 #6a .06, b .50; #20a .444, b No. 
(4 % of US residents have been to both)
c No. 18 % of res. have been to Canada
but 4/9 = 44.4 % of those who have been
to Mexico have also been to Canada, so
P(C) is not equal to P(C|M).  Intuitively,
this should make sense; the likelihood of
traveling to Mexico is higher given that
a person has traveled to Canada.
#10 abc .62, .867, .194
#26a Yes, percent who graduate depends 
on the school they attend. Part b: 
(.7)(.75) + (.3)(.9) = .525 + .27 = .795
#36 P(A|D) = P(A and D)/P(D) =
(.7)(.01)/[(.7)(.01) + (.2)(.02) + (.1)(.04)]
= .007/[.007 + .004 + .004] = 7/15 = .467

The reason is that 
D = (D and A) or (D and B) or (D and C) so
P(D) = P(D and A) + P(D and B) + P(D and C)
since these events are mutually exclusive
(exactly one can occur). Now P(D and A) is
P(A)P(D|A), etc.
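The same total-probability / Bayes computation, written out in Python with the probabilities used above:

```python
# P(A|D) = P(A and D) / P(D), where P(D) comes from the
# mutually exclusive decomposition D = (D and A) or (D and B) or (D and C).
p_a, p_b, p_c = 0.7, 0.2, 0.1            # P(A), P(B), P(C)
p_d_given = {"A": 0.01, "B": 0.02, "C": 0.04}

p_d = (p_a * p_d_given["A"]              # P(D and A) = P(A)P(D|A)
       + p_b * p_d_given["B"]
       + p_c * p_d_given["C"])           # .007 + .004 + .004 = .015

p_a_given_d = p_a * p_d_given["A"] / p_d
print(round(p_a_given_d, 3))             # 0.467, i.e. 7/15
```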

For Tuesday Feb. 21, try Ch. 14 #24; Ch. 15 #6ab, 8ab, 20abc, 24, 26, 28, 30, 32, 36; for homework to be collected on Wed. Feb. 22, Ch. 14: #18abc, Ch. 15: #10abc,36

Here are some selected answers to check your work:
Ch. 14
#14a .04, .51, .55; b: (.4)^4 = .0256, etc.

Ch. 15
#26b 79.5 percent; #28 66 percent


For Fri. 2/17, please read Chap. 14 and on pp. 291--293,
please try #10,14,16,18(a,b,c).  Then read pp. 294--302
in Chap. 15 (you can read more if you have time) and 
on pp. 310--311 try #2,4,10,14(a,c).  

Note that there is a misprint in the book on p. 302 
(right-hand column, last paragraph above "drawing without
replacement").  It should read:

"Overall 36 percent of the drivers got blood tests, but
only _28_ percent of those who got a breath test do."
(i.e., also got a blood test).

For Wed., read through chapters 10 -- 13 and bring your questions to class!

For Tuesday 2/14, read chapter 13 (if you haven't already done so) and be prepared to discuss the following problems on pp 267--271: #2,4,10,12,20,22,24. These won't be collected tomorrow but make some notes so you can respond in class.

For Tues. Feb 7: homework to be collected: Chap. 9 (pp. 181--182) #8,12,14 (use #11 and 13 to see what is expected from #12 and 14). We'll talk about #11 and 13 in class on Monday but try them first on your own - then look up the answers in the book and reread the chapter to see why the book's answers are correct.

Here are problems from Chap. 8 for review: pp. 158--164: #4,6,8,10,18,22. Also for Ch. 8: #2 ab and #36 (_not #30 which I mistakenly assigned before; we will do it using the computer later - though you can try now if you like).

For those who want to check my claim that the slope of the
line of best fit does not change if the x and y variables
are interchanged - provided that the standard deviations
are the same, here is the concrete example I mentioned.

Here are three data points for which you can make a scatterplot:

(1,4), (3,-1), (-4,-3)  that is, x = (1,3,-4) and y = (4,-1,-3).

Note that x-bar = 0 = y-bar; i.e., both means are zero.
Also the standard deviations are both sqrt(13).  For instance,
s_x = sqrt [(1 + 9 + 16)/2].  Let's figure out the slope of
the line of best fit.  The line must go through (0,0) since
both means are zero and let's call its slope m.  

The residuals are the differences y_i - y-hat_i, where y-hat_i
is the value predicted for y_i by using the line; in this case,
y-hat_i = m x_i.  Calculating the sum of the squared residuals by
using the algebraic fact that (a + b)^2 = a^2 + 2ab + b^2, you get

SSR(m) = (4 - m)^2 + (-1 - 3m)^2 + (-3 + 4m)^2 = 

(16 - 8m + m^2) + (1 + 6m + 9m^2) + (9 - 24m + 16m^2) = 

26(m^2 - m + 1).  

To find the value of m which minimizes SSR(m), take the derivative,
set the result equal to zero, and solve for m.

So 2m - 1 = 0 and hence m = 1/2.

Now if you take the same data but reverse x and y, you have
x' = (4,-1,-3) and y' = (1,3,-4) so the means are still zero
and the standard deviations are still the same.  Let k be the
slope of the line of best fit for this new "reversed" data,
and do the same calculation as above to find 

SSR(k) = (1 - 4k)^2 + (3 + k)^2 + (-4 + 3k)^2.

Doing the same calculation as above, you should get
26(k^2 - k + 1) so the unique root of the derivative is 1/2 again!
That means that the line of best fit for the reversed data has
the same slope as the line of best fit for the original data.
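You can also confirm this numerically with the closed-form least-squares slope, m = sum((x - x-bar)(y - y-bar)) / sum((x - x-bar)^2), which gives the same answer as minimizing SSR:

```python
# Least-squares slope via the closed-form formula.
def ls_slope(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

x = [1, 3, -4]
y = [4, -1, -3]

# Original data and reversed data give the same slope m = 1/2.
print(ls_slope(x, y), ls_slope(y, x))   # 0.5 0.5
```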

You may wish to try the following online survey.