1
00:00:00,499 --> 00:00:02,830
The following content is
provided under a Creative

2
00:00:02,830 --> 00:00:04,340
Commons license.

3
00:00:04,340 --> 00:00:06,680
Your support will help
MIT OpenCourseWare

4
00:00:06,680 --> 00:00:11,050
continue to offer high quality
educational resources for free.

5
00:00:11,050 --> 00:00:13,660
To make a donation or
view additional materials

6
00:00:13,660 --> 00:00:17,563
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:17,563 --> 00:00:18,188
at ocw.mit.edu.

8
00:00:23,224 --> 00:00:25,030
TOM LEIGHTON: Today
we're going to talk

9
00:00:25,030 --> 00:00:29,560
about the concept
of independence.

10
00:00:29,560 --> 00:00:41,760
In probability, we
say that an event A

11
00:00:41,760 --> 00:01:00,210
is independent of an event B
if one of two conditions holds.

12
00:01:00,210 --> 00:01:04,910
First, if the
probability of A given B

13
00:01:04,910 --> 00:01:08,910
is just the same as
the probability of A

14
00:01:08,910 --> 00:01:16,470
or if B can't happen, namely
the probability of B is 0.

15
00:01:16,470 --> 00:01:22,450
In other words, A is independent
of B if knowing that B happened

16
00:01:22,450 --> 00:01:26,060
doesn't change the probability
that A is going to happen.

17
00:01:26,060 --> 00:01:29,720
So knowing that
this event occurs

18
00:01:29,720 --> 00:01:32,460
doesn't influence the
probability that A occurs.

19
00:01:32,460 --> 00:01:35,700
And there's a special case where
they're independent because you

20
00:01:35,700 --> 00:01:37,820
know that B can't happen.

21
00:01:37,820 --> 00:01:41,020
If the probability
of B happening is 0,

22
00:01:41,020 --> 00:01:44,271
then everything is
independent of B.

23
00:01:44,271 --> 00:01:47,260
Now, the typical
example that gets used

24
00:01:47,260 --> 00:01:48,515
is when you flip two coins.

25
00:01:53,970 --> 00:01:59,525
So say we flip two
fair, independent coins.

26
00:02:08,020 --> 00:02:21,090
And let's let B be the event
that the first coin is heads

27
00:02:21,090 --> 00:02:25,930
and that means that the
probability of B happening

28
00:02:25,930 --> 00:02:30,040
is 1/2, because we've
assumed it's a fair coin,

29
00:02:30,040 --> 00:02:36,110
and we'll let A be the event
that the second coin comes out

30
00:02:36,110 --> 00:02:38,120
heads.

31
00:02:38,120 --> 00:02:41,145
So we know the probability of
A is 1/2 because it's fair.

32
00:02:45,180 --> 00:02:50,300
And because they're
independent, we

33
00:02:50,300 --> 00:02:55,460
can conclude that the
probability of A given B

34
00:02:55,460 --> 00:03:01,780
is 1/2, which is the probability
of A. In other words,

35
00:03:01,780 --> 00:03:03,830
seeing the result
of the second coin

36
00:03:03,830 --> 00:03:09,190
doesn't tell you anything about
the result of the first coin.

37
00:03:09,190 --> 00:03:13,300
Now actually, when
you flip two coins,

38
00:03:13,300 --> 00:03:17,520
it's not always the
case that they're independent.

39
00:03:17,520 --> 00:03:19,240
Can anybody think
of an example where

40
00:03:19,240 --> 00:03:23,550
you can flip a pair of coins
and they are dependent somehow,

41
00:03:23,550 --> 00:03:25,771
they're not independent?

42
00:03:25,771 --> 00:03:26,270
Yeah.

43
00:03:26,270 --> 00:03:29,640
AUDIENCE: Well, if you have to
get two heads and two tails?

44
00:03:29,640 --> 00:03:33,500
TOM LEIGHTON: If you have to
get two heads or two tails.

45
00:03:33,500 --> 00:03:35,200
Well, how would you have to get?

46
00:03:35,200 --> 00:03:39,426
AUDIENCE: The probability
of getting two heads

47
00:03:39,426 --> 00:03:42,092
should be 1/4 [INAUDIBLE].

48
00:03:42,092 --> 00:03:43,550
TOM LEIGHTON: Well,
then they would

49
00:03:43,550 --> 00:03:45,015
be independent in that case.

50
00:03:45,015 --> 00:03:45,515
Yeah.

51
00:03:45,515 --> 00:03:47,223
AUDIENCE: If you glue
the coins together.

52
00:03:47,223 --> 00:03:48,150
TOM LEIGHTON: Yeah.

53
00:03:48,150 --> 00:03:53,160
I mean, this is a silly example,
but I got two fair coins here.

54
00:03:53,160 --> 00:03:57,866
I could clip them together
and now I flip them

55
00:03:57,866 --> 00:04:00,490
and odds are pretty good they're
both going to be heads or both

56
00:04:00,490 --> 00:04:01,990
be tails.

57
00:04:01,990 --> 00:04:04,230
If you know what happened
to the right coin,

58
00:04:04,230 --> 00:04:07,040
it will tell you what
happened to the left coin.

59
00:04:07,040 --> 00:04:10,600
Now, that's a pretty
contrived example,

60
00:04:10,600 --> 00:04:14,000
but it is illustrative of
what happens in practice.

61
00:04:14,000 --> 00:04:17,610
In practice, we
assume independence

62
00:04:17,610 --> 00:04:20,329
even though there can
be subtle dependencies

63
00:04:20,329 --> 00:04:21,620
and this could lead to trouble.

64
00:04:21,620 --> 00:04:23,650
In fact, we're going to
give a lot of examples

65
00:04:23,650 --> 00:04:26,090
where it leads to
trouble today and also

66
00:04:26,090 --> 00:04:27,560
for the rest of the course.

67
00:04:27,560 --> 00:04:30,390
Because we're always going to
want to assume independence

68
00:04:30,390 --> 00:04:33,400
and when we do, we're going
to get very nice results,

69
00:04:33,400 --> 00:04:35,820
but things aren't always
independent in practice

70
00:04:35,820 --> 00:04:39,860
and establishing independence
is a hard thing to do.

71
00:04:39,860 --> 00:04:41,930
For that matter, while
we're on the subject,

72
00:04:41,930 --> 00:04:43,450
we always talk about fair coins.

73
00:04:43,450 --> 00:04:45,650
You flip a coin and it's fair.

74
00:04:45,650 --> 00:04:47,450
You know, that's not
always true either.

75
00:04:47,450 --> 00:04:50,130
There's actually a
famous mathematician

76
00:04:50,130 --> 00:04:55,100
named Persi Diaconis who used
to be down the street at Harvard

77
00:04:55,100 --> 00:04:58,070
and he came and
gave a talk one day

78
00:04:58,070 --> 00:05:00,540
at MIT in the math department
and he's a probabilist.

79
00:05:00,540 --> 00:05:03,830
He does probability theory
and is a very cool guy.

80
00:05:03,830 --> 00:05:07,890
And so he flipped a coin,
got a quarter from somebody

81
00:05:07,890 --> 00:05:10,400
in the audience and
flipped it and he

82
00:05:10,400 --> 00:05:13,360
flipped it, I think, 10 or
20 straight times all

83
00:05:13,360 --> 00:05:16,900
the way to the roof,
caught it, turned it over.

84
00:05:16,900 --> 00:05:18,812
Every time it was heads.

85
00:05:18,812 --> 00:05:22,190
And he goes, now what's the
probability of that happening?

86
00:05:22,190 --> 00:05:24,800
Well, you know, it's 1/2
to the 20th or whatever,

87
00:05:24,800 --> 00:05:26,130
not very likely.

88
00:05:26,130 --> 00:05:28,730
How could he always
make it come out heads?

89
00:05:28,730 --> 00:05:31,780
Well, Persi was an unusual
guy and in fact, he'd

90
00:05:31,780 --> 00:05:37,480
spent months in the strobe
lab over at Harvard practicing

91
00:05:37,480 --> 00:05:41,794
to make it always rotate seven
times, three of them on the way

92
00:05:41,794 --> 00:05:43,460
up, one at the top,
and then three down.

93
00:05:43,460 --> 00:05:46,280
And he could actually
see how many rotations it

94
00:05:46,280 --> 00:05:48,420
had done to make
sure it was seven,

95
00:05:48,420 --> 00:05:50,950
so it always came out heads.

96
00:05:50,950 --> 00:05:52,560
Now, he is an unusual fellow.

97
00:05:52,560 --> 00:05:56,360
He was 1 of 10 people
in the world that

98
00:05:56,360 --> 00:06:00,482
could do a perfect shuffle
reliably on a deck of cards

99
00:06:00,482 --> 00:06:01,940
and that's a very
hard thing to do.

100
00:06:01,940 --> 00:06:06,330
He said he had to practice 8
hours a day for over six months

101
00:06:06,330 --> 00:06:07,722
to be able to do it every time.

102
00:06:07,722 --> 00:06:09,930
In fact, he gave another
talk at MIT where he came in

103
00:06:09,930 --> 00:06:13,520
and he did magic tricks,
actually based on mathematics.

104
00:06:13,520 --> 00:06:17,440
And you would cut a deck,
he would feel it like this

105
00:06:17,440 --> 00:06:19,500
and tell you where you
cut, how many cards

106
00:06:19,500 --> 00:06:22,390
were in the part you picked
up and then do his eight

107
00:06:22,390 --> 00:06:24,215
perfect shuffles, which
is enough to return

108
00:06:24,215 --> 00:06:28,720
a normal 52-card deck back
to its original order.
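
The claim about eight perfect shuffles can be checked with a short simulation. This is only a sketch: it assumes "perfect shuffle" means an out-shuffle (the deck is cut exactly in half and interleaved with the original top card staying on top), and the function names are made up for illustration.

```python
def out_shuffle(deck):
    """Cut the deck exactly in half and interleave, top card staying on top."""
    half = len(deck) // 2
    top, bottom = deck[:half], deck[half:]
    shuffled = []
    for a, b in zip(top, bottom):
        shuffled.extend((a, b))
    return shuffled


def shuffles_to_restore(n_cards):
    """Count how many out-shuffles return an n_cards deck to its starting order."""
    original = list(range(n_cards))
    deck = out_shuffle(original)
    count = 1
    while deck != original:
        deck = out_shuffle(deck)
        count += 1
    return count


print(shuffles_to_restore(52))
```

Under that assumption the count for a 52-card deck comes out to eight, matching the number quoted in the lecture.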

109
00:06:28,720 --> 00:06:31,170
And then using this,
he could play the game

110
00:06:31,170 --> 00:06:34,360
where you pick any card,
you stick it in,

111
00:06:34,360 --> 00:06:37,719
he feels where the card went,
and then using mathematics,

112
00:06:37,719 --> 00:06:39,260
he could shuffle
the deck eight times

113
00:06:39,260 --> 00:06:42,670
and make the card come out
anywhere he wanted in the deck.

114
00:06:42,670 --> 00:06:46,000
So he had a lot going
on upstairs too.

115
00:06:46,000 --> 00:06:47,910
He had an interesting
life history.

116
00:06:47,910 --> 00:06:50,880
He ran away from
home as a young child

117
00:06:50,880 --> 00:06:54,150
and joined the traveling circus.

118
00:06:54,150 --> 00:06:57,580
And then somehow from there, he
joined the faculty at Harvard.

119
00:06:57,580 --> 00:07:01,190
You know, there's
an amazing story.

120
00:07:01,190 --> 00:07:02,897
And actually another
story about Persi

121
00:07:02,897 --> 00:07:06,190
is he was the first
guy to get kicked out

122
00:07:06,190 --> 00:07:08,060
of casinos for card counting.

123
00:07:08,060 --> 00:07:12,970
He figured that out way before
the MIT team and the movie 21.

124
00:07:12,970 --> 00:07:14,720
Down in Puerto Rico,
he used to play

125
00:07:14,720 --> 00:07:19,470
and then they finally figured
him out and he got booted.

126
00:07:19,470 --> 00:07:26,230
So back to independence, let's
do another picture example.

127
00:07:26,230 --> 00:07:29,720
Say that my sample
space looks like this

128
00:07:29,720 --> 00:07:34,650
and I've got two events, A
and B and they look like this,

129
00:07:34,650 --> 00:07:36,456
so they're disjoint.

130
00:07:36,456 --> 00:07:40,340
Are A and B independent?

131
00:07:40,340 --> 00:07:41,410
No.

132
00:07:41,410 --> 00:07:47,780
In fact what is the probability
of A given B as I've drawn it?

133
00:07:47,780 --> 00:07:48,560
AUDIENCE: 0.

134
00:07:48,560 --> 00:07:49,940
TOM LEIGHTON: 0.

135
00:07:49,940 --> 00:07:54,371
Because if B occurs,
you're outside of A.

136
00:07:54,371 --> 00:07:57,175
And so this does not
equal the probability of A

137
00:07:57,175 --> 00:07:59,860
as long as it's not 0.

138
00:07:59,860 --> 00:08:05,000
So disjoint events don't imply
that they're independent.

139
00:08:12,730 --> 00:08:15,840
Now, what does the picture
look like for them

140
00:08:15,840 --> 00:08:16,630
to be independent?

141
00:08:16,630 --> 00:08:19,926
What is the right
picture to draw here?

142
00:08:19,926 --> 00:08:22,880
So I got my sample
space and say I

143
00:08:22,880 --> 00:08:28,070
make this half the sample
space be A. Well, then B

144
00:08:28,070 --> 00:08:31,110
to be independent,
would look something--

145
00:08:31,110 --> 00:08:32,175
I didn't quite draw it.

146
00:08:32,175 --> 00:08:35,309
I actually have it be 50-50.

147
00:08:35,309 --> 00:08:41,860
So if A is 50% of S,
like this half, then

148
00:08:41,860 --> 00:08:47,360
for A to be independent of
B, A intersect B, this part,

149
00:08:47,360 --> 00:08:53,920
has to be 50% of B. Because
the probability of A given B

150
00:08:53,920 --> 00:08:58,081
must equal the probability
of A to be independent.

151
00:08:58,081 --> 00:09:00,330
So this would be a picture
where they are independent.

152
00:09:03,534 --> 00:09:04,950
Now, independent
events are really

153
00:09:04,950 --> 00:09:07,490
nice to work with and
in part because they

154
00:09:07,490 --> 00:09:10,170
have a very simple
rule for computing

155
00:09:10,170 --> 00:09:15,070
the probability of an
intersection of events

156
00:09:15,070 --> 00:09:19,015
and it's called the product
rule for independent events.

157
00:09:34,150 --> 00:09:44,620
And that says that if
A is independent of B,

158
00:09:44,620 --> 00:09:51,000
then the probability of
A and B or A intersect B

159
00:09:51,000 --> 00:09:54,680
is just the product
of their probabilities

160
00:09:54,680 --> 00:09:59,290
separately, the probability of
A times the probability of B.

161
00:09:59,290 --> 00:10:00,530
So let's prove this.

162
00:10:05,630 --> 00:10:09,390
And there's two cases, depending
on whether or not B can happen,

163
00:10:09,390 --> 00:10:12,490
if the probability
of B is 0 or not.

164
00:10:12,490 --> 00:10:18,880
So case 1 is B can't happen.

165
00:10:18,880 --> 00:10:22,220
The probability of B is 0.

166
00:10:22,220 --> 00:10:26,980
In this case, what's the
probability of A and B?

167
00:10:30,680 --> 00:10:32,330
B can't happen.

168
00:10:32,330 --> 00:10:33,170
0.

169
00:10:33,170 --> 00:10:36,620
If B can't happen,
then they both can't.

170
00:10:36,620 --> 00:10:38,800
You can't have both
of them happening

171
00:10:38,800 --> 00:10:47,080
and that equals the probability
of A times the probability of B

172
00:10:47,080 --> 00:10:48,670
because the
probability of B is 0.

173
00:10:48,670 --> 00:10:49,730
So that case works.

174
00:10:53,170 --> 00:10:58,010
Case 2 is the probability
of B is bigger than 0.

175
00:11:01,720 --> 00:11:05,490
In that case, we have
the probability of A

176
00:11:05,490 --> 00:11:10,980
and B, A intersect B,
well, from the definition,

177
00:11:10,980 --> 00:11:15,390
is the probability of B times
the probability of A given

178
00:11:15,390 --> 00:11:19,270
B. We did that last time.

179
00:11:19,270 --> 00:11:24,370
And by independence, this
is just the probability of A

180
00:11:24,370 --> 00:11:27,270
because A is independent
of B, so we're done.
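
Written out in symbols, the two cases of this proof look roughly as follows. This is a restatement in standard notation rather than something taken from the board; the only extra step is the bound Pr[A ∩ B] ≤ Pr[B], which is what justifies Case 1.

```latex
\textbf{Case 1: } \Pr[B] = 0.\quad
0 \le \Pr[A \cap B] \le \Pr[B] = 0
\;\Longrightarrow\;
\Pr[A \cap B] = 0 = \Pr[A]\cdot\Pr[B].

\textbf{Case 2: } \Pr[B] > 0.\quad
\Pr[A \cap B] = \Pr[B]\cdot\Pr[A \mid B] = \Pr[B]\cdot\Pr[A].
```

The last equality in Case 2 is exactly where independence is used: Pr[A | B] is replaced by Pr[A].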

181
00:11:33,280 --> 00:11:38,420
In fact, many texts
will define independence

182
00:11:38,420 --> 00:11:40,460
by this product rule.

183
00:11:40,460 --> 00:11:42,990
Many texts will say
that A and B are

184
00:11:42,990 --> 00:11:46,320
independent if this is true.

185
00:11:46,320 --> 00:11:48,070
And it's equivalent,
it turns out.

186
00:11:48,070 --> 00:11:51,810
We won't prove that here, but if
you use this as the definition,

187
00:11:51,810 --> 00:11:54,084
then you can derive our
definition as a result.

188
00:11:54,084 --> 00:11:56,250
So this is an equivalent
definition of independence.

189
00:11:59,960 --> 00:12:03,570
Another nice fact about
independent events

190
00:12:03,570 --> 00:12:06,695
is that it's a
symmetric relationship.

191
00:12:12,210 --> 00:12:13,925
It's called the symmetry
of independence.

192
00:12:26,330 --> 00:12:34,540
That says that if
A is independent of B,

193
00:12:34,540 --> 00:12:36,440
then the reverse is true.

194
00:12:36,440 --> 00:12:46,240
B is independent of A.
Now, we won't prove that.

195
00:12:46,240 --> 00:12:48,590
It's actually easier
to see that it's true

196
00:12:48,590 --> 00:12:52,380
if this were the
definition of independence

197
00:12:52,380 --> 00:12:54,810
because A intersect
B is the same

198
00:12:54,810 --> 00:12:59,750
as B intersect A and
multiplication is commutative.

199
00:12:59,750 --> 00:13:03,980
So it's easier to see it if
we had used that definition.

200
00:13:03,980 --> 00:13:07,150
So because of this
we often just say

201
00:13:07,150 --> 00:13:09,760
A and B are independent
because it doesn't matter which

202
00:13:09,760 --> 00:13:13,450
order you're taking them in.

203
00:13:13,450 --> 00:13:19,171
All right, any questions
about the definition so far?

204
00:13:19,171 --> 00:13:19,970
All right.

205
00:13:19,970 --> 00:13:21,520
Let's do some examples.

206
00:13:34,590 --> 00:13:37,380
Let's say I have two
independent fair coins.

207
00:13:47,730 --> 00:13:52,970
And I'm going to
have the event A be

208
00:13:52,970 --> 00:13:57,520
the situation when the coins
match, both heads, both tails.

209
00:14:01,700 --> 00:14:06,190
And B is going to be the event
that the first coin is heads.

210
00:14:12,420 --> 00:14:15,350
And I want to know, are
A and B independent?

211
00:14:18,300 --> 00:14:19,896
Are those independent events?

212
00:14:22,995 --> 00:14:27,650
Well, what's the
first answer to this?

213
00:14:27,650 --> 00:14:31,290
I mean, A is the event
the coins match.

214
00:14:31,290 --> 00:14:33,670
B tells me what
the first coin was.

215
00:14:33,670 --> 00:14:35,650
So the first inclination
here is that these

216
00:14:35,650 --> 00:14:39,960
are dependent events
because I know something

217
00:14:39,960 --> 00:14:41,760
about the first coin,
so that might tell me

218
00:14:41,760 --> 00:14:44,510
something about the
probability they match.

219
00:14:44,510 --> 00:14:47,080
There could be some
dependence here.

220
00:14:47,080 --> 00:14:50,110
Now, in fact, because it's
set up, they're independent

221
00:14:50,110 --> 00:14:53,390
and we can check that by
just doing the calculation,

222
00:14:53,390 --> 00:14:59,220
computing the probability of A
given B. Maybe I can do that.

223
00:14:59,220 --> 00:15:00,910
I'll do that here.

224
00:15:00,910 --> 00:15:09,320
The probability of
A given B is, well,

225
00:15:09,320 --> 00:15:12,190
the condition that they're
going to match given

226
00:15:12,190 --> 00:15:14,370
that the first
coin is heads means

227
00:15:14,370 --> 00:15:18,280
it's the same as the
second coin being heads.

228
00:15:18,280 --> 00:15:25,630
This is the probability the
second coin is heads and that's

229
00:15:25,630 --> 00:15:30,732
just 1/2 because it's a
fair coin and independent

230
00:15:30,732 --> 00:15:31,440
of the first one.

231
00:15:31,440 --> 00:15:38,370
Now, the probability of A, by
itself, the event the coins

232
00:15:38,370 --> 00:15:39,860
match, what's that?

233
00:15:39,860 --> 00:15:40,670
How much is that?

234
00:15:48,230 --> 00:15:49,855
What's the probability
the coins match?

235
00:15:52,640 --> 00:15:53,945
AUDIENCE: [INAUDIBLE].

236
00:15:53,945 --> 00:15:55,680
TOM LEIGHTON: 1/4 plus 1/4.

237
00:15:55,680 --> 00:15:58,020
I've got 1/4 chance
of heads, heads

238
00:15:58,020 --> 00:16:03,670
1/4 chance of tails, tails,
so it's 1/2, so it works out.

239
00:16:03,670 --> 00:16:08,340
The probability of A given B
equals the probability of A.

240
00:16:08,340 --> 00:16:10,200
They're both 1/2.

241
00:16:10,200 --> 00:16:14,770
So A and B are
independent events

242
00:16:14,770 --> 00:16:16,400
because that's
just the definition

243
00:16:16,400 --> 00:16:19,066
even though it looked like there
might have been some dependence

244
00:16:19,066 --> 00:16:21,170
lurking around here.
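
The same check can be done by brute force over the four outcomes. The sketch below is illustrative only: the helper names (two_coin_space, pr) are invented here, and exact fractions are used so the comparison is not blurred by floating-point rounding.

```python
from fractions import Fraction
from itertools import product


def two_coin_space(p):
    """Map each outcome (coin1, coin2) to its probability, for two
    independent coins that each come up heads with probability p."""
    prob = {"H": p, "T": 1 - p}
    return {(c1, c2): prob[c1] * prob[c2] for c1, c2 in product("HT", repeat=2)}


def pr(space, event):
    """Probability of an event, given as a predicate on outcomes."""
    return sum(q for outcome, q in space.items() if event(outcome))


def match(outcome):          # event A: the coins match
    return outcome[0] == outcome[1]


def first_heads(outcome):    # event B: the first coin is heads
    return outcome[0] == "H"


space = two_coin_space(Fraction(1, 2))
p_a = pr(space, match)
p_b = pr(space, first_heads)
p_a_given_b = pr(space, lambda o: match(o) and first_heads(o)) / p_b

print(p_a, p_a_given_b)
```

For fair coins both printed values are 1/2, which is the equality the definition asks for.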

245
00:16:21,170 --> 00:16:24,630
Now, this example that I just
did is a little misleading.

246
00:16:24,630 --> 00:16:27,950
The intuition that they
probably are dependent

247
00:16:27,950 --> 00:16:30,970
actually is good
intuition in this case

248
00:16:30,970 --> 00:16:37,351
because if I don't have fair
coins, they are dependent.

249
00:16:37,351 --> 00:16:37,850
All right.

250
00:16:37,850 --> 00:16:39,700
So in particular,
let's look at what

251
00:16:39,700 --> 00:16:45,220
happens if the probability
of a heads is p

252
00:16:45,220 --> 00:16:49,560
and the probability of tails
is 1 minus p for both coins.

253
00:16:52,070 --> 00:16:55,730
So let's compute the
probability of A given

254
00:16:55,730 --> 00:17:00,070
B. What is it in this case?

255
00:17:02,214 --> 00:17:04,380
Well, it's the probability
the second coin is heads.

256
00:17:04,380 --> 00:17:04,940
What's that?

257
00:17:08,319 --> 00:17:12,700
p because both of them are
heads with probability p.

258
00:17:12,700 --> 00:17:14,109
They're independent still.

259
00:17:14,109 --> 00:17:16,130
The two coins are independent.

260
00:17:16,130 --> 00:17:18,369
And now let's look
at the probability

261
00:17:18,369 --> 00:17:20,859
that the coins match.

262
00:17:20,859 --> 00:17:22,530
Well, it's the probability
of heads, heads

263
00:17:22,530 --> 00:17:24,520
and the probability
of tails, tails.

264
00:17:24,520 --> 00:17:28,180
Heads, heads is p times p.

265
00:17:28,180 --> 00:17:30,790
Tails, tails is
(1 minus p) squared.

266
00:17:34,050 --> 00:17:37,840
So to be independent, I
need this to equal that

267
00:17:37,840 --> 00:17:42,330
or to have the
probability of B be 0.

268
00:17:42,330 --> 00:17:50,730
So A and B are
independent if and only

269
00:17:50,730 --> 00:17:54,880
if-- the first
case is probability

270
00:17:54,880 --> 00:18:02,970
B is 0, which means
that p equals 0,

271
00:18:02,970 --> 00:18:05,510
or that has to equal this.

272
00:18:08,470 --> 00:18:15,770
So p would have to equal 1
minus 2p plus 2p squared,

273
00:18:15,770 --> 00:18:19,420
just square that out there.

274
00:18:19,420 --> 00:18:20,750
So let's solve this.

275
00:18:20,750 --> 00:18:26,590
That happens if and only
if 0 equals 1 minus 3p

276
00:18:26,590 --> 00:18:29,630
plus 2p squared.

277
00:18:29,630 --> 00:18:34,090
That's true if and only if
0 equals-- I factor this--

278
00:18:34,090 --> 00:18:40,170
it's 1 minus 2p times 1 minus
p and that's if and only

279
00:18:40,170 --> 00:18:48,510
if p is 1/2 or p
is 1, two roots.

280
00:18:48,510 --> 00:18:51,160
So if the coins are always
heads, they're independent.

281
00:18:51,160 --> 00:18:54,040
If they're always tails,
the events are independent

282
00:18:54,040 --> 00:18:58,300
or if they're fair coins, these
two events are independent.

283
00:18:58,300 --> 00:19:04,280
But anything else, they're
not independent anymore.
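
A quick way to see which biases keep the two events independent is to test a grid of values of p. This is a sketch, not part of the lecture: it uses the product-rule form of independence (so the Pr[B] = 0 case needs no special handling), and the function name and the grid of tenths are arbitrary choices.

```python
from fractions import Fraction


def match_and_first_heads_independent(p):
    """True if A = 'coins match' and B = 'first coin is heads' satisfy
    Pr[A and B] == Pr[A] * Pr[B] for two independent coins with bias p."""
    p_b = p                              # first coin is heads
    p_a = p * p + (1 - p) * (1 - p)      # heads-heads or tails-tails
    p_a_and_b = p * p                    # match and first heads = heads-heads
    return p_a_and_b == p_a * p_b


candidates = [Fraction(k, 10) for k in range(11)]   # p = 0, 1/10, ..., 1
independent_at = [p for p in candidates if match_and_first_heads_independent(p)]
print(independent_at)
```

On this grid only p = 0, p = 1/2, and p = 1 pass, in line with the roots of (1 - 2p)(1 - p) plus the Pr[B] = 0 case.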

284
00:19:04,280 --> 00:19:06,320
Any questions?

285
00:19:06,320 --> 00:19:09,970
And now you can sort
of see if the coins are

286
00:19:09,970 --> 00:19:12,290
likely to be tails
and the first one

287
00:19:12,290 --> 00:19:15,350
comes up heads, that should
influence the probability

288
00:19:15,350 --> 00:19:17,361
the coins match.

289
00:19:17,361 --> 00:19:18,245
It should change.

290
00:19:21,220 --> 00:19:22,290
Questions?

291
00:19:22,290 --> 00:19:22,790
All right.

292
00:19:22,790 --> 00:19:24,790
So there's a nice application
of this to getting

293
00:19:24,790 --> 00:19:27,860
an edge in ultimate Frisbee.

294
00:19:27,860 --> 00:19:30,400
Now, when you're
playing ultimate,

295
00:19:30,400 --> 00:19:33,740
you've got to decide who
gets the Frisbee first.

296
00:19:33,740 --> 00:19:37,590
And sometimes you don't have
a coin to flip, call heads

297
00:19:37,590 --> 00:19:40,860
or tails, but you
do have the Frisbee.

298
00:19:40,860 --> 00:19:44,950
Now, you could flip the Frisbee
and call right side up or not,

299
00:19:44,950 --> 00:19:48,690
but the problem is the Frisbee
is known not to be a fair coin.

300
00:19:48,690 --> 00:19:50,190
When you toss it
up in the air, it's

301
00:19:50,190 --> 00:19:53,980
likely to wind up on, I
guess, the curved edge down.

302
00:19:53,980 --> 00:19:57,690
So that wouldn't be fair
to call heads or tails.

303
00:19:57,690 --> 00:20:00,240
So the standard solution
is to flip two

304
00:20:00,240 --> 00:20:03,520
Frisbees at the same
time or one Frisbee twice

305
00:20:03,520 --> 00:20:07,570
and somebody calls
same or different,

306
00:20:07,570 --> 00:20:11,440
that the two Frisbees both
come up on the same way

307
00:20:11,440 --> 00:20:14,040
or they come up
different ways and then

308
00:20:14,040 --> 00:20:17,000
if you called it right, you
get to start with a Frisbee.

309
00:20:17,000 --> 00:20:22,700
And the idea behind this is
that that simulates a fair coin,

310
00:20:22,700 --> 00:20:27,640
that the probability that
they're the same is 50-50.

311
00:20:27,640 --> 00:20:28,390
What do you think?

312
00:20:28,390 --> 00:20:32,060
Is that a fair way to
decide who starts first?

313
00:20:32,060 --> 00:20:32,560
Yeah.

314
00:20:32,560 --> 00:20:33,452
AUDIENCE: No.

315
00:20:33,452 --> 00:20:34,510
TOM LEIGHTON: No.

316
00:20:34,510 --> 00:20:35,500
Yeah, that's right.

317
00:20:35,500 --> 00:20:37,740
It's not.

318
00:20:37,740 --> 00:20:43,590
Now, it is in the case
when the coin was fair,

319
00:20:43,590 --> 00:20:45,610
but we know the
Frisbee is not fair.

320
00:20:45,610 --> 00:20:50,040
And in fact, you can see
this from this probability.

321
00:20:50,040 --> 00:20:55,590
This is the probability
of a match, which

322
00:20:55,590 --> 00:21:01,190
is fine at p equal
1/2, but in fact,

323
00:21:01,190 --> 00:21:03,190
if you analyze
this equation, you

324
00:21:03,190 --> 00:21:07,950
find out its minimum
value is at p equals 1/2

325
00:21:07,950 --> 00:21:11,860
and as p starts moving away
from 1/2 towards 0 or to 1,

326
00:21:11,860 --> 00:21:14,520
it gets bigger.

327
00:21:14,520 --> 00:21:17,610
And we know that for
Frisbees, p is not 1/2.

328
00:21:17,610 --> 00:21:23,562
This means that the probability
of a match is better than 50%.

329
00:21:23,562 --> 00:21:25,020
So if you're ever
playing ultimate,

330
00:21:25,020 --> 00:21:27,343
always call same
because you're going

331
00:21:27,343 --> 00:21:29,260
to have a better than
50-50 chance of getting

332
00:21:29,260 --> 00:21:30,720
to start with the Frisbee.

333
00:21:30,720 --> 00:21:32,610
It's not a fair example.
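
To see the edge numerically, here is a small sketch that evaluates the match probability p^2 + (1 - p)^2 for a few biases. The particular values of p are made up for illustration; nothing in the lecture says what the bias of a real Frisbee is.

```python
from fractions import Fraction


def match_probability(p):
    """Probability that two independent flips with bias p land the same way."""
    return p * p + (1 - p) * (1 - p)


# Illustrative biases only; the true bias of a Frisbee is not given.
for p in (Fraction(1, 2), Fraction(3, 5), Fraction(3, 4), Fraction(9, 10)):
    print(p, match_probability(p))
```

Algebraically, p^2 + (1 - p)^2 = 1/2 + 2(p - 1/2)^2, so the match probability is never below 1/2 and equals 1/2 only for a fair flip.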

334
00:21:35,390 --> 00:21:38,000
There is another example
in homework of how to go

335
00:21:38,000 --> 00:21:42,230
from a biased coin to an
unbiased coin, ways

336
00:21:42,230 --> 00:21:44,760
of doing this that are fair.

337
00:21:44,760 --> 00:21:47,650
Because often you have
biased random numbers

338
00:21:47,650 --> 00:21:50,827
and you want to get unbiased
or maybe you got a fair coin

339
00:21:50,827 --> 00:21:52,660
and you want to make
something that comes up

340
00:21:52,660 --> 00:21:54,570
heads with probability 1/3.

341
00:21:54,570 --> 00:21:57,720
How do you actually do
that in a way that works?

342
00:21:57,720 --> 00:21:59,880
Any questions on that?

343
00:22:04,210 --> 00:22:09,190
The next example is from
the first OJ Simpson trial.

344
00:22:09,190 --> 00:22:12,930
How many people here
know who OJ Simpson is?

345
00:22:12,930 --> 00:22:15,610
OK, so he's still pretty famous.

346
00:22:15,610 --> 00:22:17,340
Now, as you probably
know then he

347
00:22:17,340 --> 00:22:19,040
was a famous football player.

348
00:22:19,040 --> 00:22:23,500
Back when I was a kid, he
was a famous college player,

349
00:22:23,500 --> 00:22:26,850
then he was a famous
pro player and then he

350
00:22:26,850 --> 00:22:29,100
was an actor, famous actor.

351
00:22:29,100 --> 00:22:34,430
And then he was accused of
murdering his wife in a gory

352
00:22:34,430 --> 00:22:37,570
knifing and a friend
of his wife's.

353
00:22:37,570 --> 00:22:41,060
And ultimately, the jury
found him not guilty,

354
00:22:41,060 --> 00:22:43,820
but pretty much everybody in
the country thought he did it.

355
00:22:43,820 --> 00:22:46,050
He looked really guilty.

356
00:22:46,050 --> 00:22:49,870
And it was a big media event,
one of the first big trial

357
00:22:49,870 --> 00:22:51,280
events on TV.

358
00:22:51,280 --> 00:22:54,070
And so all the proceedings
were on TV and everybody

359
00:22:54,070 --> 00:22:54,930
watched them.

360
00:22:54,930 --> 00:22:57,580
We'd all go home to
watch the OJ hearing.

361
00:22:57,580 --> 00:22:59,930
It was amazing.

362
00:22:59,930 --> 00:23:03,270
Now, during the
indictment proceedings,

363
00:23:03,270 --> 00:23:05,920
there was a huge dispute
over what independence

364
00:23:05,920 --> 00:23:09,650
was and does it matter.

365
00:23:09,650 --> 00:23:13,810
The issue arose when the
prosecution witness claimed

366
00:23:13,810 --> 00:23:18,900
that only 1 in 200 Americans
had a certain blood type that

367
00:23:18,900 --> 00:23:22,080
matched the blood type found at
the scene of the crime, which

368
00:23:22,080 --> 00:23:24,450
was alleged to be OJ's blood.

369
00:23:24,450 --> 00:23:26,030
And this was during
the indictment

370
00:23:26,030 --> 00:23:28,720
and back then DNA
tests took a long time

371
00:23:28,720 --> 00:23:30,590
and they weren't ready yet.

372
00:23:30,590 --> 00:23:34,380
And the witness presented
the following facts and this

373
00:23:34,380 --> 00:23:38,320
was the crime lab
guy, the police guy.

374
00:23:52,500 --> 00:23:59,355
He said that 1 in 10 people,
roughly, matched type O blood.

375
00:24:04,000 --> 00:24:10,355
And that 1 in 5 people matched
the Rh factor positive.

376
00:24:13,910 --> 00:24:20,402
And that 1 in 4 people match a
certain kind of marker, which

377
00:24:20,402 --> 00:24:21,610
I don't remember what it was.

378
00:24:21,610 --> 00:24:26,920
We'll just call it marker XYZ,
some other factor of the blood.

379
00:24:26,920 --> 00:24:33,370
And then his conclusion was
that this means that 1 in 200

380
00:24:33,370 --> 00:24:40,720
match all three factors.

381
00:24:40,720 --> 00:24:43,970
And this seems reasonable
because there's

382
00:24:43,970 --> 00:24:49,920
1/10 of the people have O, if 1/5
of them have positive Rh factor

383
00:24:49,920 --> 00:24:52,400
and then 1/4 of
all of those have

384
00:24:52,400 --> 00:24:57,430
this marker, that's 1 in 200.

385
00:24:57,430 --> 00:25:00,810
Now, it's important
because OJ's blood

386
00:25:00,810 --> 00:25:07,310
and the blood at the crime
scene both matched all three.

387
00:25:07,310 --> 00:25:08,720
So the implication,
of course, is

388
00:25:08,720 --> 00:25:11,620
that OJ is looking like
the guy who did it.

389
00:25:11,620 --> 00:25:16,670
And the question was, well,
is the 1 in 200 really true?

390
00:25:16,670 --> 00:25:19,060
We can sample these
three in the population

391
00:25:19,060 --> 00:25:23,850
and see they're true, but
is 1 in 200 really true?

392
00:25:23,850 --> 00:25:27,660
Now, it would be if,
in fact, we verified

393
00:25:27,660 --> 00:25:29,700
that 1/5 of the
type O people have

394
00:25:29,700 --> 00:25:33,480
positive and 1/4 of
the O positive people

395
00:25:33,480 --> 00:25:36,500
have the XYZ marker.

396
00:25:36,500 --> 00:25:40,230
But well, we don't necessarily
know that unless we

397
00:25:40,230 --> 00:25:42,820
go figure that out.

398
00:25:42,820 --> 00:25:45,030
If you assume
they're independent,

399
00:25:45,030 --> 00:25:46,192
then it would be true.

400
00:25:46,192 --> 00:25:47,900
The product rule will
tell us that if you

401
00:25:47,900 --> 00:25:50,280
assume they're independent.
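
The gap between the two calculations can be made concrete with a sketch. The three marginal frequencies are the ones quoted in the lecture; the conditional frequencies are HYPOTHETICAL numbers invented here purely to show how dependence changes the answer, and are not real blood-type data.

```python
from fractions import Fraction

# Marginal frequencies as quoted by the witness in the lecture.
p_type_o = Fraction(1, 10)
p_rh_pos = Fraction(1, 5)
p_xyz = Fraction(1, 4)

# Product rule: valid only if the three factors are independent.
independent_estimate = p_type_o * p_rh_pos * p_xyz

# HYPOTHETICAL conditional frequencies (made up for illustration):
# suppose the factors were correlated, so that among type O people
# half are Rh positive, and among O-positive people half carry XYZ.
p_rh_pos_given_o = Fraction(1, 2)
p_xyz_given_o_rh_pos = Fraction(1, 2)

# General multiplication rule: always valid, but needs the conditionals.
chain_rule_estimate = p_type_o * p_rh_pos_given_o * p_xyz_given_o_rh_pos

print(independent_estimate, chain_rule_estimate)
```

With these invented conditionals the match rate would be 1 in 40 rather than 1 in 200, which is why the independence question matters.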

402
00:25:50,280 --> 00:25:55,290
So during the trial, a special
math defense counsel showed up,

403
00:25:55,290 --> 00:25:56,790
not part of the
normal defense team,

404
00:25:56,790 --> 00:26:00,480
but he was brought in as
a mathematician and lawyer

405
00:26:00,480 --> 00:26:04,570
and he crosses the
police guy on the stand.

406
00:26:04,570 --> 00:26:07,480
And he asked the
police guy, the lab guy

407
00:26:07,480 --> 00:26:13,350
if it is known that these
three factors are independent.

408
00:26:13,350 --> 00:26:15,162
Well, the poor
police lab guy never

409
00:26:15,162 --> 00:26:16,870
heard the word
independent before, didn't

410
00:26:16,870 --> 00:26:20,380
know what it meant and the
defense counsel proceeded

411
00:26:20,380 --> 00:26:22,310
to crucify him on the stand.

412
00:26:22,310 --> 00:26:24,310
And then in the end, all
he could say was, look,

413
00:26:24,310 --> 00:26:26,226
we just get these things
and we multiply them.

414
00:26:26,226 --> 00:26:29,500
That's what we're
supposed to do.

415
00:26:29,500 --> 00:26:30,950
It was a little scary.

416
00:26:30,950 --> 00:26:33,120
The actual transcript--
you can still get it--

417
00:26:33,120 --> 00:26:34,620
is a little scary.

418
00:26:34,620 --> 00:26:37,990
The same problem arises
today with DNA testing.

419
00:26:37,990 --> 00:26:41,880
Only there, you've got
lots of these things

420
00:26:41,880 --> 00:26:43,720
and you multiply
them all together

421
00:26:43,720 --> 00:26:45,210
and you get
probabilities like one

422
00:26:45,210 --> 00:26:50,260
in many billion
probability of a match.

423
00:26:50,260 --> 00:26:53,410
Now, there's probably a higher
level of science going on

424
00:26:53,410 --> 00:26:56,370
with DNA testing,
but it's even harder

425
00:26:56,370 --> 00:26:59,440
to really establish
independence.

426
00:26:59,440 --> 00:27:01,560
If you assume it, fine.

427
00:27:01,560 --> 00:27:02,690
The math works out great.

428
00:27:02,690 --> 00:27:04,280
You just multiply them together.

429
00:27:04,280 --> 00:27:06,840
But how do you know
it's really true?

430
00:27:06,840 --> 00:27:10,330
How do you know that maybe a lot
of people that have those four

431
00:27:10,330 --> 00:27:14,120
markers in their DNA don't happen
to just have the fifth also,

432
00:27:14,120 --> 00:27:17,160
but it really is
totally unrelated.

433
00:27:17,160 --> 00:27:19,210
And to know that
for sure, you got

434
00:27:19,210 --> 00:27:23,060
to test hundreds of millions of
people, which we really haven't

435
00:27:23,060 --> 00:27:26,320
done yet, and not just
a few guys in Detroit

436
00:27:26,320 --> 00:27:29,280
to be able to conclude
independence of 1

437
00:27:29,280 --> 00:27:31,800
in a billion probabilities.

438
00:27:31,800 --> 00:27:33,160
So for us, this is a lot easier.

439
00:27:33,160 --> 00:27:34,826
In the classroom, we
assume independence

440
00:27:34,826 --> 00:27:37,550
and we'll keep doing
that left and right,

441
00:27:37,550 --> 00:27:40,620
but it doesn't mean
it's true in reality.

442
00:27:40,620 --> 00:27:43,380
In fact, in the
last week of class,

443
00:27:43,380 --> 00:27:46,720
we'll talk about how a false
assumption of independence

444
00:27:46,720 --> 00:27:50,490
on mortgage failures led to
the subprime mortgage disaster

445
00:27:50,490 --> 00:27:51,780
in the recession.

446
00:27:51,780 --> 00:27:54,162
It was all because of
some mathematics mistakes

447
00:27:54,162 --> 00:27:54,870
that people made.

448
00:27:57,530 --> 00:27:59,910
Now, this example
raises the question of,

449
00:27:59,910 --> 00:28:04,319
what does independence mean when
you have more than two events?

450
00:28:04,319 --> 00:28:06,360
We defined independence
when there is two events,

451
00:28:06,360 --> 00:28:08,280
but here there's three.

452
00:28:08,280 --> 00:28:11,450
And so to be careful, we
got to actually define

453
00:28:11,450 --> 00:28:16,010
independence among more than
two events and in this case,

454
00:28:16,010 --> 00:28:20,990
we talk about the events as
being mutually independent.

455
00:28:20,990 --> 00:28:22,090
So let me define that.

456
00:28:36,770 --> 00:28:44,330
So if I've got events
A1, A2, up to An,

457
00:28:44,330 --> 00:28:56,120
we say they are
mutually independent if,

458
00:28:56,120 --> 00:29:02,490
and this is a little complicated
notation, but for all i

459
00:29:02,490 --> 00:29:10,740
and for all sets J that
are subsets of the events,

460
00:29:10,740 --> 00:29:19,380
but not including i,
then the probability

461
00:29:19,380 --> 00:29:23,890
that the i-th event occurs
given that all the events

462
00:29:23,890 --> 00:29:29,010
in the subset
occurred, is the same

463
00:29:29,010 --> 00:29:33,280
as the probability of the i-th
event occurring by itself.

464
00:29:33,280 --> 00:29:35,570
Or there's a special
case where the chance

465
00:29:35,570 --> 00:29:37,100
the other events occur is 0.

466
00:29:46,710 --> 00:29:48,740
In other words, a
collection of events

467
00:29:48,740 --> 00:29:52,220
is mutually independent
if any knowledge

468
00:29:52,220 --> 00:29:55,960
about any of the rest of the
events, happening or not,

469
00:29:55,960 --> 00:29:58,470
does not influence the
event you're looking

470
00:29:58,470 --> 00:30:00,970
at for each of those events.

471
00:30:00,970 --> 00:30:03,080
So no information about
any of the other markers

472
00:30:03,080 --> 00:30:06,980
in the blood influences the
i-th marker for any i.

473
00:30:06,980 --> 00:30:09,770
The probabilities are unchanged.

474
00:30:09,770 --> 00:30:11,710
Now, there's an equivalent
definition based

475
00:30:11,710 --> 00:30:14,691
on the product rule.

476
00:30:14,691 --> 00:30:16,860
Let me show you that version
because that's easier

477
00:30:16,860 --> 00:30:17,734
to work with usually.

478
00:30:32,850 --> 00:30:43,880
This is the product
rule form and it

479
00:30:43,880 --> 00:30:50,550
says that A1, A2, up
to An are mutually

480
00:30:50,550 --> 00:31:07,560
independent if for any
subset of the events

481
00:31:07,560 --> 00:31:11,880
the probability of each of those
events in the subset happening,

482
00:31:11,880 --> 00:31:18,364
all of them happening,
is simply the product

483
00:31:18,364 --> 00:31:19,780
of their individual
probabilities.
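In symbols, the product-rule form just stated can be written as below. The index set J is notation introduced here for the chosen subset; it is not on the board in the lecture.

```latex
\[
\Pr\Bigl[\,\bigcap_{j \in J} A_j\Bigr] \;=\; \prod_{j \in J} \Pr[A_j]
\qquad \text{for every subset } J \subseteq \{1, 2, \dots, n\}.
\]
```

Subsets with zero or one event satisfy this trivially, so in practice only the subsets containing two or more events need to be checked.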

484
00:31:26,810 --> 00:31:30,910
So independence
means that if you

485
00:31:30,910 --> 00:31:33,430
want the probability of a
bunch of events occurring,

486
00:31:33,430 --> 00:31:35,996
just multiply them
out individually.

487
00:31:35,996 --> 00:31:37,370
And that follows
from independence

488
00:31:37,370 --> 00:31:39,600
or it could be the
definition of independence,

489
00:31:39,600 --> 00:31:41,080
depending on how
you want to do it.

490
00:31:41,080 --> 00:31:42,496
So either of these
are good enough

491
00:31:42,496 --> 00:31:47,200
for you to use as a definition
or a result for independence.

492
00:31:47,200 --> 00:31:50,030
And so the blood guy, of course,
is just multiplying them out

493
00:31:50,030 --> 00:31:52,390
because they're assumed
to be independent,

494
00:31:52,390 --> 00:31:55,180
so it's OK that way.

495
00:31:55,180 --> 00:31:56,125
Let's do an example.

496
00:32:08,920 --> 00:32:10,900
So for example, say
we have three events.

497
00:32:15,160 --> 00:32:25,430
A1, A2, and A3 are
mutually independent

498
00:32:25,430 --> 00:32:28,450
if, these are the things
you have to check,

499
00:32:28,450 --> 00:32:35,144
probability A1 and A2
is just the probability

500
00:32:35,144 --> 00:32:36,560
of A1 times the
probability of A2.

501
00:32:39,160 --> 00:32:44,900
Then you'd check that the
probability of A1 and A3

502
00:32:44,900 --> 00:32:48,390
is the product of their
probabilities, A1 and A3.

503
00:32:54,170 --> 00:32:57,250
And you'd check the
probability of A2

504
00:32:57,250 --> 00:33:00,460
and A3 is the product
of their probabilities.

505
00:33:06,050 --> 00:33:07,550
And there's one
more thing to check.

506
00:33:07,550 --> 00:33:10,290
What's that?

507
00:33:10,290 --> 00:33:12,120
All of them.

508
00:33:12,120 --> 00:33:23,090
The probability of all of them
is the product of each of them

509
00:33:23,090 --> 00:33:23,845
together here.

510
00:33:27,940 --> 00:33:29,870
So if you want to show
the three events are

511
00:33:29,870 --> 00:33:33,410
mutually independent, these
are the four things you check.

512
00:33:33,410 --> 00:33:37,598
That's one way to do it, which
is the case of the blood typing

513
00:33:37,598 --> 00:33:38,306
in the situation.

514
00:33:40,816 --> 00:33:41,530
All right.

515
00:33:41,530 --> 00:33:43,630
Let's do an example.

516
00:33:49,470 --> 00:33:53,510
Well, for example, if
I flip three unbiased,

517
00:33:53,510 --> 00:33:55,290
mutually independent coins.

518
00:33:55,290 --> 00:33:58,170
The probability of two of
them being heads is 1/4.

519
00:33:58,170 --> 00:34:01,760
The probability of three being
heads is 1/8 and so forth.

520
00:34:05,010 --> 00:34:07,750
Let's do a trickier example.

521
00:34:07,750 --> 00:34:12,000
This is a question that was on
the final exam a few years ago

522
00:34:12,000 --> 00:34:15,340
and a lot of the
class missed it.

523
00:34:15,340 --> 00:34:16,630
So now we'll do it here.

524
00:34:21,610 --> 00:34:36,719
Say I flip three fair, mutually
independent coins and my events

525
00:34:36,719 --> 00:34:45,380
are going to be A1 is the
event coin 1 matches coin 2.

526
00:34:51,449 --> 00:34:54,730
The second event,
A2, is the event

527
00:34:54,730 --> 00:34:59,420
that coin 2 matches coin 3.

528
00:34:59,420 --> 00:35:03,750
And the third event,
A3, is the event

529
00:35:03,750 --> 00:35:06,865
that coin 3 matches coin 1.

530
00:35:09,980 --> 00:35:15,180
And the question was,
are these three events

531
00:35:15,180 --> 00:35:18,040
mutually independent?

532
00:35:18,040 --> 00:35:21,040
Prove your answer.

533
00:35:21,040 --> 00:35:22,345
Let's try to figure that out.

534
00:35:31,852 --> 00:35:33,810
The coins, of course,
are mutually independent,

535
00:35:33,810 --> 00:35:36,150
but what about these events?

536
00:35:36,150 --> 00:35:37,640
So let's start doing it.

537
00:35:37,640 --> 00:35:41,370
What's the probability of one
of the events occurring?

538
00:35:44,510 --> 00:35:49,829
Well, you got to get the
two coins at hand to match,

539
00:35:49,829 --> 00:35:51,370
so that's the
probability of a heads,

540
00:35:51,370 --> 00:35:56,710
heads plus the probability
of a tails, tails.

541
00:35:56,710 --> 00:36:00,580
That's 1/4 plus 1/4 equals 1/2.

542
00:36:05,920 --> 00:36:13,280
Now, the probability of Ai
and Aj, i and j are 1 to 3,

543
00:36:13,280 --> 00:36:16,740
they're different,
but what is a way

544
00:36:16,740 --> 00:36:20,120
of characterizing that case?

545
00:36:20,120 --> 00:36:22,520
Say event 1 occurred
and event 2 occurred,

546
00:36:22,520 --> 00:36:23,770
how would I characterize that?

547
00:36:28,111 --> 00:36:28,610
Yeah.

548
00:36:28,610 --> 00:36:29,782
AUDIENCE: All the same.

549
00:36:29,782 --> 00:36:30,913
TOM LEIGHTON: All of them.

550
00:36:30,913 --> 00:36:31,412
Yeah.

551
00:36:31,412 --> 00:36:34,410
All of the coins are the same
because if A1 and A2 occur,

552
00:36:34,410 --> 00:36:37,290
I know 1 matches
2 and 2 matches 3.

553
00:36:37,290 --> 00:36:40,910
If A1 and A3 happen, 1
matches 2 and 1 matches 3,

554
00:36:40,910 --> 00:36:43,830
so they're all the same
and the same for A2 and A3.

555
00:36:43,830 --> 00:36:46,630
If 2 matches 3 and 3 matches
1, they're all the same.

556
00:36:46,630 --> 00:36:50,430
So this is the same as saying
all three coins are the same.

557
00:36:53,890 --> 00:36:57,160
It could all be heads
or all be tails.

558
00:36:57,160 --> 00:37:02,290
And that's 1/8
plus 1/8, which is 1/4

559
00:37:02,290 --> 00:37:07,900
and that equals
the probability of Ai

560
00:37:07,900 --> 00:37:11,780
times the probability
of Aj, which is

561
00:37:11,780 --> 00:37:15,850
what I need for independence.

562
00:37:15,850 --> 00:37:19,070
And then they said they're done.

563
00:37:19,070 --> 00:37:23,520
They are independent,
the three events.

564
00:37:23,520 --> 00:37:25,640
You like that answer?

565
00:37:25,640 --> 00:37:28,260
What's missing?

566
00:37:28,260 --> 00:37:29,410
The last case.

567
00:37:29,410 --> 00:37:31,889
They didn't check
the last case and we

568
00:37:31,889 --> 00:37:33,680
got to do that to have
mutual independence.

569
00:37:33,680 --> 00:37:35,681
So let's look at that.

570
00:37:35,681 --> 00:37:42,060
The last case is probability
A1 intersect A2 intersect A3.

571
00:37:42,060 --> 00:37:45,180
What is the probability
that all three events occur?

572
00:37:49,950 --> 00:37:55,560
Well, the coins all
have to match, right?

573
00:37:55,560 --> 00:37:59,816
If all the coins match, all
three events occur, right?

574
00:37:59,816 --> 00:38:01,690
And what's the probability
all 3 coins match?

575
00:38:04,260 --> 00:38:07,770
1/4, just the same
as this, is 1/4.

576
00:38:07,770 --> 00:38:12,090
Does that equal probability
of A1 times the probability

577
00:38:12,090 --> 00:38:15,520
of A2 times the
probability of A3?

578
00:38:20,090 --> 00:38:22,120
What's that?

579
00:38:22,120 --> 00:38:23,680
1/8.

580
00:38:23,680 --> 00:38:25,620
This is 1/8.

581
00:38:25,620 --> 00:38:26,590
They are not equal.

582
00:38:29,270 --> 00:38:32,779
They are not mutually
independent events.
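The check worked through above can be reproduced by brute force. This is a minimal sketch in Python, assuming the eight outcomes of three fair, mutually independent coins are equally likely; the helper names `prob` and `product_rule_holds` are made up for this illustration.

```python
from fractions import Fraction
from itertools import combinations, product

# Sample space: 8 equally likely outcomes of three fair, mutually independent coins.
outcomes = list(product("HT", repeat=3))

# Events from the lecture, as predicates on an outcome (coin 1, coin 2, coin 3).
events = {
    "A1": lambda o: o[0] == o[1],  # coin 1 matches coin 2
    "A2": lambda o: o[1] == o[2],  # coin 2 matches coin 3
    "A3": lambda o: o[2] == o[0],  # coin 3 matches coin 1
}

def prob(names):
    """Probability that every named event occurs, under the uniform measure."""
    hits = sum(1 for o in outcomes if all(events[n](o) for n in names))
    return Fraction(hits, len(outcomes))

def product_rule_holds(names):
    """Check Pr[all named events] == product of the individual probabilities."""
    expected = Fraction(1)
    for n in names:
        expected *= prob([n])
    return prob(names) == expected

# Pairwise independence: the product rule holds for every pair.
pairwise = all(product_rule_holds(pair) for pair in combinations(events, 2))
# Mutual independence additionally needs the rule for all three together.
mutual = pairwise and product_rule_holds(list(events))

print("pairwise independent:", pairwise)
print("mutually independent:", mutual)
```

Every pair passes, but the triple has probability 1/4 rather than 1/8, so the last check fails.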

583
00:38:32,779 --> 00:38:33,725
All right?

584
00:38:37,520 --> 00:38:39,702
Any questions about that?

585
00:38:39,702 --> 00:38:42,810
It might well be something like
this on the final this year,

586
00:38:42,810 --> 00:38:44,960
a good, decent chance.

587
00:38:44,960 --> 00:38:47,610
So if you start going along,
looks like they're independent,

588
00:38:47,610 --> 00:38:49,735
but you forget to check
that last case, which shows

589
00:38:49,735 --> 00:38:52,580
they're not mutually independent.

590
00:38:52,580 --> 00:38:56,140
So you've got to check for all
pairs and all subsets of events

591
00:38:56,140 --> 00:38:57,140
for mutual independence.

592
00:39:00,040 --> 00:39:01,960
Any questions about that?

593
00:39:06,370 --> 00:39:09,960
Now, this is actually
an interesting example

594
00:39:09,960 --> 00:39:14,690
because in this case, all
pairs were independent

595
00:39:14,690 --> 00:39:18,800
and when that happens, we give
that a special name and it's

596
00:39:18,800 --> 00:39:22,415
called pairwise independence,
not too surprising.

597
00:39:22,415 --> 00:39:26,180
And that can be
useful because there's

598
00:39:26,180 --> 00:39:28,340
many times where you do
get pairwise independence,

599
00:39:28,340 --> 00:39:30,502
but not mutual independence.

600
00:39:30,502 --> 00:39:31,960
So let me give you
that definition.

601
00:39:38,760 --> 00:39:44,460
So a collection of
events A1 through An

602
00:39:44,460 --> 00:39:55,610
are said to be
pairwise independent

603
00:39:55,610 --> 00:40:04,980
if for all i and j,
where i doesn't equal j,

604
00:40:04,980 --> 00:40:08,330
Ai and Aj are independent.

605
00:40:14,230 --> 00:40:17,620
Now, as we saw in this
example, in this example,

606
00:40:17,620 --> 00:40:21,660
it was pairwise independence
because the probability

607
00:40:21,660 --> 00:40:25,490
of Ai and Aj equaled the
probability of Ai times

608
00:40:25,490 --> 00:40:26,460
the probability of Aj.

609
00:40:26,460 --> 00:40:29,130
For any pair, it was true.

610
00:40:29,130 --> 00:40:32,130
But it doesn't imply
mutual independence.

611
00:40:32,130 --> 00:40:36,160
So pairwise does
not imply mutual.

612
00:40:38,870 --> 00:40:41,240
Mutual would imply
pairwise because it's

613
00:40:41,240 --> 00:40:43,515
true for every subset of events.

614
00:40:46,550 --> 00:40:47,050
All right.

615
00:40:47,050 --> 00:40:53,670
So let's go back to OJ and
see what would have happened.

616
00:40:53,670 --> 00:40:55,990
What can you say about
the probability of a blood

617
00:40:55,990 --> 00:40:59,420
match for a random
person if you only

618
00:40:59,420 --> 00:41:02,230
knew that these factors
were pairwise independent?

619
00:41:04,910 --> 00:41:06,182
Say you only knew that.

620
00:41:06,182 --> 00:41:08,140
You didn't know they were
mutually independent,

621
00:41:08,140 --> 00:41:10,723
but you knew they were pairwise
independent in the population.

622
00:41:14,959 --> 00:41:17,000
What's the best you can
say about the probability

623
00:41:17,000 --> 00:41:21,130
a random person matches that
blood profile, an upper bound

624
00:41:21,130 --> 00:41:23,315
on the probability?

625
00:41:23,315 --> 00:41:23,815
Yeah.

626
00:41:23,815 --> 00:41:24,776
AUDIENCE: 1 in 50.

627
00:41:24,776 --> 00:41:26,100
TOM LEIGHTON: 1 in 50.

628
00:41:26,100 --> 00:41:27,050
Yeah.

629
00:41:27,050 --> 00:41:30,810
So what you can say is 1
in 50, but nothing better.

630
00:41:30,810 --> 00:41:32,785
So let's see why 1 in 50 works.

631
00:41:38,490 --> 00:41:43,370
So let's let M1 be the
event you match here,

632
00:41:43,370 --> 00:41:45,490
M2 be the event you
match there, and M3

633
00:41:45,490 --> 00:41:48,300
be the event you match that.

634
00:41:48,300 --> 00:41:55,000
The probability you
match all three is

635
00:41:55,000 --> 00:41:56,600
upper bounded by
the probability you

636
00:41:56,600 --> 00:42:03,360
match the first two
because matching all three

637
00:42:03,360 --> 00:42:05,230
is a subset of this.

638
00:42:08,220 --> 00:42:11,390
Pairwise independence
means that this is true.

639
00:42:11,390 --> 00:42:13,690
This equals the
probability of matching

640
00:42:13,690 --> 00:42:16,719
the first times the probability
of matching the second.

641
00:42:16,719 --> 00:42:18,260
The probability of
matching the first

642
00:42:18,260 --> 00:42:22,690
is 1/10, probability of
matching the second is 1/5,

643
00:42:22,690 --> 00:42:23,465
so this is 1/50.

644
00:42:26,700 --> 00:42:29,880
And you picked the best two.

645
00:42:29,880 --> 00:42:34,810
You could have picked these two
and said it was at most 1/20

646
00:42:34,810 --> 00:42:38,070
or those two and said
it's at most 1/40.

647
00:42:38,070 --> 00:42:41,760
But you were clever and said,
OK, I'm going to take these two

648
00:42:41,760 --> 00:42:44,590
and use that as my upper
bound, which is 1/50.

649
00:42:44,590 --> 00:42:51,450
And it might well be that 1
in 50 people match all three.

650
00:42:51,450 --> 00:42:54,010
That can well be.

651
00:42:54,010 --> 00:42:57,910
Because maybe whenever you're O
positive, you have marker XYZ.

652
00:42:57,910 --> 00:43:01,725
That's possible, potentially,
unless we find out otherwise.

653
00:43:05,830 --> 00:43:10,149
What if I tell you that you can't
assume any independence at all?

654
00:43:10,149 --> 00:43:12,440
What can you say about the
probability of a blood match

655
00:43:12,440 --> 00:43:14,400
here for a random person?

656
00:43:14,400 --> 00:43:14,900
Yeah.

657
00:43:14,900 --> 00:43:15,525
AUDIENCE: 1/10.

658
00:43:15,525 --> 00:43:16,566
TOM LEIGHTON: What is it?

659
00:43:16,566 --> 00:43:17,270
AUDIENCE: 1/10.

660
00:43:17,270 --> 00:43:19,500
TOM LEIGHTON: 1/10.

661
00:43:19,500 --> 00:43:21,230
Because if they
match all three, they

662
00:43:21,230 --> 00:43:26,470
match this and that probability
is 1/10, so it's at most 1/10.

663
00:43:26,470 --> 00:43:27,970
And it could be
that everybody who's

664
00:43:27,970 --> 00:43:31,810
O is O positive and has XYZ.

665
00:43:31,810 --> 00:43:34,704
So unless you have
more information,

666
00:43:34,704 --> 00:43:35,870
that's the best you can say.

667
00:43:35,870 --> 00:43:38,822
It might well be
that's the answer.

668
00:43:41,720 --> 00:43:46,120
Any questions about that?

669
00:43:46,120 --> 00:43:47,600
So the assumptions
really matter.

670
00:43:47,600 --> 00:43:50,200
The more independence
you assume,

671
00:43:50,200 --> 00:43:53,370
the better the bound on the
probability you get of a match.
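The three levels of assumption can be compared side by side. This is a rough sketch, not part of the lecture: the probabilities 1/10 and 1/5 for the first two factors are as stated above, while the value 1/4 for the third is an inference from the quoted pair bounds of 1/20 and 1/40, and the variable names are made up here.

```python
from fractions import Fraction
from itertools import combinations

# Marginal match probabilities. M1 and M2 are stated in the lecture;
# M3 = 1/4 is inferred from the quoted pair bounds of 1/20 and 1/40.
p = {"M1": Fraction(1, 10), "M2": Fraction(1, 5), "M3": Fraction(1, 4)}

# No independence assumed: matching all three implies matching each single
# factor, so the rarest single factor is the best available upper bound.
bound_none = min(p.values())

# Pairwise independence: matching all three implies matching each pair,
# and each pair's probability is the product of its two marginals.
bound_pairwise = min(p[a] * p[b] for a, b in combinations(p, 2))

# Mutual independence: the product rule gives the exact probability.
exact_mutual = p["M1"] * p["M2"] * p["M3"]

print("no independence, at most:", bound_none)
print("pairwise independence, at most:", bound_pairwise)
print("mutual independence, exactly:", exact_mutual)
```

The first two numbers are only upper bounds; the true probability could be anywhere at or below them.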

672
00:43:57,106 --> 00:43:58,660
It's a little bit
unrelated to this,

673
00:43:58,660 --> 00:44:00,990
but there was another
mathematics dispute

674
00:44:00,990 --> 00:44:03,380
at the OJ trial.

675
00:44:03,380 --> 00:44:05,750
It turned out that
OJ had been beating up

676
00:44:05,750 --> 00:44:08,479
Nicole on a fairly regular basis
and there were police records

677
00:44:08,479 --> 00:44:10,020
because after he'd
beat her up, she'd

678
00:44:10,020 --> 00:44:13,250
go in and complain
to the police.

679
00:44:13,250 --> 00:44:18,410
And the prosecution wanted this
evidence admitted at the trial

680
00:44:18,410 --> 00:44:22,410
because if the guy
is a wife beater,

681
00:44:22,410 --> 00:44:25,030
it makes you think that
maybe he killed her.

682
00:44:25,030 --> 00:44:28,810
And the defense lawyers argued
against admitting that evidence

683
00:44:28,810 --> 00:44:33,070
because it wasn't tied to the
actual murder scene in any way

684
00:44:33,070 --> 00:44:35,900
and they argued it would
be prejudicial to the jury

685
00:44:35,900 --> 00:44:39,380
because, of course, if the jury
hears that OJ was beating her,

686
00:44:39,380 --> 00:44:41,650
they might be more likely
to be inclined to convict him

687
00:44:41,650 --> 00:44:43,690
for murdering her.

688
00:44:43,690 --> 00:44:46,620
Now, they got the
math counsel again

689
00:44:46,620 --> 00:44:50,420
to argue that the reason
you shouldn't admit this

690
00:44:50,420 --> 00:44:54,160
is because the
probability that you

691
00:44:54,160 --> 00:45:02,350
kill your wife, that's K, given
that you batter your wife,

692
00:45:02,350 --> 00:45:07,622
that's B, is 1 in 2,000.

693
00:45:07,622 --> 00:45:09,080
I would have guessed
it was higher,

694
00:45:09,080 --> 00:45:11,680
but the evidence did show that.

695
00:45:11,680 --> 00:45:15,250
And so they said, look, there's
only a 1 in 2,000 chance

696
00:45:15,250 --> 00:45:19,640
that this evidence of
wife beating is relevant

697
00:45:19,640 --> 00:45:23,190
and therefore, it should not
be admitted because there's

698
00:45:23,190 --> 00:45:25,170
a pretty decent chance
if the jury hears this,

699
00:45:25,170 --> 00:45:27,340
they're going to convict him.

700
00:45:27,340 --> 00:45:28,702
That's a pretty good argument.

701
00:45:28,702 --> 00:45:30,660
And usually that kind of
thing, you exclude it.

702
00:45:30,660 --> 00:45:31,090
Yeah.

703
00:45:31,090 --> 00:45:32,455
AUDIENCE: Where did
that number come from?

704
00:45:32,455 --> 00:45:34,880
TOM LEIGHTON: They got
some study and some experts

705
00:45:34,880 --> 00:45:38,932
to come in and say that for
every 2,000 wife beaters,

706
00:45:38,932 --> 00:45:40,640
only one of them
actually kills his wife.

707
00:45:44,340 --> 00:45:48,179
Now, what do you suppose
the prosecution argued back?

708
00:45:48,179 --> 00:45:49,970
They actually argued
back very effectively,

709
00:45:49,970 --> 00:45:53,031
because that's a tough
argument to get by.

710
00:45:53,031 --> 00:45:53,530
Yeah.

711
00:45:53,530 --> 00:45:56,105
AUDIENCE: What's the probability
that you kill your wife

712
00:45:56,105 --> 00:46:00,545
in the first place, that could
be 100 times larger than usual.

713
00:46:00,545 --> 00:46:02,900
TOM LEIGHTON: Well,
that's a good point.

714
00:46:02,900 --> 00:46:05,210
So maybe the probability
of killing your wife

715
00:46:05,210 --> 00:46:11,370
not knowing B, I hope is
pretty small, probably

716
00:46:11,370 --> 00:46:15,534
that's very small,
but I don't know.

717
00:46:15,534 --> 00:46:17,450
But in any case, this
thing you're going from,

718
00:46:17,450 --> 00:46:20,650
say it's 1 in 1 million
to 1 in 2,000, 1 in 2,000

719
00:46:20,650 --> 00:46:25,084
is still too small to be used
as evidence that OJ did it.

720
00:46:25,084 --> 00:46:27,137
AUDIENCE: Frequency he did it.

721
00:46:27,137 --> 00:46:29,220
TOM LEIGHTON: Frequency,
they didn't get into that

722
00:46:29,220 --> 00:46:31,840
because I guess he'd done it a
bunch, but that's a good point.

723
00:46:31,840 --> 00:46:34,720
It could be that with
multiple beatings it's higher.

724
00:46:34,720 --> 00:46:37,285
Maybe that's 1 in 200 then.

725
00:46:37,285 --> 00:46:39,160
In fact, that may be
the case because I think

726
00:46:39,160 --> 00:46:41,326
there's probably they say
because if you do it once,

727
00:46:41,326 --> 00:46:42,870
you do it multiple times.

728
00:46:42,870 --> 00:46:44,940
So there's not much more
to be gained there.

729
00:46:44,940 --> 00:46:46,760
There's a critical
piece of information

730
00:46:46,760 --> 00:46:50,276
we've left out of our
conditional probabilities here.

731
00:46:50,276 --> 00:46:54,297
In fact, the most glaring
piece of evidence of all.

732
00:46:54,297 --> 00:46:55,130
What's missing here?

733
00:46:55,130 --> 00:46:56,440
What haven't we factored in?

734
00:46:56,440 --> 00:46:56,940
Yeah.

735
00:46:56,940 --> 00:46:58,840
AUDIENCE: The probability of B.

736
00:46:58,840 --> 00:47:03,490
TOM LEIGHTON: The probability
of B, that's the battering.

737
00:47:03,490 --> 00:47:07,550
Battering, I don't know what
it is, probably a large number.

738
00:47:07,550 --> 00:47:09,450
Defense would argue
it's large, I guess,

739
00:47:09,450 --> 00:47:13,664
but it shouldn't
matter that much.

740
00:47:13,664 --> 00:47:17,010
AUDIENCE: The probability
that he actually beat her,

741
00:47:17,010 --> 00:47:18,755
given that she threatened him?

742
00:47:18,755 --> 00:47:20,130
TOM LEIGHTON:
Well, there's that,

743
00:47:20,130 --> 00:47:21,949
but they have police--
well, that's true.

744
00:47:21,949 --> 00:47:23,740
They didn't see him
doing it, but let's say

745
00:47:23,740 --> 00:47:25,820
that they had good
evidence that he did it

746
00:47:25,820 --> 00:47:30,750
and defense wasn't arguing
that he didn't really beat her.

747
00:47:30,750 --> 00:47:35,770
The key thing we're missing
here is Nicole wound up dead.

748
00:47:35,770 --> 00:47:37,780
She was dead.

749
00:47:37,780 --> 00:47:40,790
And there's another stat here
that the prosecution argued.

750
00:47:52,040 --> 00:47:53,730
So they argued this fact.

751
00:47:53,730 --> 00:47:56,570
The probability the
husband kills his wife,

752
00:47:56,570 --> 00:48:00,980
given that he batters her
and she wound up dead,

753
00:48:00,980 --> 00:48:05,100
that somebody murdered
her, is bigger than 1/2.

754
00:48:05,100 --> 00:48:08,950
So here M is somebody
murdered the wife.

755
00:48:08,950 --> 00:48:11,030
Here, the husband beats her.

756
00:48:11,030 --> 00:48:13,110
Now, the conditional
probability that he

757
00:48:13,110 --> 00:48:16,540
killed her is bigger than
1/2 and that's a whopper.

758
00:48:16,540 --> 00:48:18,504
Now, it's very relevant.

759
00:48:18,504 --> 00:48:19,920
The probability
he killed her just

760
00:48:19,920 --> 00:48:22,290
given that he beat her
is only 1 in 2,000,

761
00:48:22,290 --> 00:48:25,230
but if you add the fact, which
is very relevant in this case,

762
00:48:25,230 --> 00:48:30,085
that the wife was murdered,
this is now very compelling.

763
00:48:30,085 --> 00:48:31,960
Now, in fact, they should
have really compared

764
00:48:31,960 --> 00:48:39,840
this to probability he kills
her given that she's dead.

765
00:48:39,840 --> 00:48:43,430
And so that would determine now
the relevance of the battering,

766
00:48:43,430 --> 00:48:45,218
the wife beating.

767
00:48:45,218 --> 00:48:47,342
That's what they should
have done, but they didn't.

768
00:48:47,342 --> 00:48:49,845
They got this far and they
had that and the judge said,

769
00:48:49,845 --> 00:48:51,610
I'm letting it in.

770
00:48:51,610 --> 00:48:53,450
So it came in at that point.

771
00:48:53,450 --> 00:48:55,600
But this would be the
right comparison, I think.

772
00:48:55,600 --> 00:48:57,058
Because you look
at the probability

773
00:48:57,058 --> 00:49:00,287
that you killed her
given that she's dead,

774
00:49:00,287 --> 00:49:02,120
but now the additional
information, the wife

775
00:49:02,120 --> 00:49:04,227
battering, how does that
change the probability?

776
00:49:04,227 --> 00:49:05,810
And it probably
changes it materially.

777
00:49:08,480 --> 00:49:10,450
So it's all a little
gory, but it's

778
00:49:10,450 --> 00:49:13,624
interesting to see how
mathematics played out

779
00:49:13,624 --> 00:49:14,790
in this kind of environment.

780
00:49:14,790 --> 00:49:15,289
Yeah.

781
00:49:15,289 --> 00:49:17,455
AUDIENCE: Are we supposed
to assume that he did

782
00:49:17,455 --> 00:49:18,814
kill his wife?

783
00:49:18,814 --> 00:49:20,700
TOM LEIGHTON: Yes,
and they assumed that,

784
00:49:20,700 --> 00:49:24,170
but when you decide whether
or not to admit evidence,

785
00:49:24,170 --> 00:49:27,850
if it's prejudicial, you've got
to have really good grounds

786
00:49:27,850 --> 00:49:28,410
to get it in.

787
00:49:28,410 --> 00:49:31,710
Like if the evidence is going to
make the jury think he did it,

788
00:49:31,710 --> 00:49:35,480
then you really got to argue the
evidence is relevant somehow.

789
00:49:35,480 --> 00:49:37,649
There's material
information and that's

790
00:49:37,649 --> 00:49:38,690
what the fight was about.

791
00:49:38,690 --> 00:49:41,740
A 1 in 2,000 relevance
isn't going to cut it.

792
00:49:41,740 --> 00:49:45,569
1 in 2, that's probably
pretty relevant.

793
00:49:45,569 --> 00:49:47,110
And that will be
the grounds on which

794
00:49:47,110 --> 00:49:49,349
the judge makes his decision.

795
00:49:49,349 --> 00:49:50,890
But yeah, you assume
he didn't do it.

796
00:49:56,256 --> 00:49:56,860
All right.

797
00:49:56,860 --> 00:49:57,830
Back to independence.

798
00:49:57,830 --> 00:50:03,410
So the last example today is
derived from a famous paradox

799
00:50:03,410 --> 00:50:05,470
and has several actually
important applications

800
00:50:05,470 --> 00:50:06,880
in computer science.

801
00:50:06,880 --> 00:50:09,340
And this problem is known
as the birthday problem

802
00:50:09,340 --> 00:50:10,340
or the birthday paradox.

803
00:50:13,170 --> 00:50:16,040
It's a paradox because it sort
of has a surprising answer.

804
00:50:20,091 --> 00:50:21,590
Probably a lot of
you have seen this

805
00:50:21,590 --> 00:50:23,318
before in some form or another.

806
00:50:37,860 --> 00:50:45,180
In the birthday problem,
there are N birthdays

807
00:50:45,180 --> 00:50:46,680
and typically
we're going to look

808
00:50:46,680 --> 00:50:53,290
at the case where N is
365, the days of the year,

809
00:50:53,290 --> 00:50:54,350
and there is M people.

810
00:50:59,330 --> 00:51:02,570
And for example, you know, maybe
there's 100 people here.

811
00:51:07,070 --> 00:51:15,480
And what we want to know
is, what is the probability

812
00:51:15,480 --> 00:51:21,770
that two or more people
have the same birthday.

813
00:51:32,760 --> 00:51:34,260
For example, how
many people think

814
00:51:34,260 --> 00:51:37,450
there's at least a 50%
chance that a pair of you

815
00:51:37,450 --> 00:51:41,460
in the audience here
have the same birthday?

816
00:51:41,460 --> 00:51:42,870
That's good.

817
00:51:42,870 --> 00:51:48,030
How many people think there's
a better than 90% chance?

818
00:51:48,030 --> 00:51:49,382
A few of you.

819
00:51:49,382 --> 00:51:49,930
All right.

820
00:51:49,930 --> 00:51:53,410
How many people think there's
a better than a 99% chance

821
00:51:53,410 --> 00:51:55,700
that there's a pair
of matching birthdays?

822
00:51:55,700 --> 00:51:56,950
A couple left.

823
00:51:56,950 --> 00:52:01,060
How many think it's better
than a 99.9% chance?

824
00:52:01,060 --> 00:52:01,974
We've got one, two.

825
00:52:01,974 --> 00:52:03,390
You guys are going
to be stubborn.

826
00:52:03,390 --> 00:52:03,990
Another one.

827
00:52:03,990 --> 00:52:04,500
All right.

828
00:52:04,500 --> 00:52:11,770
How many people think it's
more than 99.999% chance?

829
00:52:11,770 --> 00:52:13,340
Actually it's six 9's.

830
00:52:13,340 --> 00:52:15,440
It's incredible.

831
00:52:15,440 --> 00:52:18,120
It is a virtual certainty.

832
00:52:18,120 --> 00:52:19,200
So let's see.

833
00:52:19,200 --> 00:52:23,170
In fact, the chance that
you're all different is about 1

834
00:52:23,170 --> 00:52:27,290
in 3 million chance that
you're all different.

835
00:52:27,290 --> 00:52:30,810
And we're going to see
why that's true here.

836
00:52:30,810 --> 00:52:33,410
But to do that, we're
going to need to make

837
00:52:33,410 --> 00:52:36,720
two important assumptions.

838
00:52:36,720 --> 00:52:39,850
Any ideas about what assumptions
you're going to need?

839
00:52:39,850 --> 00:52:40,350
Yeah.

840
00:52:40,350 --> 00:52:42,170
AUDIENCE: Birthdays are
uniformly distributed.

841
00:52:42,170 --> 00:52:43,585
TOM LEIGHTON: Birthdays
are uniformly distributed.

842
00:52:43,585 --> 00:52:44,718
Any other ideas?

843
00:52:44,718 --> 00:52:45,218
Yes.

844
00:52:45,218 --> 00:52:46,496
AUDIENCE: He stole my answer.

845
00:52:46,496 --> 00:52:47,870
TOM LEIGHTON: Oh,
he stole yours.

846
00:52:47,870 --> 00:52:50,724
What else are you going
to need to assume?

847
00:52:50,724 --> 00:52:51,700
Yeah.

848
00:52:51,700 --> 00:52:54,319
AUDIENCE: All birthdays are
independent of each other.

849
00:52:54,319 --> 00:52:55,110
TOM LEIGHTON: Yeah.

850
00:52:55,110 --> 00:52:56,410
Mutually independent.

851
00:52:56,410 --> 00:52:58,380
We're going to
need that as well.
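Under exactly those two assumptions (uniform birthdays, mutually independent across people), the chance that everyone is different can be computed directly. This is a minimal sketch; the function name `prob_all_distinct` is made up for this illustration, and it uses the standard argument that person k+1 must avoid the k birthdays already taken.

```python
import math

def prob_all_distinct(m: int, n: int = 365) -> float:
    """Probability that m people all have different birthdays among n days.

    Assumes birthdays are uniform over the n days and mutually independent.
    """
    if m > n:
        return 0.0  # more people than days: some pair must share a birthday
    # Person k+1 must avoid the k distinct birthdays already taken.
    return math.prod((n - k) / n for k in range(m))

p_distinct = prob_all_distinct(100)
print(f"all 100 different: {p_distinct:.3e}")
print(f"some pair matches: {1 - p_distinct:.9f}")
```

For 100 people and 365 days this comes out on the order of 3 in 10 million, in line with the "1 in 3 million" figure quoted above.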

852
00:52:58,380 --> 00:53:03,020
Now, in actuality, neither
is true.

853
00:53:03,020 --> 00:53:04,870
It's well known
that birthdays tend

854
00:53:04,870 --> 00:53:07,700
to follow seasonal
patterns and they're

855
00:53:07,700 --> 00:53:10,070
related to major events.

856
00:53:10,070 --> 00:53:13,590
Now, do you all remember the big
blackout that hit the Northeast

857
00:53:13,590 --> 00:53:14,830
several years ago?

858
00:53:14,830 --> 00:53:16,710
Do you remember that?

859
00:53:16,710 --> 00:53:18,660
Well, it turns out,
this is a true fact,

860
00:53:18,660 --> 00:53:21,850
there were a lot of babies
born nine months later.

861
00:53:21,850 --> 00:53:23,020
In fact, they had a name.

862
00:53:23,020 --> 00:53:24,730
They're called blackout babies.

863
00:53:24,730 --> 00:53:27,950
If you were born in that period
in the Northeast and there's

864
00:53:27,950 --> 00:53:31,770
all these news stories about
the life of the blackout babies.

865
00:53:31,770 --> 00:53:34,370
And the same thing happens
after cold snaps in the winter

866
00:53:34,370 --> 00:53:36,860
and you get a blizzard
or this kind of a thing.

867
00:53:36,860 --> 00:53:39,150
Nine months later,
you get babies.

868
00:53:39,150 --> 00:53:44,040
In fact, I had a personal
experience with this.

869
00:53:44,040 --> 00:53:51,090
Well, my son was born
on October 18, 1996.

870
00:53:51,090 --> 00:53:54,100
And on the day he was born,
we're going to the hospital

871
00:53:54,100 --> 00:53:55,720
and it was a zoo.

872
00:53:55,720 --> 00:53:58,190
The maternity ward
was totally full.

873
00:53:58,190 --> 00:54:01,050
We had to go to some other
wing of the hospital.

874
00:54:01,050 --> 00:54:05,300
And babies were popping
out all over the place.

875
00:54:05,300 --> 00:54:07,040
And I asked, what is going on?

876
00:54:07,040 --> 00:54:10,520
Why don't you have enough
room for all the mothers here?

877
00:54:10,520 --> 00:54:12,520
And they said, oh, it's
all the blizzard babies.

878
00:54:12,520 --> 00:54:13,790
And I go, what?

879
00:54:13,790 --> 00:54:16,100
And they go, well, remember
the blizzard of '96?

880
00:54:16,100 --> 00:54:18,562
It's like, oh yeah.

881
00:54:18,562 --> 00:54:19,528
I remember.

882
00:54:19,528 --> 00:54:20,980
Yeah.

883
00:54:20,980 --> 00:54:23,930
Nine months prior
was the big blizzard

884
00:54:23,930 --> 00:54:28,000
and so it's all the
blizzard babies coming.

885
00:54:28,000 --> 00:54:30,122
So they're not uniform.

886
00:54:30,122 --> 00:54:31,830
They're all different
probabilities here,

887
00:54:31,830 --> 00:54:33,965
but we're going to assume
they're equally likely.

888
00:54:36,570 --> 00:54:41,920
Now, independence is also
not true, in general.

889
00:54:41,920 --> 00:54:47,210
What's one way that birthdays
might not be independent?

890
00:54:47,210 --> 00:54:47,935
What is it?

891
00:54:47,935 --> 00:54:48,601
AUDIENCE: Twins.

892
00:54:48,601 --> 00:54:49,540
TOM LEIGHTON: Twins.

893
00:54:49,540 --> 00:54:52,500
So if they're twins, they
have the same birthday.

894
00:54:52,500 --> 00:54:53,500
Now, there's other ways.

895
00:54:53,500 --> 00:54:56,990
In fact, my only
sibling, my brother,

896
00:54:56,990 --> 00:55:00,880
has the same birthday I do,
but I'm two years older,

897
00:55:00,880 --> 00:55:02,680
so we weren't twins.

898
00:55:02,680 --> 00:55:05,780
Now, you say, what
are the odds of that?

899
00:55:05,780 --> 00:55:10,200
Well, 1 in 365, you think.

900
00:55:10,200 --> 00:55:11,800
Well, one day I'm
in middle school,

901
00:55:11,800 --> 00:55:14,040
about the age you start
thinking about these things,

902
00:55:14,040 --> 00:55:16,060
and you get the idea to
count back nine months

903
00:55:16,060 --> 00:55:17,160
from your birthday.

904
00:55:17,160 --> 00:55:18,940
Probably some of
you have done that.

905
00:55:18,940 --> 00:55:24,490
And I did that and
that's my dad's birthday.

906
00:55:24,490 --> 00:55:25,914
I was like, oh.

907
00:55:28,520 --> 00:55:32,100
Maybe it's not 1 in 365.

908
00:55:32,100 --> 00:55:35,810
It's like, Happy Birthday.

909
00:55:35,810 --> 00:55:36,860
I don't know.

910
00:55:36,860 --> 00:55:39,160
Anyway, I almost needed to
go into therapy after that,

911
00:55:39,160 --> 00:55:40,800
you know.

912
00:55:40,800 --> 00:55:45,810
So now you all got to count back
nine months from your birthday.

913
00:55:45,810 --> 00:55:49,100
Anybody whose birthday is on
September 30 or October 1,

914
00:55:49,100 --> 00:55:51,360
nine months back
is New Year's Eve.

915
00:55:51,360 --> 00:55:53,100
That's dangerous.

916
00:55:53,100 --> 00:55:57,100
So in reality, birthdays
are not independent

917
00:55:57,100 --> 00:55:59,610
and they are not
uniformly distributed,

918
00:55:59,610 --> 00:56:02,280
but we're going to
assume that because we're

919
00:56:02,280 --> 00:56:05,510
going to use this same analysis
for computer science problems

920
00:56:05,510 --> 00:56:09,831
where things are, hopefully,
more independent and random.

921
00:56:09,831 --> 00:56:11,330
Now, we're going
to do an experiment

922
00:56:11,330 --> 00:56:14,502
to see how many people
it takes us to get

923
00:56:14,502 --> 00:56:15,710
a pair of matching birthdays.

924
00:56:15,710 --> 00:56:18,220
So I'm going to run through
people in order in the rows

925
00:56:18,220 --> 00:56:20,440
here, get your birthday
and we're going to record

926
00:56:20,440 --> 00:56:22,680
and we're going to see how
far we go until there's

927
00:56:22,680 --> 00:56:24,880
a match in that group.

928
00:56:24,880 --> 00:56:26,370
So I will write up
the months here.

929
00:56:51,230 --> 00:56:55,950
And we'll start with my
birthday is October 28.

930
00:56:55,950 --> 00:56:57,180
So let's go right across.

931
00:56:57,180 --> 00:56:57,830
What's yours?

932
00:56:57,830 --> 00:56:59,120
AUDIENCE: April 1.

933
00:56:59,120 --> 00:57:00,230
TOM LEIGHTON: April 1.

934
00:57:06,610 --> 00:57:07,260
OK.

935
00:57:07,260 --> 00:57:08,810
We won't embarrass you here.

936
00:57:08,810 --> 00:57:10,930
OK, who's next?

937
00:57:10,930 --> 00:57:12,283
What's your birthday?

938
00:57:12,283 --> 00:57:13,672
AUDIENCE: I'm sorry.

939
00:57:13,672 --> 00:57:14,600
September 2.

940
00:57:14,600 --> 00:57:15,910
TOM LEIGHTON: September 2.

941
00:57:15,910 --> 00:57:18,020
All right.

942
00:57:18,020 --> 00:57:18,907
Yours.

943
00:57:18,907 --> 00:57:19,615
AUDIENCE: June 1.

944
00:57:19,615 --> 00:57:22,380
TOM LEIGHTON: June 1.

945
00:57:22,380 --> 00:57:22,880
OK.

946
00:57:22,880 --> 00:57:23,560
We'll come back.

947
00:57:23,560 --> 00:57:24,310
AUDIENCE: April 8.

948
00:57:24,310 --> 00:57:24,940
TOM LEIGHTON: What is it?

949
00:57:24,940 --> 00:57:25,880
AUDIENCE: April 8.

950
00:57:25,880 --> 00:57:27,481
TOM LEIGHTON: April 8.

951
00:57:27,481 --> 00:57:28,954
All right.

952
00:57:28,954 --> 00:57:30,427
AUDIENCE: November 20.

953
00:57:30,427 --> 00:57:32,992
TOM LEIGHTON: November 20.

954
00:57:32,992 --> 00:57:34,796
AUDIENCE: June 12.

955
00:57:34,796 --> 00:57:37,736
TOM LEIGHTON: June 12.

956
00:57:37,736 --> 00:57:39,640
AUDIENCE: December 29.

957
00:57:39,640 --> 00:57:41,952
TOM LEIGHTON: December 29.

958
00:57:41,952 --> 00:57:44,162
AUDIENCE: [INAUDIBLE].

959
00:57:44,162 --> 00:57:45,260
TOM LEIGHTON: What is it?

960
00:57:45,260 --> 00:57:46,009
AUDIENCE: June 14.

961
00:57:46,009 --> 00:57:47,480
TOM LEIGHTON: June 14.

962
00:57:47,480 --> 00:57:48,820
Ooh, I almost got one there.

963
00:57:48,820 --> 00:57:50,045
That one's close.

964
00:57:50,045 --> 00:57:51,229
All right.

965
00:57:51,229 --> 00:57:51,770
What's yours?

966
00:57:51,770 --> 00:57:53,260
AUDIENCE: March 6.

967
00:57:53,260 --> 00:57:55,000
TOM LEIGHTON: March 6.

968
00:57:55,000 --> 00:57:56,344
AUDIENCE: May 2.

969
00:57:56,344 --> 00:57:58,650
TOM LEIGHTON: May 2.

970
00:57:58,650 --> 00:58:00,497
AUDIENCE: 17th of November.

971
00:58:00,497 --> 00:58:01,580
TOM LEIGHTON: November 17.

972
00:58:01,580 --> 00:58:02,530
Close again.

973
00:58:02,530 --> 00:58:04,695
AUDIENCE: August 4.

974
00:58:04,695 --> 00:58:07,028
TOM LEIGHTON: August 4.

975
00:58:07,028 --> 00:58:08,980
AUDIENCE: July 25.

976
00:58:08,980 --> 00:58:10,785
TOM LEIGHTON: July 25.

977
00:58:10,785 --> 00:58:14,170
I don't think we'll get
to 100 here, hopefully.

978
00:58:14,170 --> 00:58:15,175
Yeah, what's yours?

979
00:58:15,175 --> 00:58:16,050
AUDIENCE: October 30.

980
00:58:16,050 --> 00:58:16,795
TOM LEIGHTON: What is it?

981
00:58:16,795 --> 00:58:17,710
AUDIENCE: October 30.

982
00:58:17,710 --> 00:58:18,751
TOM LEIGHTON: October 30.

983
00:58:18,751 --> 00:58:20,716
Got close.

984
00:58:20,716 --> 00:58:22,012
AUDIENCE: July 6.

985
00:58:22,012 --> 00:58:23,320
TOM LEIGHTON: July 6.

986
00:58:23,320 --> 00:58:24,726
All right.

987
00:58:24,726 --> 00:58:26,214
AUDIENCE: February 25.

988
00:58:26,214 --> 00:58:27,440
TOM LEIGHTON: February 25.

989
00:58:30,580 --> 00:58:32,228
AUDIENCE: May 21.

990
00:58:32,228 --> 00:58:33,930
TOM LEIGHTON: May what?

991
00:58:33,930 --> 00:58:37,630
21st of May.

992
00:58:37,630 --> 00:58:38,906
AUDIENCE: May 30.

993
00:58:38,906 --> 00:58:41,820
TOM LEIGHTON: May 30.

994
00:58:41,820 --> 00:58:43,630
You guys fooled me.

995
00:58:43,630 --> 00:58:44,660
What have you got?

996
00:58:44,660 --> 00:58:45,815
AUDIENCE: January 12.

997
00:58:45,815 --> 00:58:47,890
TOM LEIGHTON: January 12.

998
00:58:47,890 --> 00:58:48,852
All right.

999
00:58:48,852 --> 00:58:50,696
AUDIENCE: July 14.

1000
00:58:50,696 --> 00:58:52,280
TOM LEIGHTON: July 14.

1001
00:58:54,955 --> 00:58:55,455
OK.

1002
00:58:55,455 --> 00:58:57,155
AUDIENCE: April 30.

1003
00:58:57,155 --> 00:59:00,303
TOM LEIGHTON: April 30.

1004
00:59:00,303 --> 00:59:02,067
AUDIENCE: March 13.

1005
00:59:02,067 --> 00:59:05,360
TOM LEIGHTON: March 13.

1006
00:59:05,360 --> 00:59:06,000
All right.

1007
00:59:06,000 --> 00:59:06,610
Did I get--

1008
00:59:06,610 --> 00:59:07,705
AUDIENCE: October 7.

1009
00:59:07,705 --> 00:59:10,044
TOM LEIGHTON: October 7.

1010
00:59:10,044 --> 00:59:11,460
AUDIENCE: October 8.

1011
00:59:11,460 --> 00:59:13,170
TOM LEIGHTON: Ah, you guys.

1012
00:59:16,376 --> 00:59:17,750
OK.

1013
00:59:17,750 --> 00:59:18,470
Did I get you?

1014
00:59:18,470 --> 00:59:19,740
AUDIENCE: September 15.

1015
00:59:19,740 --> 00:59:22,581
TOM LEIGHTON: September 15.

1016
00:59:22,581 --> 00:59:24,916
AUDIENCE: November 9.

1017
00:59:24,916 --> 00:59:26,464
TOM LEIGHTON: November 9.

1018
00:59:26,464 --> 00:59:26,980
All right.

1019
00:59:26,980 --> 00:59:27,730
AUDIENCE: July 15.

1020
00:59:27,730 --> 00:59:32,190
TOM LEIGHTON: July 15.

1021
00:59:32,190 --> 00:59:33,306
Close.

1022
00:59:33,306 --> 00:59:34,614
AUDIENCE: September 3.

1023
00:59:34,614 --> 00:59:36,280
TOM LEIGHTON: September 3.

1024
00:59:36,280 --> 00:59:38,680
You guys are killing me here.

1025
00:59:38,680 --> 00:59:40,156
AUDIENCE: February 6.

1026
00:59:40,156 --> 00:59:41,970
TOM LEIGHTON: February 6.

1027
00:59:41,970 --> 00:59:44,754
AUDIENCE: October 26.

1028
00:59:44,754 --> 00:59:46,514
TOM LEIGHTON: OK.

1029
00:59:46,514 --> 00:59:48,834
AUDIENCE: November 2.

1030
00:59:48,834 --> 00:59:51,163
TOM LEIGHTON: November 2.

1031
00:59:51,163 --> 00:59:54,121
AUDIENCE: January 23.

1032
00:59:54,121 --> 00:59:56,578
TOM LEIGHTON: January 23.

1033
00:59:56,578 --> 00:59:59,434
AUDIENCE: September 27.

1034
00:59:59,434 --> 01:00:02,230
TOM LEIGHTON: You guys are going
to set a record for sure here.

1035
01:00:02,230 --> 01:00:03,890
This isn't the way
it's supposed to go.

1036
01:00:03,890 --> 01:00:05,292
AUDIENCE: December 30.

1037
01:00:05,292 --> 01:00:06,530
TOM LEIGHTON: December 30.

1038
01:00:09,290 --> 01:00:10,680
AUDIENCE: December 28.

1039
01:00:10,680 --> 01:00:12,975
TOM LEIGHTON: Ah, come on, guys.

1040
01:00:15,730 --> 01:00:18,410
What is the probability
of going this long here?

1041
01:00:18,410 --> 01:00:19,122
Yeah.

1042
01:00:19,122 --> 01:00:20,420
AUDIENCE: September 22.

1043
01:00:20,420 --> 01:00:23,130
TOM LEIGHTON: September 22.

1044
01:00:23,130 --> 01:00:26,399
AUDIENCE: July 30.

1045
01:00:26,399 --> 01:00:29,396
TOM LEIGHTON: July 30.

1046
01:00:29,396 --> 01:00:32,330
AUDIENCE: The 24th of August.

1047
01:00:32,330 --> 01:00:33,740
TOM LEIGHTON: 24th August.

1048
01:00:33,740 --> 01:00:35,864
I'm going to have to ask
the same person to tell me

1049
01:00:35,864 --> 01:00:37,110
twice here to get a match.

1050
01:00:37,110 --> 01:00:38,150
We got over there now?

1051
01:00:38,150 --> 01:00:39,460
AUDIENCE: April 6.

1052
01:00:39,460 --> 01:00:42,141
TOM LEIGHTON: April 6.

1053
01:00:42,141 --> 01:00:44,396
AUDIENCE: October 16.

1054
01:00:44,396 --> 01:00:45,588
TOM LEIGHTON: October 16.

1055
01:00:45,588 --> 01:00:47,500
AUDIENCE: Did ask how many--

1056
01:00:47,500 --> 01:00:48,452
AUDIENCE: September 3.

1057
01:00:48,452 --> 01:00:50,460
TOM LEIGHTON: September 3.

1058
01:00:50,460 --> 01:00:52,545
All right.

1059
01:00:52,545 --> 01:00:55,275
Very good.

1060
01:00:55,275 --> 01:00:56,190
All right.

1061
01:00:56,190 --> 01:01:00,230
Let's count and see
how many we got here.

1062
01:01:00,230 --> 01:01:02,912
1, 2, 3, 4, 5, 6, 7, 8.

1063
01:01:02,912 --> 01:01:08,960
9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20

1064
01:01:08,960 --> 01:01:19,060
21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36,

1065
01:01:19,060 --> 01:01:22,630
37, 38, 39, 40, 41, 42.

1066
01:01:22,630 --> 01:01:24,360
That is a record.

1067
01:01:24,360 --> 01:01:29,230
So it took 42 people
to get a match.

1068
01:01:29,230 --> 01:01:32,750
Now it turns out that
for N equals 365,

1069
01:01:32,750 --> 01:01:36,840
the magic number for M
is 23, that by 23 people,

1070
01:01:36,840 --> 01:01:39,180
we got a 50-50 chance.

1071
01:01:39,180 --> 01:01:44,260
In fact, the probability of a
match on 23 people is 0.506.

1072
01:01:44,260 --> 01:01:47,537
It's a little bit better
than 50-50 chance at 23.

1073
01:01:47,537 --> 01:01:48,870
Now, maybe we should figure out.

1074
01:01:48,870 --> 01:01:50,536
It's too late for
homework to figure out

1075
01:01:50,536 --> 01:01:53,340
what the chances are of going
this long without a match.

1076
01:01:53,340 --> 01:01:55,910
That may be worth
figuring out.

1077
01:01:55,910 --> 01:01:58,830
Now, it may seem
surprising at first

1078
01:01:58,830 --> 01:02:01,530
that 23 people is
enough to have a 50/50

1079
01:02:01,530 --> 01:02:06,000
chance because the chance of
any pair matching is 1 in 365,

1080
01:02:06,000 --> 01:02:07,770
by our assumption.

1081
01:02:07,770 --> 01:02:12,330
And that's small, but there's
lots of pairs of people

1082
01:02:12,330 --> 01:02:15,950
and every pair of people
have a chance to match

1083
01:02:15,950 --> 01:02:20,420
and that's why 23 turns out
to be enough to get to 50-50.

1084
01:02:20,420 --> 01:02:23,790
Now, we're going to do the
analysis for general M and N

1085
01:02:23,790 --> 01:02:26,490
to figure out the
probability of a match

1086
01:02:26,490 --> 01:02:31,230
if there's M people
and N birthdays.

1087
01:02:31,230 --> 01:02:32,990
There's lots of ways to do it.

1088
01:02:36,140 --> 01:02:41,220
The easiest is to sort of well,
we'll draw the sample space.

1089
01:02:41,220 --> 01:02:43,110
It will be too big to
draw the whole thing,

1090
01:02:43,110 --> 01:02:46,080
but we can sort of
model the sample space

1091
01:02:46,080 --> 01:02:47,538
and then look at
the sample points.

1092
01:03:01,580 --> 01:03:08,710
So you've got the first person
and there's N birthdays here,

1093
01:03:08,710 --> 01:03:14,630
so it could be anywhere from
January 1 out to December 31

1094
01:03:14,630 --> 01:03:18,480
and in general this
will be N. And then you

1095
01:03:18,480 --> 01:03:29,280
have the second person and
they have N possibilities

1096
01:03:29,280 --> 01:03:31,510
for their birthday.

1097
01:03:31,510 --> 01:03:38,400
And you take the tree down M
levels to the very last person

1098
01:03:38,400 --> 01:03:38,900
here.

1099
01:03:45,920 --> 01:03:50,530
So each node has degree N and
there's M levels on this tree.

1100
01:03:50,530 --> 01:04:03,560
So the sample space is the set
of all M-tuples b1, b2, to bM,

1101
01:04:03,560 --> 01:04:10,880
these are the birthdays
where every value of bi

1102
01:04:10,880 --> 01:04:17,270
is between 1 and N. So a sample
point is all the birthdays

1103
01:04:17,270 --> 01:04:18,925
of the M people.

1104
01:04:22,180 --> 01:04:24,095
How many sample
points are there here?

1105
01:04:28,460 --> 01:04:32,610
Remember how to
count these things?

1106
01:04:32,610 --> 01:04:35,739
Number of leaves on an
N-ary tree of depth M or you

1107
01:04:35,739 --> 01:04:36,780
can think of it this way.

1108
01:04:36,780 --> 01:04:41,950
I've got N choices for each
bi and there's M of them.

1109
01:04:41,950 --> 01:04:42,884
AUDIENCE: [INAUDIBLE].

1110
01:04:42,884 --> 01:04:45,050
TOM LEIGHTON: So what's the
number of sample points?

1111
01:04:45,050 --> 01:04:47,020
AUDIENCE: N to the M.

1112
01:04:47,020 --> 01:04:54,720
TOM LEIGHTON: N to the M.
Because N choices here,

1113
01:04:54,720 --> 01:04:57,820
N choices here, N
choices there, so you

1114
01:04:57,820 --> 01:05:02,220
have N times N times N M times.

1115
01:05:02,220 --> 01:05:07,160
And what's the probability
of each outcome?

1116
01:05:07,160 --> 01:05:09,645
For a set of possible birthdays,
what's its probability?

1117
01:05:13,460 --> 01:05:19,160
What's the probability
of b1, b2, bM?

1118
01:05:27,409 --> 01:05:28,950
So the probability
of a sample point.

1119
01:05:28,950 --> 01:05:31,491
What's the probability that the
first person has birthday b1,

1120
01:05:31,491 --> 01:05:36,810
the second has b2,
and the M-th has bM?

1121
01:05:36,810 --> 01:05:37,546
Remember that?

1122
01:05:37,546 --> 01:05:38,046
Yeah.

1123
01:05:38,046 --> 01:05:39,330
AUDIENCE: 1 over N to the M.

1124
01:05:39,330 --> 01:05:42,410
TOM LEIGHTON: 1 over N to
the M because each edge

1125
01:05:42,410 --> 01:05:46,309
is probability of 1
over N and the paths

1126
01:05:46,309 --> 01:05:48,600
are length M, so you've got
1 over N to the M-th power.

1127
01:05:52,770 --> 01:05:55,500
Probability of the first
birthday matching is 1 in N

1128
01:05:55,500 --> 01:06:01,420
times 1 in N times 1 in N.
And this actually makes sense

1129
01:06:01,420 --> 01:06:04,890
because I've got N to
the M sample points,

1130
01:06:04,890 --> 01:06:08,000
each a probability
1 over N to the M.

1131
01:06:08,000 --> 01:06:11,920
So they all add up
to 1, which is good.

1132
01:06:11,920 --> 01:06:13,550
What kind of sample
space is this

1133
01:06:13,550 --> 01:06:15,530
where this happens where all
the probabilities are the same?

1134
01:06:15,530 --> 01:06:16,302
AUDIENCE: Uniform.

1135
01:06:16,302 --> 01:06:17,580
TOM LEIGHTON: Uniform.

1136
01:06:17,580 --> 01:06:19,080
Makes it very easy to work with.

1137
01:06:19,080 --> 01:06:21,240
All we got to do
now is just count

1138
01:06:21,240 --> 01:06:22,940
the number of
sample points where

1139
01:06:22,940 --> 01:06:27,130
there's a matching birthday and
then we multiply by that one

1140
01:06:27,130 --> 01:06:30,830
probability 1 over N to the M.

1141
01:06:30,830 --> 01:06:33,780
Now, it turns out that
rather than counting

1142
01:06:33,780 --> 01:06:35,500
the number of sample
points where there's

1143
01:06:35,500 --> 01:06:38,920
a matching birthday,
it's easier to count

1144
01:06:38,920 --> 01:06:41,430
the number of sample points
where all the birthdays are

1145
01:06:41,430 --> 01:06:43,095
different.

1146
01:06:43,095 --> 01:06:45,450
And this is often the case
when you're doing a counting

1147
01:06:45,450 --> 01:06:49,740
problem, it's easier
to count the opposite

1148
01:06:49,740 --> 01:06:52,100
of what you're after.

1149
01:06:52,100 --> 01:06:54,420
That can be the case
and it is the case here.

1150
01:06:54,420 --> 01:06:56,260
So we're going to do that.

1151
01:06:59,680 --> 01:07:06,600
So let's count how many
sample points are all

1152
01:07:06,600 --> 01:07:12,130
different birthdays, so no
pair of bi's is the same.

1153
01:07:12,130 --> 01:07:13,180
Let's do that.

1154
01:07:13,180 --> 01:07:15,724
How many choices
are there for b1?

1155
01:07:15,724 --> 01:07:20,480
365 or N. Let's do
this in terms of N

1156
01:07:20,480 --> 01:07:22,400
because we're going to
use this for general N.

1157
01:07:22,400 --> 01:07:25,470
How many choices for b2?

1158
01:07:25,470 --> 01:07:26,360
N minus 1.

1159
01:07:26,360 --> 01:07:28,950
Given the first
one, you can't match it.

1160
01:07:28,950 --> 01:07:33,570
And then N minus 2 all the
way over to the last one

1161
01:07:33,570 --> 01:07:37,360
is N minus M plus 1.

1162
01:07:37,360 --> 01:07:41,050
And this is a formula
you should all remember.

1163
01:07:41,050 --> 01:07:46,430
That's just N factorial
over N minus M factorial.
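For small values, both counts can be checked by brute force. The sketch below is only an illustration: the helper name `count_sample_points` and the toy sizes N = 5, M = 3 are assumptions, not from the lecture.

```python
from itertools import product
from math import factorial


def count_sample_points(n, m):
    """Enumerate every m-tuple of birthdays drawn from 1..n.

    Returns (total number of tuples, number with all entries different).
    Only practical for small n and m, since there are n**m tuples.
    """
    total = 0
    distinct = 0
    for birthdays in product(range(1, n + 1), repeat=m):
        total += 1
        if len(set(birthdays)) == m:  # no two birthdays match
            distinct += 1
    return total, distinct


total, distinct = count_sample_points(5, 3)
print(total, 5 ** 3)                                   # N to the M
print(distinct, factorial(5) // factorial(5 - 3))      # N! / (N - M)!
```

Each printed pair should agree: the enumeration matches N to the M for the whole sample space and N factorial over N minus M factorial for the all-different points.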

1164
01:07:46,430 --> 01:07:49,300
You did this sort of stuff a
couple weeks ago with counting

1165
01:07:49,300 --> 01:07:54,250
sets and probability is really--
a lot of it's about counting.

1166
01:07:54,250 --> 01:07:56,280
So now we can compute
the probability

1167
01:07:56,280 --> 01:07:57,990
that all the birthdays
are different.

1168
01:08:04,630 --> 01:08:08,660
It's just adding up all the
sample points of which there's

1169
01:08:08,660 --> 01:08:14,080
N factorial over N minus
M factorial and multiply

1170
01:08:14,080 --> 01:08:16,630
by the probability
of each one, which is

1171
01:08:16,630 --> 01:08:22,430
1 over N to the M. All right.
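The exact probability can be evaluated directly without forming the large factorials. This is a minimal sketch, assuming uniform and mutually independent birthdays as in the lecture; the function name `p_all_different` is made up for illustration.

```python
def p_all_different(n, m):
    """Probability that m people all have different birthdays among n days.

    Equals N! / ((N - M)! * N**M), computed as the product
    (N/N) * ((N-1)/N) * ... * ((N-M+1)/N) to avoid huge integers.
    """
    p = 1.0
    for i in range(m):
        p *= (n - i) / n
    return p


for m in (23, 42, 100):
    print(m, p_all_different(365, m))
```

Writing the ratio as a product of M factors is the same quantity term by term: the numerator N(N-1)...(N-M+1) is N factorial over N minus M factorial, and each factor carries one of the M copies of 1 over N.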

1172
01:08:22,430 --> 01:08:24,630
So we've actually now
answered the question.

1173
01:08:24,630 --> 01:08:28,670
This is the probability that
all the birthdays are different.

1174
01:08:28,670 --> 01:08:32,560
The only problem is,
it's not so clear

1175
01:08:32,560 --> 01:08:34,720
what the answer is to
actually compute this

1176
01:08:34,720 --> 01:08:37,250
or how fast it grows.

1177
01:08:37,250 --> 01:08:40,410
So if I wanted to get
a closed form for this

1178
01:08:40,410 --> 01:08:42,979
without the factorials,
what do I do?

1179
01:08:42,979 --> 01:08:45,200
What do I use?

1180
01:08:45,200 --> 01:08:48,310
Stirling's formula.

1181
01:08:48,310 --> 01:08:49,460
So let's remember that.

1182
01:08:57,630 --> 01:09:01,870
It says that N factorial
is asymptotically equal

1183
01:09:01,870 --> 01:09:09,830
to square root 2 pi N
times N over e to the N.

1184
01:09:09,830 --> 01:09:15,529
And that is accurate within
0.1% when N is at least 100.

1185
01:09:15,529 --> 01:09:18,240
So not only is it
asymptotically equal,

1186
01:09:18,240 --> 01:09:25,020
it's right on track for
a reasonable size N.
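The accuracy claim is easy to check numerically. The sketch below compares Stirling's formula with the exact factorial; the function names are illustrative assumptions, and the float arithmetic only works for N up to roughly 170 before overflowing.

```python
from math import e, factorial, pi, sqrt


def stirling(n):
    """Stirling's approximation: sqrt(2 * pi * n) * (n / e) ** n."""
    return sqrt(2 * pi * n) * (n / e) ** n


def relative_error(n):
    """Fraction by which Stirling's formula falls short of n!."""
    return 1 - stirling(n) / factorial(n)


for n in (10, 50, 100):
    print(n, relative_error(n))
```

The error shrinks roughly like 1 over 12N, so at N = 100 it is a bit under 0.1%, in line with the figure quoted above.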

1187
01:09:25,020 --> 01:09:27,899
Now, I won't drag you
through all the calculations.

1188
01:09:27,899 --> 01:09:30,850
I used to actually try
plugging that formula

1189
01:09:30,850 --> 01:09:33,024
in for here and
here and then going

1190
01:09:33,024 --> 01:09:35,440
through all the calculations,
but we won't do it in class.

1191
01:09:35,440 --> 01:09:37,300
It's in the text.

1192
01:09:37,300 --> 01:09:40,102
But I will tell you
where that winds up.

1193
01:09:40,102 --> 01:09:42,310
It's not hard, you've just
got to do the calculation.

1194
01:09:48,160 --> 01:09:51,430
So this means the probability
that all birthdays are

1195
01:09:51,430 --> 01:10:04,830
different turns out to be
asymptotically equal to e

1196
01:10:04,830 --> 01:10:15,530
to the N minus M plus 1/2 times
the natural log of N over N

1197
01:10:15,530 --> 01:10:23,670
minus M minus M. And that's
accurate to within 0.2%,

1198
01:10:23,670 --> 01:10:25,950
if N and N minus M are
large, larger than 100.
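For reference, here is a sketch of the omitted calculation, substituting Stirling's formula for both N factorial and N minus M factorial. The grouping of terms is one reasonable way to arrange it and may differ from the text's presentation.

```latex
\[
\frac{N!}{(N-M)!\,N^{M}}
\;\sim\;
\frac{\sqrt{2\pi N}\,\left(\frac{N}{e}\right)^{N}}
     {\sqrt{2\pi (N-M)}\,\left(\frac{N-M}{e}\right)^{N-M} N^{M}}
\;=\;
\left(\frac{N}{N-M}\right)^{N-M+\frac{1}{2}} e^{-M}
\;=\;
\exp\!\left(\left(N-M+\tfrac{1}{2}\right)\ln\frac{N}{N-M} \;-\; M\right)
\]
```

The square roots contribute the extra 1/2 in the exponent, the powers of e leave a factor of e to the minus M, and N to the N over N to the M leaves N to the N minus M.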

1199
01:10:25,950 --> 01:10:29,830
So in fact, it's almost equal.

1200
01:10:29,830 --> 01:10:36,650
And now you could plug in N
equals 365 and M equals 100.

1201
01:10:36,650 --> 01:10:41,950
So if you do that, in fact,
if somebody has a calculator,

1202
01:10:41,950 --> 01:10:45,950
we should plug in,
what do we have, 42.

1203
01:10:45,950 --> 01:10:48,020
You should plug in
M equals 42 and see

1204
01:10:48,020 --> 01:10:50,300
what the probability is.

1205
01:10:50,300 --> 01:10:56,330
But if M is 100, the chance
that we're all different,

1206
01:10:56,330 --> 01:11:05,150
this equals 3.07 dot, dot,
dot times 10 to the minus 7.

1207
01:11:05,150 --> 01:11:07,720
And we should check
for M equals 42.

1208
01:11:07,720 --> 01:11:10,080
My guess is it's pretty
small, but I don't know.

1209
01:11:10,080 --> 01:11:11,351
We'll have to check that.

1210
01:11:11,351 --> 01:11:14,648
AUDIENCE: 0.0859.

1211
01:11:14,648 --> 01:11:15,540
TOM LEIGHTON: Great.

1212
01:11:15,540 --> 01:11:22,190
So the chance of having 42
people all miss is about 9%.

1213
01:11:22,190 --> 01:11:23,480
So we were little unlucky.

1214
01:11:23,480 --> 01:11:25,770
That won't happen very often.

1215
01:11:25,770 --> 01:11:29,720
But when you go from 42 to
100, it gets really small.

1216
01:11:29,720 --> 01:11:31,020
1 in 3 million or so.

1217
01:11:33,870 --> 01:11:39,840
Now, if N is 365 and M
is 23, the probability

1218
01:11:39,840 --> 01:11:44,190
comes out to be about
0.49, so about 50-50,

1219
01:11:44,190 --> 01:11:46,094
they're all different.

1220
01:11:50,540 --> 01:11:51,040
Now.

1221
01:11:51,040 --> 01:11:54,210
For general M and
N, we'd like to know

1222
01:11:54,210 --> 01:11:56,640
when do you get to
the 50-50 point?

1223
01:11:56,640 --> 01:12:00,550
We'd like to derive an
equation for M in terms of N

1224
01:12:00,550 --> 01:12:04,350
where the probability of being
all different is about 1/2.

1225
01:12:04,350 --> 01:12:04,850
All right.

1226
01:12:04,850 --> 01:12:05,558
So let's do that.

1227
01:12:16,270 --> 01:12:21,490
So as long as we assume--
and this will turn out

1228
01:12:21,490 --> 01:12:24,830
to be true-- that M is a
little o of N to the 2/3

1229
01:12:24,830 --> 01:12:30,190
and remember little o means it
grows slower than N to the 2/3.

1230
01:12:30,190 --> 01:12:32,230
Then we can simplify
that expression

1231
01:12:32,230 --> 01:12:35,240
in asymptotic notation.

1232
01:12:35,240 --> 01:12:40,405
And when you do it, I won't
drag it through on the board.

1233
01:12:40,405 --> 01:12:45,030
It's also in the text, it
turns out to be much simpler.

1234
01:12:45,030 --> 01:12:49,900
It's just e to the
minus M squared over 2N.

1235
01:12:49,900 --> 01:12:51,820
So I take that
thing up there and I

1236
01:12:51,820 --> 01:12:57,400
assume that M is growing less
fast than the 2/3 power of N

1237
01:12:57,400 --> 01:12:59,500
and that whole upper
expression reduces down

1238
01:12:59,500 --> 01:13:01,110
to M squared over 2N.

1239
01:13:01,110 --> 01:13:04,310
Everything else goes
to 0 in the exponent.

1240
01:13:04,310 --> 01:13:06,260
Doesn't matter.

1241
01:13:06,260 --> 01:13:11,970
Now, if I set this
to be 1/2, I can

1242
01:13:11,970 --> 01:13:17,610
solve this to find out what M
has to be to make that be 1/2.

1243
01:13:17,610 --> 01:13:18,110
All right.

1244
01:13:18,110 --> 01:13:24,710
So this will be true if and
only if minus M squared over 2N

1245
01:13:24,710 --> 01:13:26,270
is equal to the
natural log of 1/2.

1246
01:13:29,610 --> 01:13:31,040
And that's true.

1247
01:13:31,040 --> 01:13:34,820
Take the minus sign, put it
inside to make a log of 2,

1248
01:13:34,820 --> 01:13:36,580
multiply by 2N.

1249
01:13:36,580 --> 01:13:43,590
That's true if M squared
equals 2N natural log of 2.

1250
01:13:43,590 --> 01:13:46,800
And now I can solve
for M really easily.

1251
01:13:46,800 --> 01:13:53,090
That's true if and
only if M equals

1252
01:13:53,090 --> 01:13:58,360
the square root of 2N times
the natural log of 2, which

1253
01:13:58,360 --> 01:14:08,560
is about 1.177 square root
of N. So for general N,

1254
01:14:08,560 --> 01:14:14,020
you get a 50% probability of
having a matching birthday when

1255
01:14:14,020 --> 01:14:21,920
M is in this range, pretty
close to 1.2 square root of N.
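As a rough check on the square-root rule, the sketch below compares the estimate with the exact 50-50 point found by direct search. All function names are illustrative assumptions, and birthdays are assumed uniform and mutually independent as above.

```python
from math import exp, log, sqrt


def p_all_different(n, m):
    """Exact probability that m items land in m different bins out of n."""
    p = 1.0
    for i in range(m):
        p *= (n - i) / n
    return p


def approx_all_different(n, m):
    """The simplified form e^(-M^2 / 2N), valid when M is small next to N."""
    return exp(-m * m / (2 * n))


def threshold_estimate(n):
    """Solve e^(-M^2 / 2N) = 1/2 for M: sqrt(2 N ln 2), about 1.177 sqrt(N)."""
    return sqrt(2 * n * log(2))


def smallest_m_exact(n):
    """Smallest M whose exact all-different probability is at most 1/2."""
    m = 1
    while p_all_different(n, m) > 0.5:
        m += 1
    return m


n = 365
print(threshold_estimate(n), smallest_m_exact(n))
print(approx_all_different(n, 23), p_all_different(n, 23))
```

For N = 365 the estimate lands just below 23, which agrees with the exact search once you round up to a whole number of people.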

1256
01:14:21,920 --> 01:14:25,024
Now, this square root N
phenomenon, this thing here,

1257
01:14:25,024 --> 01:14:26,940
that's what's known as
the birthday principle.

1258
01:14:29,630 --> 01:14:33,250
It says if you've got roughly
square root of N randomly

1259
01:14:33,250 --> 01:14:39,580
allocated items into N
boxes or bins or birthdays,

1260
01:14:39,580 --> 01:14:41,940
there's a decent
chance two of the items

1261
01:14:41,940 --> 01:14:47,120
will go into the same bin
if they're randomly allocated.

1262
01:14:47,120 --> 01:14:49,130
In this case, the bins
are the possible days

1263
01:14:49,130 --> 01:14:53,150
of the year that we put each
person into for their birthday.

1264
01:14:53,150 --> 01:14:55,525
Any questions about that?

1265
01:14:58,860 --> 01:14:59,360
Yeah.

1266
01:14:59,360 --> 01:15:01,328
AUDIENCE: M and N
are like numbers

1267
01:15:01,328 --> 01:15:03,460
like they're defined up
there or does it mean

1268
01:15:03,460 --> 01:15:05,756
to say M equals [INAUDIBLE]?

1269
01:15:05,756 --> 01:15:06,980
TOM LEIGHTON: Yeah.

1270
01:15:06,980 --> 01:15:10,370
So here I looked at a
special case where N was 365,

1271
01:15:10,370 --> 01:15:13,360
M was 100, but we
can imagine them

1272
01:15:13,360 --> 01:15:17,350
as arbitrary numbers that
could be getting large.

1273
01:15:17,350 --> 01:15:22,480
And so over here and I say M
is little o of N to the 2/3,

1274
01:15:22,480 --> 01:15:26,250
I mean, well, M equals square
root of N would qualify.

1275
01:15:26,250 --> 01:15:29,950
Square root of N is
little o of N to the 2/3.

1276
01:15:29,950 --> 01:15:33,320
So as long as M is
not growing too fast,

1277
01:15:33,320 --> 01:15:36,950
I can simplify that expression
up there, which is what I did.

1278
01:15:36,950 --> 01:15:40,890
And then we go back
and we find, in fact,

1279
01:15:40,890 --> 01:15:42,980
the square root of
N is the right answer

1280
01:15:42,980 --> 01:15:45,650
and that is little
o of N to the 2/3.

1281
01:15:45,650 --> 01:15:48,400
And I have to use a
different argument

1282
01:15:48,400 --> 01:15:51,750
if I assumed M was
bigger, which I didn't do.

1283
01:15:51,750 --> 01:15:53,140
I didn't drag you through that.

1284
01:15:53,140 --> 01:15:56,340
But I would have to
go check that case.

1285
01:15:56,340 --> 01:15:58,320
So we can think of M and N,
in general, as being

1286
01:15:58,320 --> 01:16:00,730
arbitrary variables and
potentially growing.

1287
01:16:00,730 --> 01:16:03,710
M can be a function
of N. And in fact,

1288
01:16:03,710 --> 01:16:06,170
when M is the square
root function of N, then

1289
01:16:06,170 --> 01:16:07,830
we got a 50% chance of a match.

1290
01:16:11,080 --> 01:16:15,420
Now, the birthday
principle comes up all

1291
01:16:15,420 --> 01:16:18,700
over the place in
computer science

1292
01:16:18,700 --> 01:16:20,215
and it's worth remembering.

1293
01:16:23,140 --> 01:16:26,286
For example, the
generic form for this

1294
01:16:26,286 --> 01:16:27,660
is when you have
a hash function.

1295
01:16:30,510 --> 01:16:34,040
Let's say I have a
hash function, h,

1296
01:16:34,040 --> 01:16:39,280
from a large set of items
into a small set of items.

1297
01:16:39,280 --> 01:16:43,200
For example, say I'm
computing digital signatures.

1298
01:16:43,200 --> 01:16:45,640
This is the space
of all messages,

1299
01:16:45,640 --> 01:16:49,030
this is the space of all
1,000-bit digital signatures,

1300
01:16:49,030 --> 01:16:51,750
and h is a digital
signature outcome.

1301
01:16:51,750 --> 01:16:53,720
Say I'm doing
memory allocations.

1302
01:16:53,720 --> 01:16:57,190
So all the things I might
be sticking into a register,

1303
01:16:57,190 --> 01:16:58,640
here's all the
places it could go.

1304
01:16:58,640 --> 01:17:00,360
Here's all the registers.

1305
01:17:00,360 --> 01:17:01,760
Error checking.

1306
01:17:01,760 --> 01:17:04,490
This is all the garbled
messages in the world.

1307
01:17:04,490 --> 01:17:07,360
This is the set of
messages that make sense,

1308
01:17:07,360 --> 01:17:12,770
all handled by functions,
random kind of functions often.

1309
01:17:12,770 --> 01:17:17,830
Now, what you worry about when
you're hashing is collisions.

1310
01:17:17,830 --> 01:17:20,240
Let me define that.

1311
01:17:20,240 --> 01:17:29,840
We say that x collides
with y if the hash of x

1312
01:17:29,840 --> 01:17:33,710
equals the hash of y, but
x and y are different.

1313
01:17:36,650 --> 01:17:39,930
For example, say you're
looking at digital signatures.

1314
01:17:39,930 --> 01:17:43,905
You would not want the
signature for a $100 check

1315
01:17:43,905 --> 01:17:49,692
to your mom to match your
signature for a $100,000 check

1316
01:17:49,692 --> 01:17:50,430
to Boris.

1317
01:17:52,717 --> 01:17:54,550
Because that would be
bad because then Boris

1318
01:17:54,550 --> 01:17:58,590
could come in and take that
check to your mom for $100,

1319
01:17:58,590 --> 01:18:01,960
convert it to a
$100,000 check to him

1320
01:18:01,960 --> 01:18:04,370
and the signature is
authentic if there's

1321
01:18:04,370 --> 01:18:06,040
a collision in the signatures.

1322
01:18:06,040 --> 01:18:08,200
So very important when
you're doing hash functions

1323
01:18:08,200 --> 01:18:11,850
and in many applications,
you don't want collisions

1324
01:18:11,850 --> 01:18:14,160
because the whole
thing starts breaking.

1325
01:18:14,160 --> 01:18:15,020
Memory allocation.

1326
01:18:15,020 --> 01:18:17,770
You don't want to assign two
things in the same place.

1327
01:18:17,770 --> 01:18:18,880
Error correction.

1328
01:18:18,880 --> 01:18:23,100
There's only one answer you
want to get out at the end.

1329
01:18:23,100 --> 01:18:24,810
Now, from the pigeonhole
principle,

1330
01:18:24,810 --> 01:18:27,525
you know if this set is
bigger than that set,

1331
01:18:27,525 --> 01:18:28,900
there is going to
be a collision.

1332
01:18:28,900 --> 01:18:30,720
That's what the pigeonhole
principle says.

1333
01:18:30,720 --> 01:18:33,490
Two guys will get mapped
to the same thing.

1334
01:18:33,490 --> 01:18:38,270
However, often in practice what
we care about is a subset L

1335
01:18:38,270 --> 01:18:43,330
prime of L that's pretty small
because the set of messages we

1336
01:18:43,330 --> 01:18:46,370
really sign is pretty small
compared to all 1,000-bit

1337
01:18:46,370 --> 01:18:48,370
signatures that are possible.

1338
01:18:48,370 --> 01:18:51,580
And what you'd like is that for
this smaller set of messages,

1339
01:18:51,580 --> 01:18:56,210
you might want to sign, they
all get mapped one to one.

1340
01:18:56,210 --> 01:19:01,560
And the birthday principle
says life is not so nice.

1341
01:19:01,560 --> 01:19:04,391
So let me write that
down then we'll be done.

1342
01:19:10,592 --> 01:19:11,560
All right.

1343
01:19:11,560 --> 01:19:29,870
So the birthday principle says
that if the size of S is at least 100,

1344
01:19:29,870 --> 01:19:36,590
L prime is a subset of L that is
at least the square root of S.

1345
01:19:36,590 --> 01:19:39,740
So the cardinality of the
things you want to hash

1346
01:19:39,740 --> 01:19:45,800
is bigger than 1.2 times the square
root of the cardinality of S.

1347
01:19:45,800 --> 01:19:57,290
And if the values of the
function h on L prime

1348
01:19:57,290 --> 01:20:14,260
are randomly chosen, uniform,
and mutually independent,

1349
01:20:14,260 --> 01:20:16,630
then there's at least
a 50% chance, so

1350
01:20:16,630 --> 01:20:24,310
with probability at least
1/2, there's a collision.

1351
01:20:24,310 --> 01:20:29,340
There exists an x and a y
such that x does not equal y--

1352
01:20:29,340 --> 01:20:37,621
and these are in L prime--
but h of x equals h of y.

1353
01:20:37,621 --> 01:20:38,120
All right.

1354
01:20:38,120 --> 01:20:41,910
The proof is not hard; we
more or less did it already.

1355
01:20:41,910 --> 01:20:44,880
You just plug in the
cardinality of L prime for M

1356
01:20:44,880 --> 01:20:48,680
and the cardinality of S
for N. And it's bad news

1357
01:20:48,680 --> 01:20:52,450
because it means it doesn't
take very many messages,

1358
01:20:52,450 --> 01:20:57,940
just the square root of the number
of signatures, to get a collision.

1359
01:20:57,940 --> 01:21:01,422
You'd hope that you could
have L prime be

1360
01:21:01,422 --> 01:21:02,880
as big as S and
that somehow they'd

1361
01:21:02,880 --> 01:21:05,637
all go one to one, that
everybody in this room

1362
01:21:05,637 --> 01:21:06,970
would have a different birthday.

1363
01:21:06,970 --> 01:21:09,380
That is not how it works
if things are random,

1364
01:21:09,380 --> 01:21:11,750
which is the case you
usually like to have.

1365
01:21:11,750 --> 01:21:16,000
Now, this technique is used to
crack cryptographic protocols

1366
01:21:16,000 --> 01:21:18,720
and it's called the birthday
attack based on the birthday

1367
01:21:18,720 --> 01:21:19,474
principle.

1368
01:21:19,474 --> 01:21:20,890
So what you do is,
you get a bunch

1369
01:21:20,890 --> 01:21:25,190
of messages that are
encrypted and pretty soon you

1370
01:21:25,190 --> 01:21:28,350
find two that get maybe
encrypted the same way.

1371
01:21:28,350 --> 01:21:30,700
And once you have that,
now you can go back

1372
01:21:30,700 --> 01:21:33,640
and crack the crypto system.

1373
01:21:33,640 --> 01:21:37,680
For example, you break schemes
like RSA with a birthday attack

1374
01:21:37,680 --> 01:21:40,630
if this space is not
big enough and that's

1375
01:21:40,630 --> 01:21:46,310
one reason why now RSA, the
keys have thousands of digits

1376
01:21:46,310 --> 01:21:48,720
because otherwise you
can use attacks like this

1377
01:21:48,720 --> 01:21:52,450
and crack them more easily.
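
The collision-hunting step behind a birthday attack can be sketched as follows. This is only an illustration under stated assumptions: it uses SHA-256 truncated to 2 bytes as a stand-in for a small output space, and it does not attack RSA or any real protocol. The names `short_digest` and `birthday_search` and the `message-<k>` inputs are hypothetical.

```python
import hashlib


def short_digest(message: bytes, num_bytes: int = 2) -> bytes:
    """Stand-in for a small hash space: the first bytes of SHA-256."""
    return hashlib.sha256(message).digest()[:num_bytes]


def birthday_search(num_bytes: int = 2):
    """Hash distinct messages until two of them share a short digest."""
    seen = {}  # short digest -> first message that produced it
    counter = 0
    while True:
        message = f"message-{counter}".encode("ascii")
        tag = short_digest(message, num_bytes)
        if tag in seen:
            # Two different messages, same tag: a collision.
            return seen[tag], message, counter + 1
        seen[tag] = message
        counter += 1


if __name__ == "__main__":
    first, second, tried = birthday_search()
    # With 65,536 possible tags, a few hundred tries is typical, as the
    # square-root estimate suggests, rather than tens of thousands.
    print(first, second, tried)
```

Making the output space larger is what pushes the number of tries out of reach, which matches the point about key sizes above.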

1378
01:21:52,450 --> 01:21:55,610
Any questions about that?

1379
01:21:55,610 --> 01:21:56,110
OK.

1380
01:21:56,110 --> 01:21:56,610
Very good.

1381
01:21:56,610 --> 01:21:58,710
We're done for today.