1
00:00:01,170 --> 00:00:03,510
The following content is
provided under a Creative

2
00:00:03,510 --> 00:00:04,930
Commons license.

3
00:00:04,930 --> 00:00:07,120
Your support will help
MIT OpenCourseWare

4
00:00:07,120 --> 00:00:11,230
continue to offer high-quality
educational resources for free.

5
00:00:11,230 --> 00:00:13,770
To make a donation or to
view additional materials

6
00:00:13,770 --> 00:00:17,730
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:17,730 --> 00:00:18,610
at ocw.mit.edu.

8
00:00:23,542 --> 00:00:25,750
GABRIEL SANCHEZ-MARTINEZ:
Any questions on Homework 1

9
00:00:25,750 --> 00:00:26,680
before we get started?

10
00:00:29,308 --> 00:00:30,190
AUDIENCE: Yeah.

11
00:00:30,190 --> 00:00:33,100
GABRIEL SANCHEZ-MARTINEZ:
OK, fire away.

12
00:00:33,100 --> 00:00:36,940
AUDIENCE: I guess,
first, do you think

13
00:00:36,940 --> 00:00:39,400
we have like this
minimum cycle time,

14
00:00:39,400 --> 00:00:42,360
like a theoretical minimum cycle
time and then what was actually

15
00:00:42,360 --> 00:00:45,630
[INAUDIBLE] cycle time?

16
00:00:45,630 --> 00:00:49,890
GABRIEL SANCHEZ-MARTINEZ: So
cycle time, just to review--

17
00:00:49,890 --> 00:00:54,540
it's the time that
it takes a bus to--

18
00:00:54,540 --> 00:00:57,620
from the time
[AUDIO OUT] for a trip.

19
00:00:57,620 --> 00:01:01,530
It goes all the way one way,
has to wait at the other end

20
00:01:01,530 --> 00:01:05,010
to recover the schedule,
comes back, waits to recover,

21
00:01:05,010 --> 00:01:07,290
and is ready to
begin the next round.

22
00:01:07,290 --> 00:01:09,890
So that's a cycle.

23
00:01:09,890 --> 00:01:14,040
AUDIENCE: Since you have
[INAUDIBLE] going on,

24
00:01:14,040 --> 00:01:17,579
if you had 4.1 buses,
then you use a cycle time.

25
00:01:17,579 --> 00:01:18,995
Then obviously,
you can't do that?

26
00:01:18,995 --> 00:01:19,930
[INTERPOSING VOICES]

27
00:01:19,930 --> 00:01:21,315
GABRIEL SANCHEZ-MARTINEZ: So
you would need five buses--

28
00:01:21,315 --> 00:01:21,610
AUDIENCE: Yeah.

29
00:01:21,610 --> 00:01:23,860
GABRIEL SANCHEZ-MARTINEZ:
--if that's what you've got.

30
00:01:23,860 --> 00:01:26,892
Or you would have to do a
trade-off with reliability

31
00:01:26,892 --> 00:01:27,850
if that were to happen.
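
The rounding step in this exchange can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual relation that fleet size is cycle time divided by headway, rounded up; the function name and the 41-minute and 10-minute figures are hypothetical and do not come from the homework.

```python
import math


def buses_required(cycle_time_min: float, headway_min: float) -> int:
    """Smallest whole number of buses that can hold the given headway."""
    if cycle_time_min <= 0 or headway_min <= 0:
        raise ValueError("cycle time and headway must be positive")
    # A fractional result such as 4.1 must be rounded up to a whole bus.
    return math.ceil(cycle_time_min / headway_min)


# Hypothetical numbers: a 41-minute cycle at a 10-minute headway is 4.1 buses.
print(buses_required(41, 10))
```

Running the route with four buses instead would mean squeezing the cycle into four headways, which is where the reliability trade-off mentioned above comes from.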

32
00:01:31,650 --> 00:01:33,312
AUDIENCE: I think
most of my questions

33
00:01:33,312 --> 00:01:35,440
were on this very last
couple of questions.

34
00:01:35,440 --> 00:01:38,368
GABRIEL SANCHEZ-MARTINEZ: Yeah.

35
00:01:38,368 --> 00:01:41,680
AUDIENCE: We were aggregating
a bunch of data for--

36
00:01:41,680 --> 00:01:45,094
[INAUDIBLE] you did it
across both directions

37
00:01:45,094 --> 00:01:46,510
and then asked,
how does it change

38
00:01:46,510 --> 00:01:49,650
when you would like to evaluate
each direction separately

39
00:01:49,650 --> 00:01:51,062
in layover time?

40
00:01:51,062 --> 00:01:53,520
GABRIEL SANCHEZ-MARTINEZ: This
is the penultimate question,

41
00:01:53,520 --> 00:01:53,880
correct?

42
00:01:53,880 --> 00:01:54,120
AUDIENCE: Yeah.

43
00:01:54,120 --> 00:01:54,510
GABRIEL SANCHEZ-MARTINEZ:
So that's

44
00:01:54,510 --> 00:01:55,920
the hardest question
on the assignment.

45
00:01:55,920 --> 00:01:56,670
AUDIENCE: OK.

46
00:01:56,670 --> 00:01:58,836
GABRIEL SANCHEZ-MARTINEZ:
It is a challenge question

47
00:01:58,836 --> 00:02:02,430
because there are different
cases that you have to analyze.

48
00:02:02,430 --> 00:02:05,330
That's maybe the hint, right?

49
00:02:05,330 --> 00:02:07,330
There are some cases.

50
00:02:07,330 --> 00:02:09,537
And for each case,
there is a probability

51
00:02:09,537 --> 00:02:10,620
that that case will occur.

52
00:02:10,620 --> 00:02:11,830
AUDIENCE: Yeah.

53
00:02:11,830 --> 00:02:22,334
GABRIEL SANCHEZ-MARTINEZ: And--
let's see if this starts--

54
00:02:22,334 --> 00:02:23,750
there's a probability
that it will

55
00:02:23,750 --> 00:02:30,090
occur and then a consequence, or
something happens in that case.

56
00:02:30,090 --> 00:02:32,990
So you have to look at each case
and then aggregate the cases

57
00:02:32,990 --> 00:02:34,880
together, if that makes sense.
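
The general pattern behind this hint (a probability and a consequence for each case, then an aggregate) is an expected value. The sketch below is generic and is not the homework solution; the function name and the three cases are hypothetical.

```python
def expected_consequence(cases):
    """Aggregate (probability, consequence) pairs into an expected value."""
    cases = list(cases)
    total_probability = sum(p for p, _ in cases)
    # The cases should be exhaustive, so their probabilities should sum to 1.
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("case probabilities should sum to 1")
    return sum(p * consequence for p, consequence in cases)


# Hypothetical cases: (probability of the case, consequence in minutes).
example_cases = [(0.7, 0.0), (0.2, 5.0), (0.1, 12.0)]
print(expected_consequence(example_cases))
```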

58
00:02:34,880 --> 00:02:36,146
AUDIENCE: Yes.

59
00:02:36,146 --> 00:02:38,770
GABRIEL SANCHEZ-MARTINEZ: We're
taking questions for Assignment

60
00:02:38,770 --> 00:02:40,090
1, which is due on Thursday.

61
00:02:43,740 --> 00:02:44,880
Any other questions?

62
00:02:44,880 --> 00:02:47,565
AUDIENCE: That's it.

63
00:02:47,565 --> 00:02:48,440
AUDIENCE: [INAUDIBLE]

64
00:02:48,440 --> 00:02:51,200
GABRIEL SANCHEZ-MARTINEZ:
It is due at 4:00

65
00:02:51,200 --> 00:02:54,620
so at class time
essentially, yeah.

66
00:02:54,620 --> 00:02:56,410
I actually [AUDIO OUT]
if you 4:00.

67
00:02:56,410 --> 00:02:58,960
I said 4:05, so you
have five minutes.

68
00:03:01,876 --> 00:03:05,035
AUDIENCE: Can you [INAUDIBLE]
what assumptions there

69
00:03:05,035 --> 00:03:06,250
are [INAUDIBLE]?

70
00:03:10,485 --> 00:03:12,276
GABRIEL SANCHEZ-MARTINEZ:
In what question?

71
00:03:12,276 --> 00:03:14,463
AUDIENCE: When you
said it seems to be

72
00:03:14,463 --> 00:03:17,622
the reasoning or assumption
about the schedule [INAUDIBLE]?

73
00:03:17,622 --> 00:03:20,052
Which metric do you use?

74
00:03:20,052 --> 00:03:22,757
Based on the data,
which [INAUDIBLE]?

75
00:03:22,757 --> 00:03:25,340
GABRIEL SANCHEZ-MARTINEZ: Yeah,
so that's Question 3, correct?

76
00:03:25,340 --> 00:03:25,965
AUDIENCE: Yeah.

77
00:03:25,965 --> 00:03:28,887
GABRIEL SANCHEZ-MARTINEZ:
So I can't really explain.

78
00:03:28,887 --> 00:03:30,720
I can't give you the
answer to the question.

79
00:03:30,720 --> 00:03:34,250
So what I'm looking for
there is your intuition

80
00:03:34,250 --> 00:03:38,870
and your understanding of why
you would pick which statistics

81
00:03:38,870 --> 00:03:45,240
from Question 2, where it tells
you calculate all these things.

82
00:03:45,240 --> 00:03:48,500
Now I'm saying pick
from those statistics

83
00:03:48,500 --> 00:03:52,370
what you would use
for t and for r.

84
00:03:52,370 --> 00:03:54,770
And you may want to combine
different statistics

85
00:03:54,770 --> 00:03:57,450
for the computation of r.

86
00:03:57,450 --> 00:03:57,950
Yeah?

87
00:03:57,950 --> 00:04:02,132
AUDIENCE: [INAUDIBLE]
multiple valid responses but--

88
00:04:02,132 --> 00:04:04,590
GABRIEL SANCHEZ-MARTINEZ: Yes,
some more valid than others,

89
00:04:04,590 --> 00:04:06,510
but some that are
definitely invalid

90
00:04:06,510 --> 00:04:12,900
and some that are almost 100%
valid but not 100% valid.

91
00:04:12,900 --> 00:04:15,670
So there are several
correct answers,

92
00:04:15,670 --> 00:04:18,130
and some that are
very good answers

93
00:04:18,130 --> 00:04:21,220
because you can justify
the choice of the statistic

94
00:04:21,220 --> 00:04:23,310
conceptually.

95
00:04:23,310 --> 00:04:24,380
Yeah.

96
00:04:24,380 --> 00:04:26,920
Any other questions
on Homework 1?

97
00:04:26,920 --> 00:04:30,860
I can take some more questions
after class, if that's OK.

98
00:04:30,860 --> 00:04:36,260
So we had a snow day. I hope you had
a good time, or at least,

99
00:04:36,260 --> 00:04:37,910
you could use it to catch up.

100
00:04:37,910 --> 00:04:40,430
So the schedule is a
little different now.

101
00:04:40,430 --> 00:04:43,260
I've posted an update about
that on Stellar (class site).

102
00:04:43,260 --> 00:04:44,930
There's a new syllabus.

103
00:04:44,930 --> 00:04:47,970
And we're going to do
some [AUDIO OUT] different

104
00:04:47,970 --> 00:04:49,890
[AUDIO OUT].

105
00:04:49,890 --> 00:04:53,370
You may remember that we have
three introductory classes

106
00:04:53,370 --> 00:04:56,370
on topics of [INAUDIBLE].

107
00:04:56,370 --> 00:04:59,140
And then, we had modal
characteristics and roles.

108
00:04:59,140 --> 00:05:02,100
And then, [AUDIO OUT].

109
00:05:02,100 --> 00:05:03,788
We're going to
shuffle a little bit.

110
00:05:03,788 --> 00:05:08,770
[AUDIO OUT] Microphone working?

111
00:05:08,770 --> 00:05:13,360
So because the second assignment
is on data collection,

112
00:05:13,360 --> 00:05:14,770
we're going to cover that today.

113
00:05:14,770 --> 00:05:16,769
And we're going to give
you that homework today,

114
00:05:16,769 --> 00:05:19,810
so that you can get started
on the data collection side.

115
00:05:19,810 --> 00:05:23,080
Then, we're going to cover some
of the short-range [INAUDIBLE]

116
00:05:23,080 --> 00:05:24,670
of planning concepts.

117
00:05:24,670 --> 00:05:25,910
Nema is going to do that--

118
00:05:25,910 --> 00:05:26,480
Nema Nassir.

119
00:05:26,480 --> 00:05:29,540
You might recall him from
the previous lecture.

120
00:05:29,540 --> 00:05:33,370
And then, we'll finish
with [INAUDIBLE] and costs

121
00:05:33,370 --> 00:05:36,250
on March the 2nd, OK?

122
00:05:36,250 --> 00:05:39,860
So remember, there's no
class on Monday the 21st.

123
00:05:44,410 --> 00:05:46,400
AUDIENCE: You mean Tuesday?

124
00:05:46,400 --> 00:05:49,131
GABRIEL SANCHEZ-MARTINEZ:
Sorry, yes, Tuesday.

125
00:05:49,131 --> 00:05:50,630
I think there's
no class on Monday.

126
00:05:50,630 --> 00:05:52,190
And then, Tuesday
there are classes.

127
00:05:52,190 --> 00:05:53,420
But it's Monday's schedule.

128
00:05:53,420 --> 00:05:55,250
So we don't have class.

129
00:05:55,250 --> 00:05:58,320
Thank you for bringing that up.

130
00:05:58,320 --> 00:05:59,630
OK.

131
00:05:59,630 --> 00:06:03,610
I'll leave Homework 2 for when
we finish with the lecture.

132
00:06:03,610 --> 00:06:06,370
But I'll distribute it later.

133
00:06:06,370 --> 00:06:09,240
So let's just get
started on that.

134
00:06:09,240 --> 00:06:11,430
So data collection techniques
and program design--

135
00:06:11,430 --> 00:06:13,999
that's the topic for today.

136
00:06:13,999 --> 00:06:14,790
Here's the outline.

137
00:06:14,790 --> 00:06:17,660
So we're going to cover a
summary of current practice

138
00:06:17,660 --> 00:06:18,747
quite quickly.

139
00:06:18,747 --> 00:06:21,330
Then, we're going to talk about
data collection program design

140
00:06:21,330 --> 00:06:25,050
process, the needs, the data
needs, the techniques for data

141
00:06:25,050 --> 00:06:26,301
collection, the sampling.

142
00:06:26,301 --> 00:06:28,050
We're going to get
into the details of how

143
00:06:28,050 --> 00:06:29,890
we get sample sizes.

144
00:06:29,890 --> 00:06:32,730
And we're going to finish
with special considerations

145
00:06:32,730 --> 00:06:35,200
for surveys and
surveying techniques.

146
00:06:37,740 --> 00:06:38,610
So where are we?

147
00:06:38,610 --> 00:06:42,090
Where is the transit industry
in terms of data collection,

148
00:06:42,090 --> 00:06:44,370
and sampling, and these things?

149
00:06:44,370 --> 00:06:45,810
Largely, there's
been a transition

150
00:06:45,810 --> 00:06:48,047
from manual to automatic
data collection.

151
00:06:48,047 --> 00:06:50,130
As you might imagine, with
the internet of things,

152
00:06:50,130 --> 00:06:52,800
and sensors, and the
internet, and wireless,

153
00:06:52,800 --> 00:06:54,720
it used to be that
if you wanted to have

154
00:06:54,720 --> 00:06:56,100
statistics on your
running times,

155
00:06:56,100 --> 00:06:57,690
you had to send people out.

156
00:06:57,690 --> 00:06:59,880
We call those people checkers.

157
00:06:59,880 --> 00:07:03,330
And those checkers would
have notebooks and record

158
00:07:03,330 --> 00:07:05,400
running times, and number
of people boarding,

159
00:07:05,400 --> 00:07:06,660
and these things.

160
00:07:06,660 --> 00:07:09,950
Nowadays, with the modern
systems, especially

161
00:07:09,950 --> 00:07:13,920
the modern systems, we have
several sensors and types

162
00:07:13,920 --> 00:07:16,410
of sensors that collect
some of that data for us.

163
00:07:16,410 --> 00:07:20,085
So we're going to
cover both approaches.

164
00:07:20,085 --> 00:07:23,054
[INAUDIBLE] data
collection to supplement

165
00:07:23,054 --> 00:07:24,220
[INAUDIBLE] data collection.

166
00:07:24,220 --> 00:07:27,730
And if you happen to be
consulting for a developing

167
00:07:27,730 --> 00:07:32,560
country that is working with a
system that has not yet brought

168
00:07:32,560 --> 00:07:35,500
in automatic data
collection technologies,

169
00:07:35,500 --> 00:07:39,100
it's also useful to know
all about the manual design

170
00:07:39,100 --> 00:07:42,120
and manual data
collection process.

171
00:07:42,120 --> 00:07:44,710
[AUDIO OUT] took this
class and ended up

172
00:07:44,710 --> 00:07:47,980
working in large consulting
firms have gone off

173
00:07:47,980 --> 00:07:52,369
to help countries put
in new transit systems.

174
00:07:52,369 --> 00:07:54,160
And one of the first
things they have to do

175
00:07:54,160 --> 00:07:59,222
is go back to these slides and see
what the plan is going to be,

176
00:07:59,222 --> 00:08:01,180
and how many people you
need, and how much it's

177
00:08:01,180 --> 00:08:01,930
going to cost.

178
00:08:01,930 --> 00:08:04,510
So very useful topic.

179
00:08:04,510 --> 00:08:07,519
So as I said, there's
automatic data collection.

180
00:08:07,519 --> 00:08:08,810
There's manual data collection.

181
00:08:08,810 --> 00:08:11,860
There's sometimes a mix of
data collection techniques.

182
00:08:11,860 --> 00:08:14,470
Often, what happens is
that we just send people

183
00:08:14,470 --> 00:08:15,970
out and collect data.

184
00:08:15,970 --> 00:08:19,420
Or we just extract a sample of
automatically collected data.

185
00:08:19,420 --> 00:08:21,970
And we don't really think about
sampling, and the confidence

186
00:08:21,970 --> 00:08:24,750
interval, and how sure
are we of that result

187
00:08:24,750 --> 00:08:27,490
that we're going to influence
policy or make decisions

188
00:08:27,490 --> 00:08:29,260
that will affect service.

189
00:08:29,260 --> 00:08:31,390
How sure are we of those?

190
00:08:31,390 --> 00:08:33,640
So statistical validity.
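
One common way to answer "how sure are we" is a confidence interval around a sample mean. The sketch below assumes a normal approximation with z = 1.96 for roughly 95% confidence; the function name and the running-time values are hypothetical, and for small samples a t-based interval would be more appropriate.

```python
import math
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Return (low, high) bounds for the mean using a normal approximation."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.mean(sample)
    # Standard error shrinks with the square root of the sample size.
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return mean - z * standard_error, mean + z * standard_error


# Hypothetical running times in minutes, as a checker might record them.
running_times = [30, 32, 31, 35, 29, 33, 34, 30]
print(mean_confidence_interval(running_times))
```

The interval narrows as more observations are added, which is why sample size matters for decisions based on the estimate.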

191
00:08:33,640 --> 00:08:37,179
Often, there's an
inefficient use of data.

192
00:08:37,179 --> 00:08:41,919
And ADCS, which is Automatic
Data Collection Systems--

193
00:08:41,919 --> 00:08:44,260
we'll use that abbreviation
throughout the course--

194
00:08:44,260 --> 00:08:47,020
presents a major opportunity
for strengthening data

195
00:08:47,020 --> 00:08:48,260
to support decision making.

196
00:08:48,260 --> 00:08:49,790
We'll talk about
how that happens.

197
00:08:49,790 --> 00:08:52,520
Let's first compare manual
and automatic data collection.

198
00:08:52,520 --> 00:08:54,386
So what happens with
manual data collection?

199
00:08:54,386 --> 00:08:55,510
You hire people, as I said.

200
00:08:55,510 --> 00:08:56,950
You hire checkers.

201
00:08:56,950 --> 00:08:59,860
So initially, there's
no setup cost.

202
00:08:59,860 --> 00:09:01,950
There's a low
capital cost to that.

203
00:09:01,950 --> 00:09:04,210
But there's a high marginal
cost because if you

204
00:09:04,210 --> 00:09:06,680
want to collect more data,
you have to hire more people.

205
00:09:06,680 --> 00:09:08,134
Does that make sense?

206
00:09:08,134 --> 00:09:10,300
If you want to bring in an
automatic data collection

207
00:09:10,300 --> 00:09:12,341
system, you might have to
retrofit all your buses

208
00:09:12,341 --> 00:09:13,930
with AVL sensors.

209
00:09:13,930 --> 00:09:16,410
And that's going to
cost you initially.

210
00:09:16,410 --> 00:09:19,710
So that's a high
capital cost relatively.

211
00:09:19,710 --> 00:09:22,420
But low marginal cost-- once
you have those systems in place,

212
00:09:22,420 --> 00:09:24,160
they keep collecting
data for you.

213
00:09:24,160 --> 00:09:25,300
And it's almost free.

214
00:09:25,300 --> 00:09:27,760
You do need some maintenance
on this equipment.

215
00:09:27,760 --> 00:09:31,510
But compared to
manual data collection,

216
00:09:31,510 --> 00:09:33,310
you have low marginal cost.
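
The capital versus marginal cost comparison can be made concrete with a break-even calculation. This is a sketch under a simple linear cost assumption; the function names and all dollar figures are hypothetical and only illustrate the shape of the trade-off.

```python
def manual_cost(observations, cost_per_observation):
    """Manual collection: essentially no capital cost, high cost per observation."""
    return observations * cost_per_observation


def automatic_cost(observations, capital_cost, cost_per_observation):
    """Automatic collection: high capital cost, low cost per observation."""
    return capital_cost + observations * cost_per_observation


def break_even_observations(capital_cost, manual_unit_cost, automatic_unit_cost):
    """Number of observations at which both approaches cost the same."""
    if manual_unit_cost <= automatic_unit_cost:
        raise ValueError("manual unit cost must exceed automatic unit cost")
    return capital_cost / (manual_unit_cost - automatic_unit_cost)


# Hypothetical figures: $100,000 to equip the fleet, $2.00 per manual
# observation, $0.01 per automatic observation (maintenance and storage).
print(break_even_observations(100_000, 2.00, 0.01))
```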

217
00:09:33,310 --> 00:09:35,770
Because of that marginal
cost difference,

218
00:09:35,770 --> 00:09:38,320
it tends to happen that when
you have manual data collection,

219
00:09:38,320 --> 00:09:41,920
you only pay checkers
for small sample sizes--

220
00:09:41,920 --> 00:09:43,300
just what you need.

221
00:09:43,300 --> 00:09:46,930
Whereas, once you put in
automatic data collection

222
00:09:46,930 --> 00:09:49,720
systems, they keep
collecting data.

223
00:09:49,720 --> 00:09:52,110
So you get much bigger data.

224
00:09:52,110 --> 00:09:53,950
Bless you.

225
00:09:53,950 --> 00:09:57,430
OK, in both cases, we can
collect data and analyze it

226
00:09:57,430 --> 00:09:59,860
for aggregate analysis
and disaggregate analysis.

227
00:09:59,860 --> 00:10:01,720
So you might want
passenger-specific data

228
00:10:01,720 --> 00:10:02,620
on things.

229
00:10:02,620 --> 00:10:06,400
Or you might want things
like just averages

230
00:10:06,400 --> 00:10:09,340
and aggregate things,
total number of passengers

231
00:10:09,340 --> 00:10:10,960
using the system.

232
00:10:10,960 --> 00:10:12,940
And when you're doing
manual data collection,

233
00:10:12,940 --> 00:10:14,890
you can look at
quantitative things, things

234
00:10:14,890 --> 00:10:16,440
you can measure and count.

235
00:10:16,440 --> 00:10:19,820
Or you can also observe
things qualitatively.

236
00:10:19,820 --> 00:10:22,090
One example that I
saw in a recent paper

237
00:10:22,090 --> 00:10:26,680
was considering the
[? therivation ?]

238
00:10:26,680 --> 00:10:28,719
by student in some country.

239
00:10:28,719 --> 00:10:30,760
And they didn't ask people
if they were students.

240
00:10:30,760 --> 00:10:32,582
They were looking at people's--

241
00:10:32,582 --> 00:10:33,790
more or less, are they young?

242
00:10:33,790 --> 00:10:35,410
Are they carrying a backpack?

243
00:10:35,410 --> 00:10:38,390
And that would be the
labeling for your student.

244
00:10:38,390 --> 00:10:42,010
So that's something that a
sensor might not do so well.

245
00:10:42,010 --> 00:10:44,270
Although now with machine
learning, who knows?

246
00:10:44,270 --> 00:10:45,890
But we haven't seen that so far.

247
00:10:45,890 --> 00:10:48,580
So you can do
qualitative observations

248
00:10:48,580 --> 00:10:50,410
when you're doing
manual data collection.

249
00:10:50,410 --> 00:10:52,810
Manual data collection
tends to be unreliable,

250
00:10:52,810 --> 00:10:56,020
especially when people
aren't very well trained

251
00:10:56,020 --> 00:10:59,320
and when you have a group of
different people collecting

252
00:10:59,320 --> 00:10:59,830
data.

253
00:10:59,830 --> 00:11:01,621
So each person might
have different biases.

254
00:11:01,621 --> 00:11:05,020
It's hard to reproduce the
exact bias across persons.

255
00:11:05,020 --> 00:11:07,870
With automatic data
collection, you do get errors.

256
00:11:07,870 --> 00:11:10,450
And often, they
are not corrected.

257
00:11:10,450 --> 00:11:14,260
But if you do correct them,
and you estimate those biases

258
00:11:14,260 --> 00:11:18,550
and adjust for them, you can end
up with a better result.

259
00:11:18,550 --> 00:11:21,280
Because of the small
sample sizes in manual data

260
00:11:21,280 --> 00:11:25,180
collection, you tend to
have limited spatial

261
00:11:25,180 --> 00:11:27,290
and temporal coverage of data.

262
00:11:27,290 --> 00:11:29,650
So for example, if you're
interested in ridership

263
00:11:29,650 --> 00:11:34,900
in the system, it's unlikely
that you will cover ridership

264
00:11:34,900 --> 00:11:38,650
in holidays for
[INAUDIBLE] system

265
00:11:38,650 --> 00:11:40,330
because there are
only a few holidays.

266
00:11:40,330 --> 00:11:44,350
And usually, you're not
mostly interested in holidays.

267
00:11:44,350 --> 00:11:48,160
So chances are, you won't have
data collection for holidays.

268
00:11:48,160 --> 00:11:50,320
Whereas once you install
automatic data collection

269
00:11:50,320 --> 00:11:51,880
systems, they keep
collecting data.

270
00:11:51,880 --> 00:11:56,500
So you get data at midnight
on President's Day.

271
00:11:56,500 --> 00:11:59,350
So they're always on.

272
00:11:59,350 --> 00:12:01,210
They're always collecting data.

273
00:12:01,210 --> 00:12:06,170
Manual data needs to be checked,
cleaned, analyzed, coded,

274
00:12:06,170 --> 00:12:08,670
and sometimes put into systems
before they can be analyzed.

275
00:12:08,670 --> 00:12:09,670
That could take a while.

276
00:12:09,670 --> 00:12:11,320
You need to hire
people to do that.

277
00:12:11,320 --> 00:12:15,490
Whereas automatic data
collection systems often

278
00:12:15,490 --> 00:12:17,969
send their data to databases
in real-time or very

279
00:12:17,969 --> 00:12:18,760
close to real-time.

280
00:12:18,760 --> 00:12:24,580
[INAUDIBLE] you can start
analyzing things the next day.

281
00:12:24,580 --> 00:12:28,750
So you arrive in the morning to
your desk at a transit agency,

282
00:12:28,750 --> 00:12:30,790
and you have performance
metrics for yesterday.

283
00:12:30,790 --> 00:12:33,520
So you wouldn't be able to do
that unless you have people

284
00:12:33,520 --> 00:12:36,250
working very hard if
you're using a manual data

285
00:12:36,250 --> 00:12:37,870
collection system.

286
00:12:37,870 --> 00:12:41,050
When we talk about automatic
data collection systems,

287
00:12:41,050 --> 00:12:42,220
there are many.

288
00:12:42,220 --> 00:12:47,630
But there are three types that
we refer to very, very often.

289
00:12:47,630 --> 00:12:51,250
And so the first one is AFC,
Automatic Fare Collection

290
00:12:51,250 --> 00:12:52,180
Systems.

291
00:12:52,180 --> 00:12:54,710
This is your fare box or your
fare gates and your smart card,

292
00:12:54,710 --> 00:12:55,370
your Charlie Card.

293
00:12:55,370 --> 00:12:56,078
You're in Boston.

294
00:12:56,078 --> 00:12:57,210
You tap to enter the bus.

295
00:12:57,210 --> 00:13:00,040
And you tap to enter
the subway system.

296
00:13:00,040 --> 00:13:03,220
Increasingly, it's based
on contactless smart cards.

297
00:13:03,220 --> 00:13:04,660
And those contactless
smart cards

298
00:13:04,660 --> 00:13:06,760
have some sort of
RFID technology

299
00:13:06,760 --> 00:13:08,440
with a unique identifier.

300
00:13:08,440 --> 00:13:10,780
When you tap that
card to the sensor,

301
00:13:10,780 --> 00:13:13,240
the sensor will read
that identifier.

302
00:13:13,240 --> 00:13:16,240
And it'll do things like
fare calculation for you.

303
00:13:16,240 --> 00:13:18,760
But that record gets
sent to a database.

304
00:13:18,760 --> 00:13:23,680
And it's there for people
like us to analyze and make

305
00:13:23,680 --> 00:13:25,340
good use of it for planning.

306
00:13:25,340 --> 00:13:29,770
So it tends to provide entry
information almost always.

307
00:13:29,770 --> 00:13:34,810
In some systems, like the
Washington, DC Metro or the TfL

308
00:13:34,810 --> 00:13:37,600
subway, you tap in
to enter and exit.

309
00:13:37,600 --> 00:13:41,320
So you have both origins
and destinations.

310
00:13:41,320 --> 00:13:43,690
And if you always
have the systems on,

311
00:13:43,690 --> 00:13:47,050
then you have full spatial
and temporal coverage

312
00:13:47,050 --> 00:13:51,100
of all of the use of the system
at an individual passenger

313
00:13:51,100 --> 00:13:51,600
level.
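
When a system records both the entry and the exit tap, each record already carries an origin and a destination, so an origin-destination table is a simple count. The sketch below assumes each trip has been reduced to a (card ID, entry station, exit station) tuple; the record layout, function name, and station labels are hypothetical.

```python
from collections import Counter


def od_counts(trips):
    """Count trips for each (origin, destination) pair."""
    return Counter((origin, destination) for _, origin, destination in trips)


# Hypothetical tap-in/tap-out records: (card_id, entry_station, exit_station).
trips = [
    ("card-1", "A", "B"),
    ("card-2", "A", "B"),
    ("card-3", "B", "C"),
    ("card-1", "B", "A"),
]
for (origin, destination), count in sorted(od_counts(trips).items()):
    print(origin, destination, count)
```

Because the card ID stays on each record, the same data also supports the disaggregate, passenger-level analysis described above.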

314
00:13:51,600 --> 00:13:55,040
So very disaggregate--
sorry about that.

315
00:13:55,040 --> 00:13:57,320
Traditionally, these
systems are not real-time.

316
00:13:57,320 --> 00:14:01,340
So it might take a while
for those transactions

317
00:14:01,340 --> 00:14:03,170
to make it to the
data warehouse, where

318
00:14:03,170 --> 00:14:05,810
they're available for
planners to analyze.

319
00:14:05,810 --> 00:14:10,070
The calculation of how much
fare to charge, in some systems,

320
00:14:10,070 --> 00:14:11,000
is in real-time.

321
00:14:11,000 --> 00:14:13,400
In other systems like
the Charlie Card,

322
00:14:13,400 --> 00:14:17,210
the stored value that you
have is stored on your card.

323
00:14:17,210 --> 00:14:21,020
So it may take a while if
you tap at a bus for that bus

324
00:14:21,020 --> 00:14:23,570
to go to a garage
and get probed--

325
00:14:23,570 --> 00:14:25,940
and for the data that has
been stored in that bus

326
00:14:25,940 --> 00:14:31,500
to be extracted from that
bus to the central server.

327
00:14:31,500 --> 00:14:33,171
There is a move--

328
00:14:33,171 --> 00:14:34,920
and we'll talk more
about this when we get

329
00:14:34,920 --> 00:14:37,020
to fare policy and technology--

330
00:14:37,020 --> 00:14:39,810
towards using mobile
phone payments

331
00:14:39,810 --> 00:14:42,820
and using contactless
bank card payment systems.

332
00:14:42,820 --> 00:14:45,840
And those systems often
do the full transaction

333
00:14:45,840 --> 00:14:47,040
over the air in real-time.

334
00:14:47,040 --> 00:14:49,770
So we're starting to
look at the possibility

335
00:14:49,770 --> 00:14:52,170
of having all this data
in real-time or almost

336
00:14:52,170 --> 00:14:53,130
in real-time.

337
00:14:53,130 --> 00:14:54,110
But it's not there yet.

338
00:14:54,110 --> 00:14:56,360
AUDIENCE: [INAUDIBLE] can I
ask a question about that?

339
00:14:56,360 --> 00:14:56,980
GABRIEL SANCHEZ-MARTINEZ:
Yeah, of course.

340
00:14:56,980 --> 00:14:59,305
AUDIENCE: In terms
of smart card,

341
00:14:59,305 --> 00:15:01,449
where this balance is
stored on the card--

342
00:15:01,449 --> 00:15:02,740
GABRIEL SANCHEZ-MARTINEZ: Yeah.

343
00:15:02,740 --> 00:15:06,134
AUDIENCE: --if one can figure
out how to hack that card--

344
00:15:06,134 --> 00:15:07,425
GABRIEL SANCHEZ-MARTINEZ: Yeah.

345
00:15:07,425 --> 00:15:08,966
AUDIENCE: --then
what can [INAUDIBLE]

346
00:15:08,966 --> 00:15:12,877
fares through an elaborate
technology that I couldn't do

347
00:15:12,877 --> 00:15:14,320
and most people couldn't do.

348
00:15:14,320 --> 00:15:15,290
But maybe some could.

349
00:15:15,290 --> 00:15:17,081
GABRIEL SANCHEZ-MARTINEZ:
Yeah, definitely.

350
00:15:17,081 --> 00:15:19,880
So the Charlie Card system
is an example about--

351
00:15:19,880 --> 00:15:23,480
actually, MIT students
were the first to hack it.

352
00:15:23,480 --> 00:15:24,980
AUDIENCE: I'm not surprised.

353
00:15:24,980 --> 00:15:28,310
GABRIEL SANCHEZ-MARTINEZ:
So it's older technology.

354
00:15:28,310 --> 00:15:30,530
It used a low-bit
encryption key.

355
00:15:30,530 --> 00:15:32,660
That's a symmetric
encryption key.

356
00:15:32,660 --> 00:15:35,731
And they just brute forced it.

357
00:15:35,731 --> 00:15:36,980
They figured out what the key was.

358
00:15:36,980 --> 00:15:39,260
They happened to use the
same key for every card.

359
00:15:39,260 --> 00:15:43,250
So once you broke that key,
you could take any card.

360
00:15:43,250 --> 00:15:45,844
And with the right hardware,
you could add however much value

361
00:15:45,844 --> 00:15:46,760
you want to that card.

362
00:15:46,760 --> 00:15:47,260
And--

363
00:15:47,260 --> 00:15:48,140
AUDIENCE: [INAUDIBLE]

364
00:15:48,140 --> 00:15:50,390
GABRIEL SANCHEZ-MARTINEZ:
Yeah, yeah, exactly.

365
00:15:50,390 --> 00:15:52,700
We don't think it's
been a major problem.

366
00:15:52,700 --> 00:15:54,590
AUDIENCE: But it happens.

367
00:15:54,590 --> 00:15:56,798
GABRIEL SANCHEZ-MARTINEZ:
I haven't seen MIT students

368
00:15:56,798 --> 00:15:58,550
selling special MIT cards.

369
00:15:58,550 --> 00:16:02,690
But that would be
criminal, of course.

370
00:16:02,690 --> 00:16:06,450
Yeah, so newer systems have
much stronger encryption.

371
00:16:06,450 --> 00:16:10,410
And they have different
encryption keys for each card.

372
00:16:10,410 --> 00:16:13,970
And certainly, when we're moving
towards contactless bank cards,

373
00:16:13,970 --> 00:16:17,570
we're talking about a much
more secure encryption.

374
00:16:17,570 --> 00:16:20,270
It's your credit card
that you're using to tap

375
00:16:20,270 --> 00:16:21,539
or your Android or Apple Pay.

376
00:16:21,539 --> 00:16:23,080
AUDIENCE: Account
based [INAUDIBLE].

377
00:16:23,080 --> 00:16:24,788
GABRIEL SANCHEZ-MARTINEZ:
Account based--

378
00:16:24,788 --> 00:16:27,860
and essentially, what you
have is a token with an ID.

379
00:16:27,860 --> 00:16:32,020
And then, the balance is not
even stored on your card.

380
00:16:32,020 --> 00:16:36,320
The account server is handling
the balance and those things.

381
00:16:36,320 --> 00:16:39,740
So much more difficult to break.

382
00:16:39,740 --> 00:16:42,380
Yup.

383
00:16:42,380 --> 00:16:45,950
OK, AVL systems, or Automatic
Vehicle Location systems--

384
00:16:45,950 --> 00:16:49,250
so these are systems that
track vehicle movement.

385
00:16:49,250 --> 00:16:51,490
So for bus, they tend
to be based on GPS.

386
00:16:51,490 --> 00:16:54,520
You have GPS on a bus, on the
top of the bus, a little hub.

387
00:16:54,520 --> 00:16:58,960
And it collects data every five
seconds or every 10 seconds.

388
00:16:58,960 --> 00:17:04,119
And these positions might
get sent either in real-time,

389
00:17:04,119 --> 00:17:07,089
or maybe they get stored
on the onboard computer

390
00:17:07,089 --> 00:17:10,920
and then are extracted when
the bus reaches the garage.

391
00:17:10,920 --> 00:17:17,160
So not just GPS-- sophisticated
AVL systems for bus

392
00:17:17,160 --> 00:17:21,930
also have gyroscopes to do
inertial navigation and dead

393
00:17:21,930 --> 00:17:25,380
reckoning, especially when
the GPS precision drops.

394
00:17:25,380 --> 00:17:28,830
And that happens especially
with the urban canyon effect.

395
00:17:28,830 --> 00:17:31,540
If you have tall buildings,
GPS signal bounces around.

396
00:17:31,540 --> 00:17:36,950
The dilution of precision messes
up the position of the bus.

397
00:17:36,950 --> 00:17:38,790
Or maybe you're
entering a tunnel,

398
00:17:38,790 --> 00:17:42,210
and you want to
continue to get updates

399
00:17:42,210 --> 00:17:43,800
of positions inside the tunnel.

400
00:17:43,800 --> 00:17:45,390
So this is a
temporary system that

401
00:17:45,390 --> 00:17:49,500
kicks in and interpolates
positions and figures

402
00:17:49,500 --> 00:17:51,780
out how the bus is moving.

403
00:17:51,780 --> 00:17:54,119
For a train, it's usually
based on track circuits.

404
00:17:54,119 --> 00:17:56,160
So we're going to talk
more about track circuits.

405
00:17:56,160 --> 00:17:59,160
But essentially, a
track knows if a train

406
00:17:59,160 --> 00:18:02,640
is occupying that segment or
not occupying that segment.

407
00:18:02,640 --> 00:18:09,570
And there are often some sensors
that read with RFID technology

408
00:18:09,570 --> 00:18:11,670
the ID number of a car.

409
00:18:11,670 --> 00:18:14,190
And sometimes, you have a
sensor in the front of each car

410
00:18:14,190 --> 00:18:15,750
and [AUDIO OUT] each car.

411
00:18:15,750 --> 00:18:20,490
And so a computer will look
up the sequence of readings

412
00:18:20,490 --> 00:18:23,610
and follow track circuits
as they are being occupied

413
00:18:23,610 --> 00:18:25,560
and unoccupied--

414
00:18:25,560 --> 00:18:29,530
and in that manner, track
trains throughout the system.
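
The idea of following track circuits as they become occupied and unoccupied can be sketched as a small event replay. The event layout (timestamp, circuit ID, occupied flag), the function name, and the circuit names are hypothetical; real signaling systems are considerably more involved.

```python
def occupied_circuits(events):
    """Replay time-ordered events and return the circuits still occupied.

    Each event is (timestamp_s, circuit_id, is_occupied).
    """
    occupied = []
    for _, circuit_id, is_occupied in events:
        if is_occupied:
            if circuit_id not in occupied:
                occupied.append(circuit_id)
        elif circuit_id in occupied:
            occupied.remove(circuit_id)
    return occupied


# Hypothetical train moving from circuit TC1 into TC2, then clearing TC1.
events = [
    (0, "TC1", True),
    (10, "TC2", True),
    (12, "TC1", False),
]
print(occupied_circuits(events))
```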

415
00:18:29,530 --> 00:18:32,790
These systems were put in
place mostly for safety

416
00:18:32,790 --> 00:18:35,670
to prevent train crashes.

417
00:18:35,670 --> 00:18:39,330
And because of that, you
needed to know where a bus

418
00:18:39,330 --> 00:18:41,310
or where a train was.

419
00:18:41,310 --> 00:18:42,900
They are available in real-time.

420
00:18:42,900 --> 00:18:44,460
They were designed
from the beginning

421
00:18:44,460 --> 00:18:46,000
to track vehicles in real-time.

422
00:18:46,000 --> 00:18:48,086
So that's what we have.

423
00:18:48,086 --> 00:18:49,710
I guess what's newer
is that now, we're

424
00:18:49,710 --> 00:18:52,650
collecting them and keeping
them in a data warehouse

425
00:18:52,650 --> 00:18:54,730
so that we can
analyze running times.

426
00:18:54,730 --> 00:18:56,895
AUDIENCE: [INAUDIBLE]
these systems have benefit

427
00:18:56,895 --> 00:18:58,320
to the consumer?

428
00:18:58,320 --> 00:18:58,680
GABRIEL SANCHEZ-MARTINEZ:
They do.

429
00:18:58,680 --> 00:19:00,638
And that's the newest
thing that has happened--

430
00:19:00,638 --> 00:19:02,460
that nobody thought
about consumers when

431
00:19:02,460 --> 00:19:04,080
they were put in place.

432
00:19:04,080 --> 00:19:07,110
So yeah, we are
talking about tracking,

433
00:19:07,110 --> 00:19:09,780
knowing how many minutes
I have to wait for my bus,

434
00:19:09,780 --> 00:19:10,725
for example.

435
00:19:10,725 --> 00:19:13,380
And those things are pushed
through a public API,

436
00:19:13,380 --> 00:19:16,200
so that if I'm a
smartphone app developer,

437
00:19:16,200 --> 00:19:19,950
I can go ahead and pull
data from this next bus app

438
00:19:19,950 --> 00:19:20,979
and make an app.

439
00:19:20,979 --> 00:19:23,520
And so people can download it,
and they know how many minutes

440
00:19:23,520 --> 00:19:24,380
they have to wait.

441
00:19:24,380 --> 00:19:27,800
Yeah, so definitely.

442
00:19:27,800 --> 00:19:31,170
So we have seen a lot of AVL
being pushed in that manner.

443
00:19:31,170 --> 00:19:35,850
We have not seen so much AFC
data or APC data being pushed.

444
00:19:35,850 --> 00:19:37,980
Obviously, you wouldn't
want all the details

445
00:19:37,980 --> 00:19:39,840
of AFC being pushed.

446
00:19:39,840 --> 00:19:42,540
But you might want to know
how crowded is my next bus,

447
00:19:42,540 --> 00:19:45,100
or how crowded is my next train.

448
00:19:45,100 --> 00:19:46,860
And you might actually
alter your decision

449
00:19:46,860 --> 00:19:48,780
whether to wait
for a crowded train

450
00:19:48,780 --> 00:19:52,640
or walk a longer time
based on that information.

451
00:19:52,640 --> 00:19:54,127
So that's coming.

452
00:19:54,127 --> 00:19:55,710
I think, in the next
few years, that's

453
00:19:55,710 --> 00:19:57,900
going to start happening.

454
00:19:57,900 --> 00:20:00,690
So passenger counting-- many
different technologies exist.

455
00:20:00,690 --> 00:20:05,700
For bus, we tend to have these
optical sensors in the back.

456
00:20:05,700 --> 00:20:08,640
You might see them if
you pay attention--

457
00:20:08,640 --> 00:20:09,740
broken beam sensors.

458
00:20:09,740 --> 00:20:12,210
They look like two little
eyes with two little mirrors

459
00:20:12,210 --> 00:20:13,320
on each door.

460
00:20:13,320 --> 00:20:16,260
And so when you cross
the beams, if you

461
00:20:16,260 --> 00:20:18,870
break one beam first
and then the other,

462
00:20:18,870 --> 00:20:20,280
that sensor will know--

463
00:20:20,280 --> 00:20:22,230
is a person coming into the bus?

464
00:20:22,230 --> 00:20:24,270
Or is a person exiting the bus?
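
The direction logic just described can be sketched in a few lines. This is only an illustration, assuming each door reports an ordered list of beam breaks labeled "outer" and "inner" (hypothetical labels, not any vendor's format), and that each person crossing interrupts both beams once.

```python
def count_door_events(beam_breaks):
    """Count boardings and alightings from an ordered list of beam breaks.

    beam_breaks: sequence of "outer" / "inner" labels (hypothetical names)
    in the order the two beams at one door were interrupted.
    """
    boardings = 0
    alightings = 0
    # Consume the breaks two at a time: one crossing interrupts both beams.
    for first, second in zip(beam_breaks[0::2], beam_breaks[1::2]):
        if (first, second) == ("outer", "inner"):
            boardings += 1      # outer beam first: person is coming in
        elif (first, second) == ("inner", "outer"):
            alightings += 1     # inner beam first: person is going out
        # Any other pair (same beam twice) is ignored as noise.
    return boardings, alightings
```

Real sensors also deal with timing, partial crossings, and several people at once, which this sketch ignores.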

465
00:20:24,270 --> 00:20:26,100
And you have that at each door.

466
00:20:26,100 --> 00:20:31,470
And it counts those beams
going in and going out.

467
00:20:31,470 --> 00:20:34,110
And often, this is
slightly inaccurate.

468
00:20:34,110 --> 00:20:36,780
So you might get more boardings
than alightings for a given trip.

469
00:20:36,780 --> 00:20:39,150
So at the end of
a trip, whatever

470
00:20:39,150 --> 00:20:41,950
remains in terms of imbalance
between boardings and alightings

471
00:20:41,950 --> 00:20:42,900
gets zeroed out.

472
00:20:42,900 --> 00:20:46,910
And the error is distributed
throughout that trip

473
00:20:46,910 --> 00:20:48,372
that was just run.
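
One simple way to picture this balancing step, assuming per-stop boarding and alighting counts for a single trip: scale both series so that their totals agree, which forces the load at the end of the trip to zero. Proportional scaling is an assumption made for illustration; actual APC vendors may distribute the error differently.

```python
def balance_counts(ons, offs):
    """Scale boardings and alightings so both sum to the same trip total."""
    total_on, total_off = sum(ons), sum(offs)
    if total_on == 0 or total_off == 0:
        return list(ons), list(offs)  # nothing sensible to scale
    target = (total_on + total_off) / 2  # split the imbalance evenly
    balanced_ons = [x * target / total_on for x in ons]
    balanced_offs = [x * target / total_off for x in offs]
    return balanced_ons, balanced_offs


def loads(ons, offs):
    """Passenger load on board after each stop."""
    load = 0.0
    result = []
    for boarded, alighted in zip(ons, offs):
        load += boarded - alighted
        result.append(load)
    return result
```

For example, raw counts of 15 boardings and 13 alightings are both scaled to 14, so the trip ends empty.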

474
00:20:48,372 --> 00:20:50,580
And often, you still have
to do some error correction

475
00:20:50,580 --> 00:20:51,360
after that.

476
00:20:51,360 --> 00:20:54,420
But it's a way of counting
people getting on and off.

477
00:20:54,420 --> 00:20:57,060
And that's useful to get
how many people are riding

478
00:20:57,060 --> 00:21:00,330
the system and also
the passenger miles--

479
00:21:00,330 --> 00:21:02,720
the passengers multiplied
by distance, which is often

480
00:21:02,720 --> 00:21:07,380
a required reporting element
in things like the NTD,

481
00:21:07,380 --> 00:21:10,020
the National Transit Database.

482
00:21:10,020 --> 00:21:14,420
So for rail systems,
we have gates

483
00:21:14,420 --> 00:21:16,231
that count how many
times they open

484
00:21:16,231 --> 00:21:17,480
and how many times they close.

485
00:21:17,480 --> 00:21:21,530
So you might have that
kind of counting in rail.

486
00:21:21,530 --> 00:21:23,150
You also have
video-based counting--

487
00:21:23,150 --> 00:21:27,710
so camera feeds that
can be hooked up

488
00:21:27,710 --> 00:21:31,990
to a system that will
essentially track objects moving

489
00:21:31,990 --> 00:21:33,270
inside that frame.

490
00:21:33,270 --> 00:21:36,260
And you can count things
that cross a certain line,

491
00:21:36,260 --> 00:21:37,370
for example.

492
00:21:37,370 --> 00:21:42,020
And you could do
that to count flows.

493
00:21:42,020 --> 00:21:45,200
And then for train, we also
have the weight systems.

494
00:21:45,200 --> 00:21:47,870
So this is only in trains.

495
00:21:47,870 --> 00:21:50,780
The braking systems in
trains apply braking force

496
00:21:50,780 --> 00:21:53,780
in proportion to the
load on each car.

497
00:21:53,780 --> 00:21:55,340
So if you have a
very heavy car, you

498
00:21:55,340 --> 00:21:58,490
need to apply stronger braking
force than in a car that

499
00:21:58,490 --> 00:22:00,050
is almost empty.

500
00:22:00,050 --> 00:22:04,400
If you don't do that, then
you apply a lot more force

501
00:22:04,400 --> 00:22:06,430
per weight on the lighter car.

502
00:22:06,430 --> 00:22:10,070
That car is going to be the
one pushing the other cars

503
00:22:10,070 --> 00:22:12,530
or pulling the other cars
through the coupling.

504
00:22:12,530 --> 00:22:14,480
And that will eventually
break the [INAUDIBLE]

505
00:22:14,480 --> 00:22:15,360
at a faster rate.

506
00:22:15,360 --> 00:22:18,560
So what you want is,
each car to slow down

507
00:22:18,560 --> 00:22:21,690
at the same rate by itself
as much as possible.

508
00:22:21,690 --> 00:22:24,620
And for that, you need to brake
in proportion to the weight.

509
00:22:24,620 --> 00:22:26,630
And therefore, you have
these weight systems.

510
00:22:26,630 --> 00:22:29,000
They used to just do that.

511
00:22:29,000 --> 00:22:30,680
And more recently,
we hooked them

512
00:22:30,680 --> 00:22:33,770
up to a little
storage device that

513
00:22:33,770 --> 00:22:36,830
keeps track of the
weight and maybe Wi-Fi,

514
00:22:36,830 --> 00:22:39,410
so that each time it reaches
a station or the terminal,

515
00:22:39,410 --> 00:22:40,930
it sends the data off.

516
00:22:40,930 --> 00:22:47,240
And we might have a
somewhat imprecise

517
00:22:47,240 --> 00:22:50,600
idea of how many people
are in the car just based

518
00:22:50,600 --> 00:22:54,440
on an average
weight of a person.
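
As a rough sketch of that conversion: subtract the empty (tare) weight of the car from the measured weight and divide by an assumed average weight per person. The 75 kg default and the parameter names are assumptions for illustration only.

```python
def estimate_passengers(gross_kg, tare_kg, avg_person_kg=75.0):
    """Estimate the number of people in a car from its measured weight.

    gross_kg: weight reported by the car's load-weigh system
    tare_kg: weight of the empty car
    avg_person_kg: assumed average weight per person (hypothetical value)
    """
    payload = max(gross_kg - tare_kg, 0.0)  # never report a negative load
    return round(payload / avg_person_kg)
```

The estimate is only as good as the assumed average weight, which is why it is imprecise.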

519
00:22:54,440 --> 00:22:56,940
And these are traditionally not
available in real-time.

520
00:22:56,940 --> 00:22:57,710
[INAUDIBLE] you have questions?

521
00:22:57,710 --> 00:22:58,090
Yeah?

522
00:22:58,090 --> 00:22:59,410
AUDIENCE: You could
also just reconcile it

523
00:22:59,410 --> 00:23:00,760
with the other system, right?

524
00:23:00,760 --> 00:23:01,370
GABRIEL SANCHEZ-MARTINEZ:
Of course, yeah.

525
00:23:01,370 --> 00:23:02,250
AUDIENCE: So if you have--

526
00:23:02,250 --> 00:23:02,460
[INTERPOSING VOICES]

527
00:23:02,460 --> 00:23:02,945
GABRIEL SANCHEZ-MARTINEZ: Yeah.

528
00:23:02,945 --> 00:23:05,370
AUDIENCE: --people early
can transport to get on to.

529
00:23:05,370 --> 00:23:05,520
GABRIEL SANCHEZ-MARTINEZ: Yeah.

530
00:23:05,520 --> 00:23:06,250
AUDIENCE: [INAUDIBLE]

531
00:23:06,250 --> 00:23:07,420
GABRIEL SANCHEZ-MARTINEZ:
Yeah, definitely.

532
00:23:07,420 --> 00:23:07,920
Yeah.

533
00:23:07,920 --> 00:23:11,900
And that's cutting edge research
that's happening right now.

534
00:23:11,900 --> 00:23:14,570
How do you do data fusion
and merge different systems?

535
00:23:14,570 --> 00:23:15,650
They all have errors.

536
00:23:15,650 --> 00:23:17,100
And how do you
detect when one is

537
00:23:17,100 --> 00:23:18,350
more erroneous than the other?

538
00:23:18,350 --> 00:23:20,420
And how do you mix
these data sources

539
00:23:20,420 --> 00:23:23,847
to get the most precise,
not just loads, but paths

540
00:23:23,847 --> 00:23:25,430
within a network and
things like that.

541
00:23:25,430 --> 00:23:26,460
Yeah.

542
00:23:26,460 --> 00:23:31,039
So any questions on these three
very important automatic data

543
00:23:31,039 --> 00:23:31,830
collection systems?

544
00:23:31,830 --> 00:23:32,640
AUDIENCE: [INAUDIBLE]

545
00:23:32,640 --> 00:23:33,889
GABRIEL SANCHEZ-MARTINEZ: Yup.

546
00:23:33,889 --> 00:23:41,426
AUDIENCE: So if
there [INAUDIBLE]

547
00:23:41,426 --> 00:23:45,782
AVL, what kind of reason
can be [INAUDIBLE]?

548
00:23:45,782 --> 00:23:47,490
GABRIEL SANCHEZ-MARTINEZ:
So the question

549
00:23:47,490 --> 00:23:52,780
is, why might some of these
technologies produce errors?

550
00:23:52,780 --> 00:23:55,150
And in particular,
you're asking about AVL.

551
00:23:55,150 --> 00:23:58,090
So each of these has
a different behavior.

552
00:23:58,090 --> 00:24:01,030
And within each of these
categories of technologies,

553
00:24:01,030 --> 00:24:04,870
each vendor's system might have
specific things that happen.

554
00:24:04,870 --> 00:24:06,730
With AVL, the most
common thing is

555
00:24:06,730 --> 00:24:10,900
end-of-route problems--
detecting when a trip actually

556
00:24:10,900 --> 00:24:12,460
begins and ends.

557
00:24:12,460 --> 00:24:17,020
So AVL systems,
you have this GPS

558
00:24:17,020 --> 00:24:18,450
coming in every five seconds.

559
00:24:18,450 --> 00:24:20,950
Depending on your chip set, you
might get it more frequently

560
00:24:20,950 --> 00:24:21,450
than that.

561
00:24:21,450 --> 00:24:25,230
But you also actually
sometimes hook it to the doors.

562
00:24:25,230 --> 00:24:28,420
So if the door is opening, you
say, well, I must be at a stop.

563
00:24:28,420 --> 00:24:30,880
And therefore, let me
find which one is closest.
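
A minimal sketch of that door-open correction, assuming the AVL unit has a GPS fix at the moment the doors open and a list of stops as (stop_id, lat, lon) tuples. The data layout is hypothetical; the distance uses the standard haversine formula.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest_stop(lat, lon, stops):
    """Return the id of the stop closest to the door-open position."""
    closest = min(stops, key=lambda s: haversine_m(lat, lon, s[1], s[2]))
    return closest[0]
```

In practice you would likely restrict the candidates to stops on the vehicle's assigned route and direction.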

564
00:24:30,880 --> 00:24:32,540
So there are ways to correct it.

565
00:24:32,540 --> 00:24:34,750
But when you get to
the end of the route,

566
00:24:34,750 --> 00:24:37,430
it's not clear always--
have you finished your trip?

567
00:24:37,430 --> 00:24:41,290
Or rather, are you
starting your trip already?

568
00:24:41,290 --> 00:24:45,970
So maybe if the terminal is at
the same place on the trip--

569
00:24:45,970 --> 00:24:47,710
the previous trip
ends at the same place

570
00:24:47,710 --> 00:24:49,960
that the next trip
begins, there might

571
00:24:49,960 --> 00:24:53,950
be a time where the doors
open and close various times.

572
00:24:53,950 --> 00:24:56,140
And the trip isn't
ready to leave yet.

573
00:24:56,140 --> 00:24:58,810
And so you really have to
wait to see the bus leaving

574
00:24:58,810 --> 00:25:00,370
that terminal and moving.

575
00:25:00,370 --> 00:25:01,900
Sometimes, there
are false starts.

576
00:25:01,900 --> 00:25:06,040
So maybe another bus comes
along, and it needs that space.

577
00:25:06,040 --> 00:25:10,270
So the driver moves the
bus a few meters forward.

578
00:25:10,270 --> 00:25:13,880
And the system thinks
my trip has started.

579
00:25:13,880 --> 00:25:16,130
And then, when you're
looking at aggregate data,

580
00:25:16,130 --> 00:25:19,120
you're looking at, say, running
times at the trip level.

581
00:25:19,120 --> 00:25:21,940
You see these outliers
with very long times.

582
00:25:21,940 --> 00:25:23,500
And if you were to
plot them by stop,

583
00:25:23,500 --> 00:25:25,510
you see that the link
between the first stop

584
00:25:25,510 --> 00:25:29,360
and the second stop is
sometimes very high, 15 minutes.

585
00:25:29,360 --> 00:25:30,880
And so you can throw those out.

586
00:25:30,880 --> 00:25:33,850
Or you can do some interpolation
or imputation of data.
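
Here is one hedged way to sketch both options for the first-link running times: flag values far above the typical time, then either drop them or replace them with the median of the remaining observations. The threshold of three times the median is an arbitrary assumption, not a rule from the lecture.

```python
from statistics import median


def clean_first_link_times(times_s, factor=3.0):
    """Replace suspiciously long first-link running times (seconds).

    A time is treated as a false start if it exceeds factor * median.
    Flagged values are imputed with the median of the unflagged ones.
    """
    limit = factor * median(times_s)
    valid = [t for t in times_s if t <= limit]
    fill = median(valid)
    return [t if t <= limit else fill for t in times_s]
```

To throw the outliers out instead of imputing them, return the `valid` list.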

587
00:25:33,850 --> 00:25:36,880
Some systems that care
very much about that

588
00:25:36,880 --> 00:25:40,240
will purposely
place the terminal

589
00:25:40,240 --> 00:25:45,310
stops sufficiently far
apart to prevent that

590
00:25:45,310 --> 00:25:48,210
from happening because
it is a problem.

591
00:25:48,210 --> 00:25:52,050
And this data is crucial to
planning service and figuring

592
00:25:52,050 --> 00:25:54,610
out how much resource you're
going to put into each route.

593
00:25:54,610 --> 00:25:56,856
So yup.

594
00:25:56,856 --> 00:26:03,758
AUDIENCE: For tap cards,
[INAUDIBLE] and metros,

595
00:26:03,758 --> 00:26:07,702
some of them we have
to tap out to exit.

596
00:26:07,702 --> 00:26:09,903
It is because of
variable [INAUDIBLE].

597
00:26:09,903 --> 00:26:11,153
GABRIEL SANCHEZ-MARTINEZ: Yes.

598
00:26:11,153 --> 00:26:14,604
AUDIENCE: But in some systems,
it's still a flat fare.

599
00:26:14,604 --> 00:26:16,083
You still have to tap out.

600
00:26:16,083 --> 00:26:18,548
Is the reason behind that
mostly data collection?

601
00:26:18,548 --> 00:26:20,766
Or is there anything
[INAUDIBLE] you're

602
00:26:20,766 --> 00:26:22,695
going to still have to
tap out [INAUDIBLE]?

603
00:26:22,695 --> 00:26:24,170
GABRIEL SANCHEZ-MARTINEZ:
So yeah, no examples of it

604
00:26:24,170 --> 00:26:24,999
come to mind.

605
00:26:24,999 --> 00:26:25,790
You might know one.

606
00:26:25,790 --> 00:26:27,360
AUDIENCE: MARTA?

607
00:26:27,360 --> 00:26:29,460
GABRIEL SANCHEZ-MARTINEZ:
OK, I haven't visited.

608
00:26:29,460 --> 00:26:31,890
So yeah, data collection
might be a reason to do that.

609
00:26:31,890 --> 00:26:35,640
But I'll have to get back to
you on why MARTA did that.

610
00:26:35,640 --> 00:26:41,090
But yeah, most systems that
have controls in and out

611
00:26:41,090 --> 00:26:43,680
are for fare policy
reasons and not

612
00:26:43,680 --> 00:26:46,560
for data collection reasons.

613
00:26:46,560 --> 00:26:49,980
We're starting to see more
interest in data collection

614
00:26:49,980 --> 00:26:53,797
and in investing on
these technologies just

615
00:26:53,797 --> 00:26:54,630
for data collection.

616
00:26:54,630 --> 00:26:58,220
So maybe-- but I'll have to
check and get back to you.

617
00:26:58,220 --> 00:27:01,177
AUDIENCE: You mentioned some
systems separate their depots

618
00:27:01,177 --> 00:27:03,260
to not confuse the end
[? from the start point. ?]

619
00:27:03,260 --> 00:27:03,515
[INTERPOSING VOICES]

620
00:27:03,515 --> 00:27:04,760
GABRIEL SANCHEZ-MARTINEZ:
Their terminal stops, yeah.

621
00:27:04,760 --> 00:27:06,720
AUDIENCE: What are
some examples of those?

622
00:27:06,720 --> 00:27:10,510
GABRIEL SANCHEZ-MARTINEZ: TfL
will do that in London, yeah.

623
00:27:10,510 --> 00:27:11,960
Yeah, so they'll monitor this.

624
00:27:11,960 --> 00:27:17,110
And if they see that
this is occurring often,

625
00:27:17,110 --> 00:27:20,319
they will separate
the stops a bit.

626
00:27:20,319 --> 00:27:22,110
And the reason they do
that is because they

627
00:27:22,110 --> 00:27:26,190
have people whose job
it is to impute data

628
00:27:26,190 --> 00:27:27,420
when it's incorrect.

629
00:27:27,420 --> 00:27:30,330
So if they don't do that, and
the system is consistently

630
00:27:30,330 --> 00:27:32,170
producing bad data,
then that means

631
00:27:32,170 --> 00:27:35,850
they're going to have to spend
human resources on correcting

632
00:27:35,850 --> 00:27:37,050
that data.

633
00:27:37,050 --> 00:27:38,520
So at some point,
it's just easier

634
00:27:38,520 --> 00:27:40,350
to move the stop a little bit.

635
00:27:40,350 --> 00:27:42,427
It doesn't have to
be a long distance.

636
00:27:42,427 --> 00:27:43,135
AUDIENCE: Got it.

637
00:27:43,135 --> 00:27:45,260
GABRIEL SANCHEZ-MARTINEZ:
Just don't make them the same

638
00:27:45,260 --> 00:27:48,030
and make them far enough apart
that the geofences can

639
00:27:48,030 --> 00:27:51,180
be told apart from each other.

640
00:27:51,180 --> 00:27:51,680
Alright?

641
00:27:51,680 --> 00:27:54,382
AUDIENCE: Really small scale
data of the EZRide who I work

642
00:27:54,382 --> 00:27:57,922
for, actually you could
see real-time bus loads

643
00:27:57,922 --> 00:27:59,340
[INAUDIBLE]--

644
00:27:59,340 --> 00:28:02,349
GABRIEL SANCHEZ-MARTINEZ:
Oh, interesting.

645
00:28:02,349 --> 00:28:04,890
AUDIENCE: --which was actually
helpful if you're dispatching,

646
00:28:04,890 --> 00:28:07,950
and you know a bus is
getting through people on it.

647
00:28:07,950 --> 00:28:08,450
[INAUDIBLE]

648
00:28:08,450 --> 00:28:10,450
GABRIEL SANCHEZ-MARTINEZ:
Yeah, for real-time control.

649
00:28:10,450 --> 00:28:11,070
[INTERPOSING VOICES]

650
00:28:11,070 --> 00:28:12,778
AUDIENCE: But the
terminal at our station

651
00:28:12,778 --> 00:28:15,082
had a drop-off point
and a pick-up point.

652
00:28:15,082 --> 00:28:18,004
The drop-off point was
before layover [INAUDIBLE]

653
00:28:18,004 --> 00:28:21,570
was after for this exact
reason to make sure

654
00:28:21,570 --> 00:28:23,361
that it will go through
the drop-off point,

655
00:28:23,361 --> 00:28:25,009
reset, until people
get off of it.

656
00:28:25,009 --> 00:28:26,300
GABRIEL SANCHEZ-MARTINEZ: Yeah.

657
00:28:26,300 --> 00:28:27,192
Yeah, so it happens.

658
00:28:27,192 --> 00:28:28,025
[INTERPOSING VOICES]

659
00:28:28,025 --> 00:28:28,900
AUDIENCE: Definitely.

660
00:28:28,900 --> 00:28:31,179
[INAUDIBLE]

661
00:28:31,179 --> 00:28:33,262
GABRIEL SANCHEZ-MARTINEZ:
That sounds about right.

662
00:28:36,324 --> 00:28:37,740
OK, if there are
no more questions

663
00:28:37,740 --> 00:28:41,859
on the three very important
categories of automated data

664
00:28:41,859 --> 00:28:43,650
collection systems,
let's talk a little bit

665
00:28:43,650 --> 00:28:46,360
about the data collection
program design process.

666
00:28:46,360 --> 00:28:49,920
So this comes from before
automatic data collection.

667
00:28:49,920 --> 00:28:53,179
And nowadays, we think a
little bit less about this.

668
00:28:53,179 --> 00:28:54,220
But it's still important.

669
00:28:54,220 --> 00:28:59,010
So if you do need to
collect some data,

670
00:28:59,010 --> 00:29:01,500
there's a structure that you
can follow to do it properly

671
00:29:01,500 --> 00:29:03,660
and to make sure that you
collect data efficiently,

672
00:29:03,660 --> 00:29:06,624
so that you don't spend too much
resources on data collection

673
00:29:06,624 --> 00:29:08,790
and that you can answer
your policy or your planning

674
00:29:08,790 --> 00:29:09,840
questions.

675
00:29:09,840 --> 00:29:15,060
So based on your needs and
the properties of your agency,

676
00:29:15,060 --> 00:29:17,400
I say here, determine
property characteristics.

677
00:29:17,400 --> 00:29:18,630
That's a North American term.

678
00:29:18,630 --> 00:29:20,050
A property is an agency.

679
00:29:20,050 --> 00:29:23,259
So if you see that,
that's an agency.

680
00:29:23,259 --> 00:29:25,800
So based on the characteristics
of the service you're running

681
00:29:25,800 --> 00:29:28,811
and your data needs, you can
select some data collection

682
00:29:28,811 --> 00:29:29,310
technique.

683
00:29:29,310 --> 00:29:31,690
We'll get into what
some of these are.

684
00:29:31,690 --> 00:29:35,070
Then, you can develop
route-by-route sampling plans

685
00:29:35,070 --> 00:29:39,810
based on how variable
the data is in each case.

686
00:29:39,810 --> 00:29:41,900
And you can determine how
many checkers do I need.

687
00:29:41,900 --> 00:29:44,760
A checker is a person who
goes out and collects data.

688
00:29:44,760 --> 00:29:46,770
And then from that, the cost--

689
00:29:46,770 --> 00:29:47,790
so human resources.

690
00:29:47,790 --> 00:29:49,440
It's a planning exercise.

691
00:29:49,440 --> 00:29:52,740
And what we do usually is that
we conduct a baseline phase.

692
00:29:52,740 --> 00:29:57,280
So that's the first time
you go out and collect data.

693
00:29:57,280 --> 00:30:00,150
You don't know much
about what you're

694
00:30:00,150 --> 00:30:01,970
wanting to collect data on.

695
00:30:01,970 --> 00:30:06,870
So it might be OD
matrices, or loads,

696
00:30:06,870 --> 00:30:09,600
the people getting on and off.

697
00:30:09,600 --> 00:30:13,350
So you have to go out
and do a bigger effort.

698
00:30:13,350 --> 00:30:15,810
And that's called the
baseline phase effort.

699
00:30:15,810 --> 00:30:19,020
Once you've done that and you've
established some tendencies,

700
00:30:19,020 --> 00:30:22,080
you might want to monitor
that to see if it changes.

701
00:30:22,080 --> 00:30:25,530
So then, you do a lighter weight
data collection effort, where

702
00:30:25,530 --> 00:30:29,220
you go out and less frequently,
using fewer resources,

703
00:30:29,220 --> 00:30:31,980
you collect sometimes
the same thing.

704
00:30:31,980 --> 00:30:37,890
Or sometimes, you observe
something else that is related

705
00:30:37,890 --> 00:30:41,100
or can be correlated with
what you really want.

706
00:30:41,100 --> 00:30:44,340
And then based on a
relationship between the two,

707
00:30:44,340 --> 00:30:46,560
you can estimate
what you really want.

708
00:30:46,560 --> 00:30:50,090
So you can monitor
what you collected.

709
00:30:50,090 --> 00:30:51,870
And then, if you
detect that there's

710
00:30:51,870 --> 00:30:54,360
been a trend or a change, and
you need to investigate it

711
00:30:54,360 --> 00:30:57,420
further, you might go ahead
and repeat the baseline phase

712
00:30:57,420 --> 00:30:59,530
to increase your accuracy.

713
00:30:59,530 --> 00:31:04,080
So one of the catches of
this is that to determine

714
00:31:04,080 --> 00:31:06,600
sampling plans, to
determine required sample

715
00:31:06,600 --> 00:31:09,700
sizes to achieve some
confidence interval,

716
00:31:09,700 --> 00:31:12,120
you need to know how
variable your data is.

717
00:31:12,120 --> 00:31:15,030
And if you haven't collected
it yet, you don't know.

718
00:31:15,030 --> 00:31:18,350
So you might have some default
values that you resort to.

719
00:31:18,350 --> 00:31:20,800
And we'll get to that
later in this lecture.

720
00:31:20,800 --> 00:31:22,770
But you might also
do a pre-test, where

721
00:31:22,770 --> 00:31:24,270
you send some
people out, and you

722
00:31:24,270 --> 00:31:27,150
collect some data
to really start

723
00:31:27,150 --> 00:31:30,030
to get a sense of
how variable is it,

724
00:31:30,030 --> 00:31:35,090
and how big will my
sample requirements be,

725
00:31:35,090 --> 00:31:37,560
and how much will it
cost for me to do this.

726
00:31:37,560 --> 00:31:40,810
So this is the process
that you might follow.

727
00:31:40,810 --> 00:31:44,622
And there are different
data needs by the question

728
00:31:44,622 --> 00:31:45,830
that you're trying to answer.

729
00:31:45,830 --> 00:31:48,430
So one way of looking
at that is, are you

730
00:31:48,430 --> 00:31:51,130
collecting things that
are for specific routes,

731
00:31:51,130 --> 00:31:54,070
or for specific route
segments, or at the stop level?

732
00:31:54,070 --> 00:31:57,950
Or are you using more aggregate
system level data collection?

733
00:31:57,950 --> 00:32:00,100
Are your questions
more system level?

734
00:32:00,100 --> 00:32:02,920
So system-level things
are more about reporting,

735
00:32:02,920 --> 00:32:06,730
and they might be tied to
things like federal funding.

736
00:32:06,730 --> 00:32:09,610
Whereas route-level things
and stop-level things

737
00:32:09,610 --> 00:32:12,020
are more important for planning.

738
00:32:12,020 --> 00:32:14,860
So when we talk about route
and route segment level,

739
00:32:14,860 --> 00:32:17,350
we're looking at things like
loads at the peak load points

740
00:32:17,350 --> 00:32:18,580
or at some other key points.

741
00:32:18,580 --> 00:32:20,900
How many people are in the bus?

742
00:32:20,900 --> 00:32:23,980
Running times
by segment

743
00:32:23,980 --> 00:32:26,260
to do a schedule that
has time points

744
00:32:26,260 --> 00:32:30,010
or maybe end-to-end for
your operations plan.

745
00:32:30,010 --> 00:32:32,470
Schedule adherence-- are
these buses running on time?

746
00:32:32,470 --> 00:32:34,870
Or are my schedules
not realistic?

747
00:32:34,870 --> 00:32:37,120
Total boardings or
revenue, two things

748
00:32:37,120 --> 00:32:41,590
that are highly correlated--
so number of passenger trips.

749
00:32:41,590 --> 00:32:44,014
Boardings by fare
category-- so you might say,

750
00:32:44,014 --> 00:32:45,430
well, I want
boardings, but I want

751
00:32:45,430 --> 00:32:47,170
to know how many
seniors are using this,

752
00:32:47,170 --> 00:32:50,500
and how many students are using
this, and how many people are

753
00:32:50,500 --> 00:32:52,540
using monthly passes,
and how many people are

754
00:32:52,540 --> 00:32:55,750
using pay-per-ride.

755
00:32:55,750 --> 00:32:58,600
So you have different
fare categories.

756
00:32:58,600 --> 00:33:03,430
And you might want to
segregate the data by that.

757
00:33:03,430 --> 00:33:05,920
You might want passenger
boardings and alightings by stop.

758
00:33:05,920 --> 00:33:07,840
So that's what
APC would give you

759
00:33:07,840 --> 00:33:10,900
if you have an automated system.

760
00:33:10,900 --> 00:33:13,540
But you might also use a ride
checker, who sits on the bus

761
00:33:13,540 --> 00:33:16,660
and counts people
boarding and alighting.

762
00:33:16,660 --> 00:33:19,690
Transfer rates between routes--
maybe you're

763
00:33:19,690 --> 00:33:23,460
looking at changing
service so that people

764
00:33:23,460 --> 00:33:25,920
don't have to transfer.

765
00:33:25,920 --> 00:33:28,440
Passenger characteristics
and attitudes-- this usually

766
00:33:28,440 --> 00:33:31,020
requires some degree
of survey, where

767
00:33:31,020 --> 00:33:35,622
you ask people things,
passenger travel patterns.

768
00:33:35,622 --> 00:33:37,080
At the system level,
we have things

769
00:33:37,080 --> 00:33:39,840
like unlinked passenger
trips, passenger miles, linked

770
00:33:39,840 --> 00:33:40,817
passenger trips.

771
00:33:40,817 --> 00:33:42,150
This had the whole system level.

772
00:33:42,150 --> 00:33:45,990
So sometimes, you do route
level or route segment level

773
00:33:45,990 --> 00:33:47,400
analysis, and
then, you aggregate

774
00:33:47,400 --> 00:33:48,750
to get the system-level things.

775
00:33:48,750 --> 00:33:50,970
That's usually how you proceed.

776
00:33:50,970 --> 00:33:54,120
But the requirements in
terms of how many of these

777
00:33:54,120 --> 00:33:56,204
you have to sample
might be different.

778
00:33:56,204 --> 00:33:58,620
So if you want to achieve a
certain accuracy at the system

779
00:33:58,620 --> 00:34:01,260
level, you don't need
to achieve the accuracy

780
00:34:01,260 --> 00:34:04,830
for each of the routes
that are in that system

781
00:34:04,830 --> 00:34:07,740
because you might have--

782
00:34:07,740 --> 00:34:11,400
so if you want to
say 90% confidence

783
00:34:11,400 --> 00:34:15,810
in some system-level
data element,

784
00:34:15,810 --> 00:34:19,239
you might only need 80% or
70% at the route level.

785
00:34:19,239 --> 00:34:21,000
And once you bring
those altogether,

786
00:34:21,000 --> 00:34:23,159
you achieve the
90% that you need.

787
00:34:23,159 --> 00:34:27,840
So data inference, I talked
about how sometimes we

788
00:34:27,840 --> 00:34:33,280
can infer items if we don't
observe them directly.

789
00:34:33,280 --> 00:34:36,540
So from AFC-- AFC is the
automatic fare collection system--

790
00:34:36,540 --> 00:34:39,449
we have boardings because
people are tapping into the bus

791
00:34:39,449 --> 00:34:41,560
or tapping into
the subway system.

792
00:34:41,560 --> 00:34:44,909
And if we have APC, we
count people getting on.

793
00:34:44,909 --> 00:34:49,360
So we can look at total
number of boardings that way,

794
00:34:49,360 --> 00:34:50,670
if that makes sense.

795
00:34:50,670 --> 00:34:51,690
That's pretty direct.

796
00:34:51,690 --> 00:34:54,300
Sometimes, you want to correct
for errors in the APC system,

797
00:34:54,300 --> 00:34:57,271
or you might have things
like fare evasion affecting

798
00:34:57,271 --> 00:34:59,520
that number-- like it goes
from AFC to how many people

799
00:34:59,520 --> 00:35:00,720
were actually in that bus.

800
00:35:00,720 --> 00:35:02,100
How many people
actually boarded?

801
00:35:02,100 --> 00:35:05,160
So you might do a little
bit of manual surveys

802
00:35:05,160 --> 00:35:09,450
to check what that relationship
is and apply some correction.

803
00:35:09,450 --> 00:35:11,340
For passenger miles,
we need to know

804
00:35:11,340 --> 00:35:15,330
how many people are on the
bus between each stop pair.

805
00:35:15,330 --> 00:35:18,930
So AFC gives you boardings
and only boardings.

806
00:35:18,930 --> 00:35:20,820
APC gives you ons and offs.

807
00:35:20,820 --> 00:35:23,700
If every bus had APC, then you
could calculate passenger miles

808
00:35:23,700 --> 00:35:24,630
directly.

809
00:35:24,630 --> 00:35:28,710
But often, you have systems
where only a portion

810
00:35:28,710 --> 00:35:29,970
of the fleet has APC.

811
00:35:29,970 --> 00:35:33,240
So maybe 15% of your fleet
is equipped with APC.

812
00:35:33,240 --> 00:35:38,220
And from that, you get
the sample OD matrix.

813
00:35:38,220 --> 00:35:40,080
And you can use that
OD matrix to convert

814
00:35:40,080 --> 00:35:43,560
from boardings only to the
distribution and the ons

815
00:35:43,560 --> 00:35:45,810
and offs at all bus routes.

816
00:35:45,810 --> 00:35:47,760
And from that, you can
get passenger miles.

817
00:35:47,760 --> 00:35:50,280
Or you might just
use your buses that

818
00:35:50,280 --> 00:35:54,720
have APC, if that suffices
for your data collection need.

819
00:35:54,720 --> 00:35:59,070
Same thing with peak
point load-- similar idea.

820
00:35:59,070 --> 00:36:01,200
The AFC only measures boardings.

821
00:36:01,200 --> 00:36:03,940
So it doesn't give you the
peak point load automatically.

822
00:36:03,940 --> 00:36:05,670
But from APC, you could get it.

823
00:36:05,670 --> 00:36:09,390
And if you can establish a
relationship between boardings

824
00:36:09,390 --> 00:36:11,460
and the peak load
point, then you

825
00:36:11,460 --> 00:36:14,730
can use that model to
infer the peak load

826
00:36:14,730 --> 00:36:16,200
point from just boardings.
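
To make the inference idea concrete, here is a small sketch, assuming a set of APC-equipped trips for which both total boardings and the peak load were observed. It fits an ordinary least-squares line and applies it to trips where only boardings are known. A straight-line relationship is an assumption for illustration; the lecture does not specify the model form.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_peak_load(boardings, slope, intercept):
    """Infer the peak load of a trip from its boardings alone."""
    return slope * boardings + intercept
```

You would calibrate on the APC sample, then check the fit against a few manual counts before relying on it.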

827
00:36:16,200 --> 00:36:19,620
So this is a key thing to
be efficient about data

828
00:36:19,620 --> 00:36:20,580
collection.

829
00:36:20,580 --> 00:36:24,194
Any questions on this idea?

830
00:36:24,194 --> 00:36:25,176
Yup.

831
00:36:25,176 --> 00:36:27,140
AUDIENCE: So to get
passenger miles,

832
00:36:27,140 --> 00:36:29,104
you're also going
to have a GPS system

833
00:36:29,104 --> 00:36:31,068
as well to know the distance?

834
00:36:31,068 --> 00:36:33,277
Or are we just basically
[INAUDIBLE] this

835
00:36:33,277 --> 00:36:34,699
is the routing [INAUDIBLE]?

836
00:36:34,699 --> 00:36:35,990
GABRIEL SANCHEZ-MARTINEZ: Both.

837
00:36:35,990 --> 00:36:36,790
AUDIENCE: [INAUDIBLE]

838
00:36:36,790 --> 00:36:37,730
GABRIEL SANCHEZ-MARTINEZ:
Yeah, both.

839
00:36:37,730 --> 00:36:39,190
AUDIENCE: [INAUDIBLE]

840
00:36:39,190 --> 00:36:41,106
GABRIEL SANCHEZ-MARTINEZ:
What tends to happen

841
00:36:41,106 --> 00:36:44,770
is that the APC, it'll come in.

842
00:36:44,770 --> 00:36:47,720
And it'll say, at this stop,
this many people boarded.

843
00:36:47,720 --> 00:36:48,940
This many people are lighted.

844
00:36:48,940 --> 00:36:52,740
So you have other
layers in your database

845
00:36:52,740 --> 00:36:55,780
that say where the bus is
and what the distance

846
00:36:55,780 --> 00:37:00,450
is between stops at
the stop pair level.

847
00:37:00,450 --> 00:37:03,267
So you then essentially
know how many people

848
00:37:03,267 --> 00:37:05,350
are riding on each link
and how long that link is,

849
00:37:05,350 --> 00:37:06,391
and you multiply the two.

850
00:37:06,391 --> 00:37:09,000
So yeah, passenger miles.
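
The calculation just described can be sketched in a few lines. This is only an illustration under assumptions: the function name `passenger_miles`, the stop counts, and the link lengths are made up, and a real APC database would supply these values from its own tables.

```python
def passenger_miles(boardings, alightings, link_miles):
    """Sum (load on link) x (link length) over consecutive stop pairs.

    boardings[i], alightings[i]: APC counts at stop i (hypothetical data).
    link_miles[i]: distance from stop i to stop i + 1.
    """
    if len(boardings) != len(alightings) or len(link_miles) != len(boardings) - 1:
        raise ValueError("need one link length per consecutive stop pair")
    load = 0
    total = 0.0
    for i, miles in enumerate(link_miles):
        # Load leaving stop i is the running sum of ons minus offs.
        load += boardings[i] - alightings[i]
        total += load * miles
    return total


# Made-up trip with four stops and three links.
ons = [10, 5, 0, 0]
offs = [0, 3, 4, 8]
miles = [0.5, 1.0, 0.25]
example_pm = passenger_miles(ons, offs, miles)
```

Here the loads on the three links are 10, 12, and 8 passengers, so the trip total is 10 x 0.5 + 12 x 1.0 + 8 x 0.25 passenger miles.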

851
00:37:09,000 --> 00:37:10,640
Yeah, more questions.

852
00:37:10,640 --> 00:37:13,420
AUDIENCE: Yeah, for these
checks that are going on

853
00:37:13,420 --> 00:37:14,649
like the more manual checks--

854
00:37:14,649 --> 00:37:15,940
GABRIEL SANCHEZ-MARTINEZ: Yeah.

855
00:37:15,940 --> 00:37:17,100
AUDIENCE: --I know
often, there's

856
00:37:17,100 --> 00:37:18,900
derivation checkers who
are coming into a check.

857
00:37:18,900 --> 00:37:20,775
GABRIEL SANCHEZ-MARTINEZ:
That's right, yeah.

858
00:37:20,775 --> 00:37:23,480
AUDIENCE: Do they also use
that data to cross-reference

859
00:37:23,480 --> 00:37:25,370
the passenger counts?

860
00:37:25,370 --> 00:37:26,790
As in, [? this ?]
person gets on,

861
00:37:26,790 --> 00:37:28,939
and they check everyone's
voice to [INAUDIBLE] DFL.

862
00:37:28,939 --> 00:37:30,230
GABRIEL SANCHEZ-MARTINEZ: Yeah.

863
00:37:30,230 --> 00:37:32,970
AUDIENCE: They then know
exactly how they go on the bus.

864
00:37:32,970 --> 00:37:34,220
GABRIEL SANCHEZ-MARTINEZ: Yes.

865
00:37:34,220 --> 00:37:34,615
Yeah.

866
00:37:34,615 --> 00:37:35,800
AUDIENCE: Do they use that data?

867
00:37:35,800 --> 00:37:36,740
GABRIEL SANCHEZ-MARTINEZ:
Yeah, they can.

868
00:37:36,740 --> 00:37:39,350
In the APC, sometimes
there's reliability problems,

869
00:37:39,350 --> 00:37:41,200
especially when
vehicles are very

870
00:37:41,200 --> 00:37:43,090
full because
sometimes, people will

871
00:37:43,090 --> 00:37:44,530
block the sensor by the door.

872
00:37:47,230 --> 00:37:49,030
Actually, people like
to stand by the door

873
00:37:49,030 --> 00:37:50,821
all the time, even when
the bus isn't full.

874
00:37:50,821 --> 00:37:53,235
And that kind of affects APC.

875
00:37:53,235 --> 00:37:54,610
You might notice
this on the 1.

876
00:37:54,610 --> 00:37:55,600
If you take the 1--

877
00:37:55,600 --> 00:38:00,970
so yeah, you sometimes have a
little bit of a manual effort

878
00:38:00,970 --> 00:38:01,840
to figure out.

879
00:38:01,840 --> 00:38:03,730
Just learn about
your APC system,

880
00:38:03,730 --> 00:38:06,190
and what are the errors,
and when do you see them.

881
00:38:06,190 --> 00:38:09,430
It often happens that you
have more variation when

882
00:38:09,430 --> 00:38:10,540
you have very high loads.

883
00:38:10,540 --> 00:38:12,400
And that's when APC
is least accurate.

884
00:38:12,400 --> 00:38:15,880
So it all comes together.

885
00:38:15,880 --> 00:38:17,294
Yeah.

886
00:38:17,294 --> 00:38:18,210
Questions in the back?

887
00:38:18,210 --> 00:38:19,330
I think I saw a question.

888
00:38:19,330 --> 00:38:19,829
No?

889
00:38:19,829 --> 00:38:22,864
AUDIENCE: Yeah, I
noticed that in Chicago,

890
00:38:22,864 --> 00:38:26,720
when the bus would be crowded,
then people get off the bus.

891
00:38:26,720 --> 00:38:28,180
They let people off--

892
00:38:28,180 --> 00:38:28,660
GABRIEL SANCHEZ-MARTINEZ:
That's right.

893
00:38:28,660 --> 00:38:28,990
AUDIENCE: --and then back on.

894
00:38:28,990 --> 00:38:29,510
GABRIEL SANCHEZ-MARTINEZ: Yeah.

895
00:38:29,510 --> 00:38:30,010
Yeah.

896
00:38:30,010 --> 00:38:31,210
These double things.

897
00:38:31,210 --> 00:38:33,296
But somebody might be by
the door just blocking

898
00:38:33,296 --> 00:38:34,295
the two little sensors--

899
00:38:34,295 --> 00:38:34,990
[INTERPOSING VOICES]

900
00:38:34,990 --> 00:38:36,990
GABRIEL SANCHEZ-MARTINEZ:
--the two little eyes.

901
00:38:36,990 --> 00:38:40,710
And that's it, no records
of people getting on or off.

902
00:38:44,369 --> 00:38:46,660
So if you're doing a little
data collection, as I said,

903
00:38:46,660 --> 00:38:48,010
we use checkers.

904
00:38:48,010 --> 00:38:50,260
And actually, your
second assignment, you

905
00:38:50,260 --> 00:38:52,630
will be checkers of some kind.

906
00:38:52,630 --> 00:38:55,550
The typical checkers which you
won't be in this assignment

907
00:38:55,550 --> 00:38:57,940
are ride checkers
and point checkers.

908
00:38:57,940 --> 00:39:02,050
So a ride checker sits
in the vehicle and rides

909
00:39:02,050 --> 00:39:03,370
with the vehicle.

910
00:39:03,370 --> 00:39:07,540
And the typical thing that these
ride checkers are looking at

911
00:39:07,540 --> 00:39:10,380
is, how long did it take
to cover some distance?

912
00:39:10,380 --> 00:39:12,479
So what was the running
time for that trip?

913
00:39:12,479 --> 00:39:14,020
And also, people
getting on and off--

914
00:39:14,020 --> 00:39:16,090
so they act as APC essentially.

915
00:39:16,090 --> 00:39:18,140
And they act as AVL.

916
00:39:18,140 --> 00:39:20,770
So AVL and APC
together might replace

917
00:39:20,770 --> 00:39:23,170
most of the functionality
of a ride checker.

918
00:39:23,170 --> 00:39:26,980
Although a ride checker often
can conduct an onboard survey,

919
00:39:26,980 --> 00:39:30,250
asking passengers about
where are they going,

920
00:39:30,250 --> 00:39:33,590
or their trip purpose, or things
related to social demographics,

921
00:39:33,590 --> 00:39:38,530
which are qualitative and cannot
be collected with the sensors.

922
00:39:38,530 --> 00:39:40,900
Point checkers stand
outside of the vehicle.

923
00:39:40,900 --> 00:39:43,720
They stay at a specific
place, and they

924
00:39:43,720 --> 00:39:46,360
can look at headways
between buses--

925
00:39:46,360 --> 00:39:49,570
so how long did it take
between each bus to come by,

926
00:39:49,570 --> 00:39:52,120
and how loaded were these buses?

927
00:39:52,120 --> 00:39:55,540
So if you're interested
in the peak load point,

928
00:39:55,540 --> 00:39:57,400
and you know where the
peak load point is,

929
00:39:57,400 --> 00:40:01,612
and you just want
to observe, measure

930
00:40:01,612 --> 00:40:03,070
what are the loads
of the peak load

931
00:40:03,070 --> 00:40:05,530
point, then you can
just station a point

932
00:40:05,530 --> 00:40:06,880
checker at the peak load point.

933
00:40:06,880 --> 00:40:09,310
And if that person
is trained, they'll

934
00:40:09,310 --> 00:40:13,270
be able to more or less say how
many people are in the vehicle

935
00:40:13,270 --> 00:40:16,680
from looking at the vehicle.

936
00:40:16,680 --> 00:40:19,212
With automated data
collection systems--

937
00:40:19,212 --> 00:40:21,420
yeah, with a fare system,
we have passenger counts.

938
00:40:21,420 --> 00:40:23,410
We have transaction
data, which is very rich.

939
00:40:23,410 --> 00:40:25,440
It will tell you not
only that somebody

940
00:40:25,440 --> 00:40:27,930
is entering or exiting,
but also how much they're

941
00:40:27,930 --> 00:40:32,430
paying, sometimes information
about the fare product

942
00:40:32,430 --> 00:40:36,510
type, which might help you
infer if this person is

943
00:40:36,510 --> 00:40:39,630
a senior, or a student, or a
frequent user, an infrequent

944
00:40:39,630 --> 00:40:40,290
user--

945
00:40:40,290 --> 00:40:43,490
so many things that are
very useful for planning.

946
00:40:43,490 --> 00:40:46,060
And we'll get to play with some
of these later in the course.

947
00:40:46,060 --> 00:40:49,750
And then, there's Automatic
Passenger Counters, APC.

948
00:40:49,750 --> 00:40:54,880
So as more and more systems
switch to automatic data

949
00:40:54,880 --> 00:40:57,940
collection, we still use
some manual data collection,

950
00:40:57,940 --> 00:41:01,000
but not in the
traditional sense.

951
00:41:01,000 --> 00:41:02,920
Now, we reserve those
resources for things

952
00:41:02,920 --> 00:41:07,180
like surveys about social
demographics and other things.

953
00:41:07,180 --> 00:41:10,640
And we also carry out
web-based surveys,

954
00:41:10,640 --> 00:41:12,640
which would have some biases.

955
00:41:12,640 --> 00:41:16,260
But if people
registered their cards,

956
00:41:16,260 --> 00:41:18,010
and you have email
accounts, you can maybe

957
00:41:18,010 --> 00:41:21,730
send a mass email to everyone
and carry out surveys.

958
00:41:21,730 --> 00:41:23,110
The MBTA does that.

959
00:41:23,110 --> 00:41:25,090
Maybe some of you
are in the panel

960
00:41:25,090 --> 00:41:27,670
of people who are e-mailed
every now and then.

961
00:41:27,670 --> 00:41:29,370
Is anybody in that panel?

962
00:41:29,370 --> 00:41:30,350
No hands.

963
00:41:30,350 --> 00:41:31,410
I'm in that panel.

964
00:41:31,410 --> 00:41:35,006
But I know somebody must be.

965
00:41:35,006 --> 00:41:37,630
So yeah, they send an email, and
they ask about your last ride.

966
00:41:37,630 --> 00:41:39,370
And they say, where
did you start from?

967
00:41:39,370 --> 00:41:41,350
What were you doing
this trip for?

968
00:41:41,350 --> 00:41:43,900
How long did you have to walk?

969
00:41:43,900 --> 00:41:45,450
Are you happy with the system?

970
00:41:45,450 --> 00:41:47,130
Was your bus on time?

971
00:41:47,130 --> 00:41:48,130
Yeah, things like that--

972
00:41:48,130 --> 00:41:51,000
how satisfied are you?

973
00:41:51,000 --> 00:41:53,200
It's a survey with
qualitative questions

974
00:41:53,200 --> 00:41:55,450
that you couldn't
collect automatically.

975
00:41:55,450 --> 00:41:57,850
It's [INAUDIBLE] seeing
things about your experience

976
00:41:57,850 --> 00:42:03,370
outside of the bus, which
there are no sensors for.

977
00:42:03,370 --> 00:42:05,170
All right, sampling
strategies-- a bunch

978
00:42:05,170 --> 00:42:08,200
of different ones
and the simplest one

979
00:42:08,200 --> 00:42:12,200
is called simple random
sampling-- very, very simple.

980
00:42:12,200 --> 00:42:14,310
So when you have
simple random sampling,

981
00:42:14,310 --> 00:42:16,060
what happens is that
every trip, if you're

982
00:42:16,060 --> 00:42:18,850
looking at surveying trips,
for things like how many people

983
00:42:18,850 --> 00:42:19,930
boarded this trip--

984
00:42:19,930 --> 00:42:22,960
let's take that as an example.

985
00:42:22,960 --> 00:42:24,920
Then, if you're using
simple random sampling,

986
00:42:24,920 --> 00:42:27,490
every trip has equal likelihood
of being picked and being

987
00:42:27,490 --> 00:42:28,330
surveyed.

988
00:42:28,330 --> 00:42:33,610
So if you go through
your process,

989
00:42:33,610 --> 00:42:35,560
and you determine that
you need to observe 100

990
00:42:35,560 --> 00:42:38,550
trips to get an
average reliably.

991
00:42:38,550 --> 00:42:42,380
And you're going to use
that to plan something,

992
00:42:42,380 --> 00:42:44,450
then you need to
look at 100 trips.

993
00:42:44,450 --> 00:42:46,270
So if you use simple
random sampling,

994
00:42:46,270 --> 00:42:50,020
you take your schedule, and
you randomly pick 100 trips.

995
00:42:50,020 --> 00:42:51,190
And that's your sample.
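
As a rough sketch of that procedure, assuming the schedule is just a list of trip identifiers (the IDs, the schedule size of 1,200, and the helper name `simple_random_sample` are hypothetical):

```python
import random


def simple_random_sample(scheduled_trips, n, seed=None):
    """Pick n trips without replacement; every trip is equally likely."""
    rng = random.Random(seed)  # a seed only makes the draw repeatable
    return rng.sample(scheduled_trips, n)


# Hypothetical schedule of 1,200 trips; draw the 100 to be surveyed.
trips = [f"trip_{i:04d}" for i in range(1, 1201)]
sample = simple_random_sample(trips, 100, seed=42)
```

Each of the 100 selected trips can fall anywhere in the schedule, which is exactly what makes the field work expensive.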

996
00:42:51,190 --> 00:42:53,860
Those are the ones that you
send people out to collect data.

997
00:42:53,860 --> 00:42:56,450
Now, there's a little bit
of a problem with that.

998
00:42:56,450 --> 00:42:57,880
It's not the most
efficient method

999
00:42:57,880 --> 00:42:59,713
because if you're going
to send someone out,

1000
00:42:59,713 --> 00:43:03,340
and that person is going to be
active, and require some time

1001
00:43:03,340 --> 00:43:06,520
to get to the site and
some time to return, then

1002
00:43:06,520 --> 00:43:08,260
once they're out
there, you want them

1003
00:43:08,260 --> 00:43:10,240
to collect as much as they can.

1004
00:43:10,240 --> 00:43:12,100
So that's not simple
random sampling.

1005
00:43:12,100 --> 00:43:14,600
That's cluster sampling.

1006
00:43:14,600 --> 00:43:16,460
Before we get to that,
systematic sampling--

1007
00:43:16,460 --> 00:43:21,350
so typically, instead of
picking randomly, we say,

1008
00:43:21,350 --> 00:43:26,000
OK, we need to get
10% of the trips.

1009
00:43:26,000 --> 00:43:30,290
So let's just make it
such that we count.

1010
00:43:30,290 --> 00:43:33,680
And maybe it's every tenth
trip, we have to survey it.

1011
00:43:33,680 --> 00:43:35,780
So now, it's evenly spaced.

1012
00:43:35,780 --> 00:43:38,850
And this is useful
for some things.

1013
00:43:38,850 --> 00:43:41,180
One example is weekday,
picking the weekday

1014
00:43:41,180 --> 00:43:43,610
that you're going to survey on.

1015
00:43:43,610 --> 00:43:47,340
So the technique that is often
used is sample every six days.

1016
00:43:47,340 --> 00:43:49,581
Why would that be?

1017
00:43:49,581 --> 00:43:50,080
Yeah.

1018
00:43:50,080 --> 00:43:53,000
So if you do it every seven,
then you always have a Monday.

1019
00:43:53,000 --> 00:43:54,655
And that's going
to get some bias

1020
00:43:54,655 --> 00:43:57,120
if Mondays happen to
be low ridership days

1021
00:43:57,120 --> 00:43:58,420
or high ridership days.

1022
00:43:58,420 --> 00:44:01,630
So if you do every sixth
day over a year,

1023
00:44:01,630 --> 00:44:04,390
you have a good sample
of every weekday.

1024
00:44:04,390 --> 00:44:07,510
So that's an example
of systematic sampling.
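
A quick way to check the six-day argument is to count which days of the week a fixed interval lands on. This is a small sketch; the start date and the helper name `every_kth_day` are arbitrary choices for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def every_kth_day(start, n_days, k):
    """Dates sampled at a fixed interval of k days over n_days."""
    return [start + timedelta(days=i) for i in range(0, n_days, k)]


start = date(2023, 1, 2)  # arbitrary start date; it happens to be a Monday

# weekday(): 0 = Monday ... 6 = Sunday
six = Counter(d.weekday() for d in every_kth_day(start, 365, 6))
seven = Counter(d.weekday() for d in every_kth_day(start, 365, 7))
```

With an interval of six, all seven days of the week show up 8 or 9 times over the year. With an interval of seven, every sampled date is the same day of the week.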

1025
00:44:07,510 --> 00:44:11,830
But you still have
that issue of it

1026
00:44:11,830 --> 00:44:13,540
might not be the most efficient.

1027
00:44:13,540 --> 00:44:17,740
Cluster sampling, sometimes
it's more efficient

1028
00:44:17,740 --> 00:44:20,110
once you send out
a person to collect

1029
00:44:20,110 --> 00:44:22,690
data to do as much as possible.

1030
00:44:22,690 --> 00:44:24,760
And you survey a cluster.

1031
00:44:24,760 --> 00:44:28,510
So one example is, if
you're distributing surveys

1032
00:44:28,510 --> 00:44:31,180
to passengers, and you need
to distribute 100 surveys.

1033
00:44:31,180 --> 00:44:35,071
If you do a simple
random sample of 100,

1034
00:44:35,071 --> 00:44:37,570
then those people might be in
different parts of the system.

1035
00:44:37,570 --> 00:44:40,570
And one might be
the first person

1036
00:44:40,570 --> 00:44:43,000
you see getting off
at South Station.

1037
00:44:43,000 --> 00:44:44,920
And then another
one by me might be

1038
00:44:44,920 --> 00:44:48,700
the first person you see getting
off at the Kendall station.

1039
00:44:48,700 --> 00:44:50,270
So that's very inefficient.

1040
00:44:50,270 --> 00:44:53,070
So a cluster might be
everybody on board a bus,

1041
00:44:53,070 --> 00:44:55,690
and that will get a
bunch of people together.

1042
00:44:55,690 --> 00:44:59,470
However, it's not as efficient
statistically to do that.

1043
00:44:59,470 --> 00:45:01,930
So you can't just add
up to 100, and you're

1044
00:45:01,930 --> 00:45:07,360
done because there might be some
correlation within the people

1045
00:45:07,360 --> 00:45:09,670
riding that vehicle
that they will tend

1046
00:45:09,670 --> 00:45:12,310
to answer in a similar way.

1047
00:45:12,310 --> 00:45:14,410
So you might need to
increase your sample size

1048
00:45:14,410 --> 00:45:15,576
when you use this technique.

1049
00:45:15,576 --> 00:45:19,830
But still, you might have a
more efficient sampling plan.

1050
00:45:19,830 --> 00:45:21,290
Then, there is the
ratio estimation

1051
00:45:21,290 --> 00:45:22,520
and conversion factors.

1052
00:45:22,520 --> 00:45:24,560
We gave examples
of this already.

1053
00:45:24,560 --> 00:45:26,820
This is in the context
of baseline phase

1054
00:45:26,820 --> 00:45:28,770
and then monitoring phase.

1055
00:45:28,770 --> 00:45:31,930
So you start out with
a baseline phase.

1056
00:45:31,930 --> 00:45:33,790
And in the baseline
phase, you collect

1057
00:45:33,790 --> 00:45:36,640
the thing you really
want and something

1058
00:45:36,640 --> 00:45:40,480
that is very easily collected
with lower resources.

1059
00:45:40,480 --> 00:45:42,850
And you make a model
of the thing you really

1060
00:45:42,850 --> 00:45:45,910
want as a function
of the thing that

1061
00:45:45,910 --> 00:45:47,920
is cheap and easy to collect.

1062
00:45:47,920 --> 00:45:49,420
And then, on the
monitoring phase,

1063
00:45:49,420 --> 00:45:54,310
you only measure the thing that
is cheap, and easy, and quick.

1064
00:45:54,310 --> 00:45:57,890
And you then use the model to
estimate what you really want.

1065
00:45:57,890 --> 00:46:00,790
So converting AFC boardings
to passenger miles,

1066
00:46:00,790 --> 00:46:02,320
we gave an example of that.

1067
00:46:02,320 --> 00:46:04,090
Or converting
loads at checkpoints

1068
00:46:04,090 --> 00:46:05,840
to loads somewhere else.

1069
00:46:05,840 --> 00:46:07,840
So maybe you only measure
loads with a point

1070
00:46:07,840 --> 00:46:09,640
checker at the peak load point.

1071
00:46:09,640 --> 00:46:12,910
And you have some relationship
to convert those loads

1072
00:46:12,910 --> 00:46:18,702
to loads at other key transfer
stations as an example.
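
The baseline and monitoring phases can be sketched as a simple ratio estimator. This is only one possible form of the model (a single conversion factor), and all numbers and function names here are made up for illustration.

```python
def fit_ratio(cheap, expensive):
    """Baseline phase: conversion factor from paired observations."""
    return sum(expensive) / sum(cheap)


def estimate(cheap_total, ratio):
    """Monitoring phase: apply the factor to the cheap measurement alone."""
    return cheap_total * ratio


# Hypothetical baseline trips where both quantities were measured.
baseline_boardings = [40, 55, 30, 75]            # cheap: AFC boardings
baseline_pax_miles = [90.0, 150.0, 80.0, 180.0]  # costly: passenger miles

miles_per_boarding = fit_ratio(baseline_boardings, baseline_pax_miles)

# Later, only boardings are collected.
monitored_pax_miles = estimate(1200, miles_per_boarding)
```

In this made-up baseline the factor works out to 2.5 passenger miles per boarding, so 1,200 monitored boardings convert to an estimated 3,000 passenger miles.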

1073
00:46:18,702 --> 00:46:20,160
And then, the
stratified sampling--

1074
00:46:20,160 --> 00:46:23,970
so one of the things that
determines how big of a sample

1075
00:46:23,970 --> 00:46:25,830
you need is the
variability in the data

1076
00:46:25,830 --> 00:46:26,910
that you're collecting.

1077
00:46:26,910 --> 00:46:30,900
So correlation,
when you're looking

1078
00:46:30,900 --> 00:46:35,650
at a whole system with multiple
routes or multiple segments--

1079
00:46:35,650 --> 00:46:37,690
maybe when you
look at one route,

1080
00:46:37,690 --> 00:46:42,550
there's some variability
of running times.

1081
00:46:42,550 --> 00:46:44,770
But they have a central
tendency as well.

1082
00:46:44,770 --> 00:46:46,420
And when you've
got a second route,

1083
00:46:46,420 --> 00:46:48,392
you have also some
variability and

1084
00:46:48,392 --> 00:46:49,600
a different central tendency.

1085
00:46:49,600 --> 00:46:51,624
So if you bunch all
the data together,

1086
00:46:51,624 --> 00:46:54,040
some of the variability across
data points in our data set

1087
00:46:54,040 --> 00:46:56,980
is going to be the inherent
variability of each route.

1088
00:46:56,980 --> 00:46:59,920
And some of it will be
systematic-- the differences

1089
00:46:59,920 --> 00:47:01,390
between both routes.

1090
00:47:01,390 --> 00:47:03,340
So if you do a
simple random sample,

1091
00:47:03,340 --> 00:47:05,800
and you don't separate
the systematic variability

1092
00:47:05,800 --> 00:47:08,560
from the inherent
variability, then you're

1093
00:47:08,560 --> 00:47:10,650
going to get a
wider variability.

1094
00:47:10,650 --> 00:47:13,270
And you will require
a bigger sample size.

1095
00:47:13,270 --> 00:47:14,950
Stratified sampling
is an approach

1096
00:47:14,950 --> 00:47:18,790
where you determine sample sizes
for each of these separately.

1097
00:47:18,790 --> 00:47:21,790
And it's more efficient
if you do it well

1098
00:47:21,790 --> 00:47:25,270
because you eliminate
the need, or you at least

1099
00:47:25,270 --> 00:47:28,600
reduce the need, to
collect data for the sake

1100
00:47:28,600 --> 00:47:32,380
of the systematic differences
between different parts

1101
00:47:32,380 --> 00:47:35,090
of the system.

1102
00:47:35,090 --> 00:47:36,450
Any questions on these methods?

1103
00:47:39,614 --> 00:47:41,066
Yes.

1104
00:47:41,066 --> 00:47:42,518
AUDIENCE: [INAUDIBLE]

1105
00:47:45,034 --> 00:47:46,450
GABRIEL SANCHEZ-MARTINEZ:
Yeah, so

1106
00:47:46,450 --> 00:47:47,860
let's maybe pick
another example.

1107
00:47:55,330 --> 00:48:01,130
Let's say that you're looking
at the proportion of passengers

1108
00:48:01,130 --> 00:48:04,070
in a bus who are students.

1109
00:48:04,070 --> 00:48:05,660
And you're
distributing a survey.

1110
00:48:05,660 --> 00:48:11,800
And they tell you whether
they're students or not.

1111
00:48:11,800 --> 00:48:13,820
And you want this
for the whole system

1112
00:48:13,820 --> 00:48:16,830
or for at least a
group of routes.

1113
00:48:16,830 --> 00:48:19,900
And it tends to be that some
routes don't serve universities

1114
00:48:19,900 --> 00:48:20,900
and don't serve schools.

1115
00:48:20,900 --> 00:48:24,020
So they have a lower
proportion of people.

1116
00:48:24,020 --> 00:48:26,690
And then, some routes that
do go through universities,

1117
00:48:26,690 --> 00:48:28,860
and they have a higher
proportion of students.

1118
00:48:28,860 --> 00:48:33,290
So if you just want the
system-wide proportion

1119
00:48:33,290 --> 00:48:36,890
of people who are students, and
you join all these data points

1120
00:48:36,890 --> 00:48:39,320
together, there's going
to be a lot of variability

1121
00:48:39,320 --> 00:48:41,630
in what proportion that
is across every trip

1122
00:48:41,630 --> 00:48:44,930
that you survey, correct?

1123
00:48:44,930 --> 00:48:49,610
So in some sense,
it will indicate

1124
00:48:49,610 --> 00:48:51,290
that because of
that variability,

1125
00:48:51,290 --> 00:48:55,260
you're going to need a
higher sampling size.

1126
00:48:55,260 --> 00:48:57,830
You're going to have
to survey more trips

1127
00:48:57,830 --> 00:49:02,810
to get at your desired
accuracy level and tolerance.

1128
00:49:02,810 --> 00:49:06,080
But now, if you say no, I'm
going to split routes in two,

1129
00:49:06,080 --> 00:49:07,100
into two strata.

1130
00:49:07,100 --> 00:49:11,060
One is the routes that
serve the universities.

1131
00:49:11,060 --> 00:49:16,700
And these tend to have
around 50% proportion.

1132
00:49:16,700 --> 00:49:19,610
And then, there's the routes
that don't serve universities.

1133
00:49:19,610 --> 00:49:23,180
And these tend to have
proportions near 0.

1134
00:49:23,180 --> 00:49:27,260
So if you're near 0, you
might require a lower sample

1135
00:49:27,260 --> 00:49:28,700
size to cover those.

1136
00:49:28,700 --> 00:49:30,410
And you can just
very efficiently

1137
00:49:30,410 --> 00:49:32,480
cover most of your
bus routes that way.

1138
00:49:32,480 --> 00:49:35,030
And then, focus your
efforts on just the ones

1139
00:49:35,030 --> 00:49:37,160
that have higher proportion.

1140
00:49:37,160 --> 00:49:39,980
And you achieve your
system-level tolerance

1141
00:49:39,980 --> 00:49:44,990
requirements with much fewer,
with by far fewer resources

1142
00:49:44,990 --> 00:49:47,069
required to collect the data.

1143
00:49:47,069 --> 00:49:48,360
Does that answer your question?

1144
00:49:48,360 --> 00:49:48,942
Yeah.

1145
00:49:48,942 --> 00:49:50,298
AUDIENCE: [INAUDIBLE]

1146
00:49:52,260 --> 00:49:54,510
GABRIEL SANCHEZ-MARTINEZ:
So what I meant by inherent

1147
00:49:54,510 --> 00:49:57,600
is that within each bus
route or within each stratum,

1148
00:49:57,600 --> 00:49:59,130
there will be some variability.

1149
00:49:59,130 --> 00:50:02,130
Even within the trips that
are serving universities,

1150
00:50:02,130 --> 00:50:04,530
every trip might have
a different proportion.

1151
00:50:04,530 --> 00:50:07,120
So there's going to be a little
bit of variability in that.

1152
00:50:07,120 --> 00:50:10,600
But if you mix that with trips
that are not serving students,

1153
00:50:10,600 --> 00:50:12,992
then you pool all
that data together.

1154
00:50:12,992 --> 00:50:15,450
Then, it's going to look like
the variance of that data set

1155
00:50:15,450 --> 00:50:16,170
is much higher.
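
A small numeric sketch of that effect, using made-up per-trip student proportions for the two strata in the example:

```python
from statistics import pvariance

# Hypothetical share of students observed on each surveyed trip.
university = [0.45, 0.50, 0.55, 0.50]  # routes serving universities
other = [0.02, 0.00, 0.04, 0.02]       # routes that do not

# Inherent variability within each stratum.
within_university = pvariance(university)
within_other = pvariance(other)

# Pooling mixes in the systematic difference between the strata.
pooled = pvariance(university + other)
```

With these illustrative numbers, the pooled variance is dozens of times larger than the variance inside either stratum. Almost all of it comes from the gap between the two group means.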

1156
00:50:20,950 --> 00:50:23,200
All right, so we've
tossed these terms

1157
00:50:23,200 --> 00:50:25,430
around-- tolerance,
confidence level, accuracy.

1158
00:50:25,430 --> 00:50:27,996
So let's define
them more precisely.

1159
00:50:27,996 --> 00:50:29,620
Accuracy-- when we
talk about accuracy,

1160
00:50:29,620 --> 00:50:31,960
that has two dimensions.

1161
00:50:31,960 --> 00:50:36,070
So somebody might say, the
average boardings per trip

1162
00:50:36,070 --> 00:50:38,032
is 33.1.

1163
00:50:38,032 --> 00:50:39,490
And then, the
question that follows

1164
00:50:39,490 --> 00:50:42,070
is, do you mean exactly 33.1?

1165
00:50:42,070 --> 00:50:43,570
How certain are you of that?

1166
00:50:43,570 --> 00:50:45,100
And how accurate is that?

1167
00:50:45,100 --> 00:50:48,970
So when we talk about tolerance,
there's relative tolerance,

1168
00:50:48,970 --> 00:50:50,860
and there's absolute tolerance.

1169
00:50:50,860 --> 00:50:52,750
Relative tolerance
is expressed in terms

1170
00:50:52,750 --> 00:50:57,760
of a percent of the amount you
were collecting or a fraction.

1171
00:50:57,760 --> 00:51:01,660
So you might say mean
boardings per trip is 33.1,

1172
00:51:01,660 --> 00:51:03,170
plus or minus 10%.

1173
00:51:03,170 --> 00:51:05,710
And that's the 10% of 33.1.

1174
00:51:05,710 --> 00:51:07,876
That's why it's
relative tolerance.

1175
00:51:07,876 --> 00:51:09,250
Then, there's
absolute tolerance.

1176
00:51:09,250 --> 00:51:14,240
So mean boardings per trip
is 33.1, plus or minus 3.3.

1177
00:51:14,240 --> 00:51:17,630
Now, in this case, these
two are equivalent.

1178
00:51:17,630 --> 00:51:20,810
3.3 in absolute
terms is 10% of 33.1.

1179
00:51:20,810 --> 00:51:23,600
But this was expressed
in absolute terms,

1180
00:51:23,600 --> 00:51:25,820
and the previous one was
expressed in relative terms.

1181
00:51:28,766 --> 00:51:32,130
So don't always assume
that if you see a percent,

1182
00:51:32,130 --> 00:51:35,190
it's relative because if what
you're measuring is in itself

1183
00:51:35,190 --> 00:51:38,850
a percent, unless you're
using a percent of a percent,

1184
00:51:38,850 --> 00:51:39,930
then it's absolute.

1185
00:51:39,930 --> 00:51:41,940
So here's an example.

1186
00:51:41,940 --> 00:51:46,740
Mean percentage of students
is 23%, plus or minus 5%.

1187
00:51:46,740 --> 00:51:49,785
That's absolute because
it's 5%, not 5% of 23%.
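
The two ways of stating a tolerance convert into each other with one multiplication or division. A minimal sketch using the numbers from the examples above (the function names are only illustrative):

```python
import math


def to_absolute(estimate, relative_tolerance):
    """Relative tolerance (a fraction of the estimate) -> same units as the estimate."""
    return estimate * relative_tolerance


def to_relative(estimate, absolute_tolerance):
    """Absolute tolerance -> fraction of the estimate."""
    return absolute_tolerance / estimate


# 33.1 boardings per trip, plus or minus 10% relative.
boardings_abs = to_absolute(33.1, 0.10)  # about 3.3 boardings

# 23% students, plus or minus 5 percentage points absolute.
students_rel = to_relative(0.23, 0.05)   # roughly 22% of the estimate
```

The second case shows why the distinction matters: plus or minus 5 percentage points on a 23% estimate is a much looser statement than plus or minus 5% of 23%.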

1188
00:51:54,660 --> 00:51:57,480
First, we talked about,
is that exactly 33.1?

1189
00:51:57,480 --> 00:52:00,010
Or is it something
different from 33.1?

1190
00:52:00,010 --> 00:52:02,460
Then, the second question
is, how sure are you,

1191
00:52:02,460 --> 00:52:06,310
how confident are you
that the number you give,

1192
00:52:06,310 --> 00:52:12,320
plus or minus the tolerance
you give, is the right answer?

1193
00:52:12,320 --> 00:52:15,400
So now, you say
I'm 95% confident

1194
00:52:15,400 --> 00:52:18,355
that the mean boardings per
trip is 33.1, plus or minus 10%.

1195
00:52:18,355 --> 00:52:20,347
So now, you combine
the tolerance

1196
00:52:20,347 --> 00:52:21,430
with the confidence level.

1197
00:52:21,430 --> 00:52:24,170
And that's the full
expression of your accuracy.

1198
00:52:24,170 --> 00:52:27,740
And that's what you need when
we look at the data collection.

1199
00:52:27,740 --> 00:52:30,800
So you have two different
things that you could play with.

1200
00:52:30,800 --> 00:52:33,860
And what happens typically
is that you choose

1201
00:52:33,860 --> 00:52:35,210
a high confidence level--

1202
00:52:35,210 --> 00:52:38,150
90%, 95% are typical.

1203
00:52:38,150 --> 00:52:39,830
And then, you hold that fixed.

1204
00:52:39,830 --> 00:52:42,830
And you calculate what
level of accuracy you need.

1205
00:52:42,830 --> 00:52:45,020
Or rather, you decide
what level of accuracy

1206
00:52:45,020 --> 00:52:48,110
you need, depending on the
question you want to answer,

1207
00:52:48,110 --> 00:52:51,560
and the impact it could
have on the system.

1208
00:52:51,560 --> 00:52:54,350
So if you're looking to
[INAUDIBLE] something

1209
00:52:54,350 --> 00:53:01,850
that will have very significant
effects on the service plan

1210
00:53:01,850 --> 00:53:04,070
or maybe on investment
in the system,

1211
00:53:04,070 --> 00:53:07,430
then you might need
a higher accuracy.

1212
00:53:07,430 --> 00:53:10,597
But if you're collecting
data just for reporting,

1213
00:53:10,597 --> 00:53:11,930
maybe it doesn't matter as much.

1214
00:53:11,930 --> 00:53:15,830
And you don't need to spend as
much money on data collection.

1215
00:53:15,830 --> 00:53:20,540
So as an example here, the
National Transit Database--

1216
00:53:20,540 --> 00:53:23,150
NTD, we call it NTD--

1217
00:53:23,150 --> 00:53:26,150
for annual boardings and
passenger miles, it says,

1218
00:53:26,150 --> 00:53:27,740
you should collect
data to achieve

1219
00:53:27,740 --> 00:53:31,890
an accuracy of 10%, relative
tolerance at 95% confidence

1220
00:53:31,890 --> 00:53:33,250
level.

1221
00:53:33,250 --> 00:53:36,090
You need both.

1222
00:53:36,090 --> 00:53:38,340
So take home message about this.

1223
00:53:38,340 --> 00:53:40,630
The other thing,
the t distribution--

1224
00:53:40,630 --> 00:53:43,920
so this is a probability
distribution that

1225
00:53:43,920 --> 00:53:44,960
is bell-shaped.

1226
00:53:44,960 --> 00:53:47,490
It kind of looks like
the normal distribution.

1227
00:53:47,490 --> 00:53:49,440
And it approaches the
normal distribution

1228
00:53:49,440 --> 00:53:52,330
as the sample size
gets very large.

1229
00:53:52,330 --> 00:53:54,960
This is the distribution
that arises naturally

1230
00:53:54,960 --> 00:53:58,110
when you're estimating the
mean of a population that

1231
00:53:58,110 --> 00:54:01,950
is normally distributed with
unknown mean and variance

1232
00:54:01,950 --> 00:54:04,380
and some known sample size.

1233
00:54:04,380 --> 00:54:08,870
So to the right here,
we have your equations

1234
00:54:08,870 --> 00:54:11,990
that I'm sure you've seen
before for sample mean, sample

1235
00:54:11,990 --> 00:54:13,880
variance.

1236
00:54:13,880 --> 00:54:15,740
And I guess, what's
important to think

1237
00:54:15,740 --> 00:54:18,470
about is that the
distribution of what

1238
00:54:18,470 --> 00:54:20,220
you're collecting--
for example, you

1239
00:54:20,220 --> 00:54:23,630
might be collecting data on a
number of people boarding route

1240
00:54:23,630 --> 00:54:25,100
1.

1241
00:54:25,100 --> 00:54:29,390
So that might have
some distribution.

1242
00:54:29,390 --> 00:54:31,440
As you collect
more and more data,

1243
00:54:31,440 --> 00:54:36,350
so as you survey
more and more trips,

1244
00:54:36,350 --> 00:54:40,700
the distribution of how
many people board each trip

1245
00:54:40,700 --> 00:54:43,400
does not necessarily
have to be normal.

1246
00:54:43,400 --> 00:54:45,980
But it turns out from
the Central Limit Theorem

1247
00:54:45,980 --> 00:54:52,990
and other laws and properties
of statistics and probability

1248
00:54:52,990 --> 00:54:54,920
that the distribution
of the estimator--

1249
00:54:54,920 --> 00:54:58,570
so the distribution of the
mean that you calculate based

1250
00:54:58,570 --> 00:55:00,040
on that sample that
you collected--

1251
00:55:00,040 --> 00:55:03,380
is normally distributed as
the sample size increases.

1252
00:55:03,380 --> 00:55:06,402
So if you have a
lower sample size,

1253
00:55:06,402 --> 00:55:08,110
instead of using the
normal distribution,

1254
00:55:08,110 --> 00:55:10,650
use t distribution.

1255
00:55:10,650 --> 00:55:12,730
Sometimes, we call that
the Student's t

1256
00:55:12,730 --> 00:55:13,780
distribution.

1257
00:55:13,780 --> 00:55:20,440
And this distribution gets wider
as the variability increases

1258
00:55:20,440 --> 00:55:23,310
and as the sample
size gets smaller.

1259
00:55:23,310 --> 00:55:26,090
It has a property called
degrees of freedom,

1260
00:55:26,090 --> 00:55:28,360
which is sample size minus 1.
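
As a sketch of how the degrees of freedom enter a calculation, the following computes a 95% tolerance around a sample mean using t critical values. The critical values are standard table entries hard-coded for a few degrees of freedom, and the boardings data are made up.

```python
import math
from statistics import mean, stdev

# Two-sided 95% critical values of the t distribution, keyed by degrees of freedom.
T_975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 9: 2.262, 29: 2.045}
Z_975 = 1.960  # the normal value that t approaches as the sample grows


def mean_with_tolerance(observations):
    """Sample mean and 95% absolute tolerance (half-width) using t."""
    n = len(observations)
    dof = n - 1  # degrees of freedom = sample size minus 1
    standard_error = stdev(observations) / math.sqrt(n)
    return mean(observations), T_975[dof] * standard_error


# Hypothetical boardings observed on six trips.
boardings = [30, 35, 28, 40, 33, 32]
estimate, half_width = mean_with_tolerance(boardings)
```

With only six trips the t value is 2.571 rather than 1.960, so the stated tolerance is about 30% wider than a normal approximation would suggest.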

1261
00:55:28,360 --> 00:55:31,294
And you can see from this
chart right here when

1262
00:55:31,294 --> 00:55:32,710
you have degrees
of freedom equals

1263
00:55:32,710 --> 00:55:35,540
1, which means you
collected two data points,

1264
00:55:35,540 --> 00:55:38,870
it's wider than when
V approaches infinity.

1265
00:55:38,870 --> 00:55:42,610
And what you have in black here,
the thinnest and least variable

1266
00:55:42,610 --> 00:55:46,520
of these, is essentially
a normal distribution.

1267
00:55:46,520 --> 00:55:48,990
And this is the distribution
not of what you collected.

1268
00:55:48,990 --> 00:55:52,540
It's not the distribution
of the number

1269
00:55:52,540 --> 00:55:54,250
of people who boarded route 1.

1270
00:55:54,250 --> 00:55:58,690
It's the distribution of
the mean that you estimate.

1271
00:55:58,690 --> 00:55:59,860
AUDIENCE: [INAUDIBLE]

1272
00:55:59,860 --> 00:56:00,550
GABRIEL SANCHEZ-MARTINEZ:
Exactly, it's

1273
00:56:00,550 --> 00:56:02,420
a sampling distribution
of the mean.

1274
00:56:02,420 --> 00:56:05,980
And if you were to repeat that
experiment with the same number

1275
00:56:05,980 --> 00:56:08,680
of trips but a different
set of trips,

1276
00:56:08,680 --> 00:56:11,320
you might get a
slightly different mean.

1277
00:56:11,320 --> 00:56:14,110
So if you were to repeat
that many, many times,

1278
00:56:14,110 --> 00:56:19,145
the distribution of those means
would be shaped in this manner.

1279
00:56:19,145 --> 00:56:20,020
AUDIENCE: [INAUDIBLE]

1280
00:56:20,020 --> 00:56:22,519
GABRIEL SANCHEZ-MARTINEZ: Yeah,
well, Student's t distributed.

1281
00:56:22,519 --> 00:56:26,700
And as sample size increases to
infinity, normally distributed.

1282
00:56:26,700 --> 00:56:27,200
Harry.

1283
00:56:27,200 --> 00:56:32,009
AUDIENCE: So just for ν equals
5, I think you [INAUDIBLE].

1284
00:56:32,009 --> 00:56:33,175
GABRIEL SANCHEZ-MARTINEZ: 4.

1285
00:56:33,175 --> 00:56:33,620
AUDIENCE: 4.

1286
00:56:33,620 --> 00:56:34,460
GABRIEL SANCHEZ-MARTINEZ:
Sorry, 6.

1287
00:56:34,460 --> 00:56:35,308
6.

1288
00:56:35,308 --> 00:56:36,955
AUDIENCE: Approximately
5 [INAUDIBLE].

1289
00:56:36,955 --> 00:56:38,330
GABRIEL
SANCHEZ-MARTINEZ: Yes, 6.

1290
00:56:38,330 --> 00:56:39,280
Yeah.

1291
00:56:39,280 --> 00:56:41,950
I misspoke.

1292
00:56:41,950 --> 00:56:43,940
[INAUDIBLE]

1293
00:56:43,940 --> 00:56:47,340
AUDIENCE: When there's a sample
variance, sigma x squared

1294
00:56:47,340 --> 00:56:48,320
equals roughly.

1295
00:56:48,320 --> 00:56:50,250
Is that not supposed
to be an equals?

1296
00:56:50,250 --> 00:56:52,890
Is that not the way the
sample variance is defined?

1297
00:56:52,890 --> 00:56:56,104
Because I thought it's the--

1298
00:56:56,104 --> 00:56:57,645
GABRIEL SANCHEZ-MARTINEZ:
So-- --it's

1299
00:56:57,645 --> 00:56:59,103
below the variance
of distribution.

1300
00:56:59,103 --> 00:57:02,110
But that's roughly [INAUDIBLE].

1301
00:57:02,110 --> 00:57:05,380
AUDIENCE: Yeah, I guess
the issue is that you

1302
00:57:05,380 --> 00:57:10,060
don't know the true mean.

1303
00:57:10,060 --> 00:57:14,530
So you're using an estimate to
calculate the sample variance.

1304
00:57:14,530 --> 00:57:17,517
And therefore, it's almost,
almost the sample variance.

1305
00:57:17,517 --> 00:57:18,850
GABRIEL SANCHEZ-MARTINEZ: Right.

1306
00:57:18,850 --> 00:57:19,782
But I thought--

1307
00:57:19,782 --> 00:57:21,240
AUDIENCE: You're
using an estimator

1308
00:57:21,240 --> 00:57:23,340
to do the-- that's
what you have to do.

1309
00:57:23,340 --> 00:57:24,700
[INTERPOSING VOICES]

1310
00:57:24,700 --> 00:57:26,390
AUDIENCE: He's
incorporating the fact

1311
00:57:26,390 --> 00:57:29,350
we're dividing by n minus 1
rather than dividing by [INAUDIBLE].

1312
00:57:29,350 --> 00:57:31,225
GABRIEL SANCHEZ-MARTINEZ:
No, so n minus 1,

1313
00:57:31,225 --> 00:57:34,450
that has to do with the
degrees of freedom issue.

1314
00:57:34,450 --> 00:57:38,750
And that's to go from population
variance to sample variance.

1315
00:57:38,750 --> 00:57:40,630
But the other thing
that happens is

1316
00:57:40,630 --> 00:57:43,760
that if you're doing
the population,

1317
00:57:43,760 --> 00:57:46,200
then you know exactly
what your mean is.

1318
00:57:46,200 --> 00:57:47,295
It's exact, right?

1319
00:57:47,295 --> 00:57:47,920
AUDIENCE: Yeah.

1320
00:57:47,920 --> 00:57:49,920
GABRIEL SANCHEZ-MARTINEZ:
And then in that case,

1321
00:57:49,920 --> 00:57:52,500
you would know what the
exact variance is as well.

1322
00:57:52,500 --> 00:57:53,080
Yeah.

1323
00:57:53,080 --> 00:57:55,900
So the n minus 1
is just to remove

1324
00:57:55,900 --> 00:57:59,700
a bias that would arise from
collecting only a sample.

1325
00:57:59,700 --> 00:58:01,460
AUDIENCE: But here
for example, you

1326
00:58:01,460 --> 00:58:03,960
can say this is
equal to [INAUDIBLE].

1327
00:58:03,960 --> 00:58:04,960
GABRIEL SANCHEZ-MARTINEZ:
Yeah, yeah, yeah, yeah.

1328
00:58:04,960 --> 00:58:05,792
AUDIENCE: You're
working with the sample

1329
00:58:05,792 --> 00:58:07,890
to know it would be an
approximate [INAUDIBLE].

1330
00:58:07,890 --> 00:58:09,310
GABRIEL SANCHEZ-MARTINEZ:
Yeah, in practice equal to.

1331
00:58:09,310 --> 00:58:11,270
AUDIENCE: As your
sample distribution

1332
00:58:11,270 --> 00:58:13,377
increases, then obviously,
your sample increases--

1333
00:58:13,377 --> 00:58:14,210
[INTERPOSING VOICES]

1334
00:58:14,210 --> 00:58:14,920
GABRIEL SANCHEZ-MARTINEZ:
And therefore, this

1335
00:58:14,920 --> 00:58:16,030
becomes more and more accurate.

1336
00:58:16,030 --> 00:58:16,660
AUDIENCE: [INAUDIBLE]

1337
00:58:16,660 --> 00:58:17,080
GABRIEL SANCHEZ-MARTINEZ:
Exactly.

1338
00:58:17,080 --> 00:58:18,640
AUDIENCE: It should be
approaching more [INAUDIBLE].

1339
00:58:18,640 --> 00:58:19,180
GABRIEL SANCHEZ-MARTINEZ:
Yeah, so I

1340
00:58:19,180 --> 00:58:20,620
guess what's
important to realize

1341
00:58:20,620 --> 00:58:27,430
is that this is an estimate of
the population variance, which

1342
00:58:27,430 --> 00:58:30,260
in itself uses another estimate.

1343
00:58:30,260 --> 00:58:32,200
And I guess, that's
why that's there.

1344
00:58:32,200 --> 00:58:33,530
But it's a very small detail.

1345
00:58:33,530 --> 00:58:37,496
I didn't mean to distract you.
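
The n minus 1 point from this exchange can be checked numerically. The sketch below is only an illustration: the six boardings counts are made-up numbers, and the variable names are not from the lecture. It computes the variance both ways around the estimated mean and compares each with the matching function in Python's standard library.

```python
import statistics

# Hypothetical boardings counted on six surveyed trips (made-up numbers).
boardings = [31, 45, 38, 52, 40, 34]

n = len(boardings)
x_bar = sum(boardings) / n  # sample mean, which is itself an estimate
squared_dev = sum((x - x_bar) ** 2 for x in boardings)

# Dividing by n treats the data as if it were the whole population.
var_divide_by_n = squared_dev / n
# Dividing by n - 1 removes the bias that comes from using only a sample.
var_divide_by_n_minus_1 = squared_dev / (n - 1)

# The standard library follows the same two conventions.
matches_pvariance = abs(var_divide_by_n - statistics.pvariance(boardings)) < 1e-9
matches_variance = abs(var_divide_by_n_minus_1 - statistics.variance(boardings)) < 1e-9

print(var_divide_by_n, var_divide_by_n_minus_1, matches_pvariance, matches_variance)
```

The n minus 1 version is always the larger of the two, and the gap shrinks as the sample grows.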

1346
00:58:37,496 --> 00:58:42,046
AUDIENCE: So for the n, is it
the sum of all the different

1347
00:58:42,046 --> 00:58:43,730
samples of [INAUDIBLE]
or is it just--

1348
00:58:43,730 --> 00:58:44,070
[INTERPOSING VOICES]

1349
00:58:44,070 --> 00:58:45,630
GABRIEL SANCHEZ-MARTINEZ:
So you don't ever

1350
00:58:45,630 --> 00:58:47,160
repeat the experiment like this.

1351
00:58:47,160 --> 00:58:49,800
This is more of a
theoretical explanation

1352
00:58:49,800 --> 00:58:52,420
to why there is a
distribution to the mean,

1353
00:58:52,420 --> 00:58:53,850
even though you only have one.

1354
00:58:53,850 --> 00:58:55,380
You only have one mean, right?

1355
00:58:55,380 --> 00:58:57,340
Because you're going
to collect data.

1356
00:58:57,340 --> 00:58:59,100
And once you finish
collecting data,

1357
00:58:59,100 --> 00:59:01,680
you're going to calculate
the mean of all that data.

1358
00:59:01,680 --> 00:59:04,020
So you only have one mean.

1359
00:59:04,020 --> 00:59:08,160
If you were hypothetically
to repeat that experiment,

1360
00:59:08,160 --> 00:59:10,656
and you calculated separate
means for each one,

1361
00:59:10,656 --> 00:59:12,030
then you would
get a distribution

1362
00:59:12,030 --> 00:59:14,220
that would look like this.

1363
00:59:14,220 --> 00:59:17,100
In practice, you would just
increase your sample size

1364
00:59:17,100 --> 00:59:21,554
and still compute one mean,
which would be more accurate.

1365
00:59:21,554 --> 00:59:22,054
Yeah.
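
The hypothetical repeated experiment described here can be imitated with a small simulation. This is a sketch under stated assumptions: the "population" of boardings per trip is invented (an exponential shape with a mean of about 40, chosen only because it is clearly not normal), and the number of repetitions is arbitrary. It repeats the survey many times for several sample sizes and compares the spread of the resulting means with the standard deviation of the data divided by the square root of n.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical, deliberately non-normal population of boardings per trip.
population = [random.expovariate(1 / 40) for _ in range(100_000)]
sigma = statistics.pstdev(population)


def sample_mean(n):
    """Survey n randomly chosen trips and return their mean boardings."""
    return statistics.fmean(random.sample(population, n))


spreads = {}
for n in (5, 30, 120):
    # Repeat the whole survey 2,000 times and keep each survey's mean.
    means = [sample_mean(n) for _ in range(2_000)]
    spreads[n] = statistics.stdev(means)
    # Observed spread of the means next to sigma / sqrt(n).
    print(n, round(spreads[n], 2), round(sigma / n ** 0.5, 2))
```

In a real survey only one of those means is ever observed. The simulation only shows why that single mean still has a distribution, and why the distribution narrows as the sample size grows.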

1366
00:59:24,910 --> 00:59:27,010
OK, let's move on.

1367
00:59:27,010 --> 00:59:28,629
So tolerance and
confidence level--

1368
00:59:28,629 --> 00:59:29,920
so we have these distributions.

1369
00:59:29,920 --> 00:59:33,760
These are the distributions
of the statistic,

1370
00:59:33,760 --> 00:59:35,310
of the mean in this case.

1371
00:59:35,310 --> 00:59:36,640
They are bell-shaped.

1372
00:59:36,640 --> 00:59:41,977
As your sample size increases,
the degrees of freedom go up.

1373
00:59:41,977 --> 00:59:43,060
And your accuracy goes up.

1374
00:59:43,060 --> 00:59:45,790
And the variance of that
statistic distribution

1375
00:59:45,790 --> 00:59:46,370
decreases.

1376
00:59:46,370 --> 00:59:47,770
So it gets thinner.

1377
00:59:47,770 --> 00:59:52,240
So here in red, you have what a
distribution with a smaller

1378
00:59:52,240 --> 00:59:55,570
sample, and therefore, less
accuracy or less confidence

1379
00:59:55,570 --> 00:59:56,590
would look like.

1380
00:59:56,590 --> 00:59:59,260
And then as you increase
your sample size,

1381
00:59:59,260 --> 01:00:04,120
you see that it
becomes more peaky.

1382
01:00:04,120 --> 01:00:09,130
So when we talk about
tolerance, and let's

1383
01:00:09,130 --> 01:00:11,170
come back to the concept
of absolute tolerance

1384
01:00:11,170 --> 01:00:13,330
in particular, we're
talking about the distance

1385
01:00:13,330 --> 01:00:16,000
between the center of
that distribution, which

1386
01:00:16,000 --> 01:00:20,020
is a symmetrical
distribution, and some limit.

1387
01:00:20,020 --> 01:00:24,460
So we're saying, if you have
a tolerance of plus/minus 10,

1388
01:00:24,460 --> 01:00:28,390
then you're going to
measure 10, say 10 boardings,

1389
01:00:28,390 --> 01:00:32,270
from the center to the right
and from the center to the left.

1390
01:00:32,270 --> 01:00:35,590
And that's your
absolute tolerance.

1391
01:00:35,590 --> 01:00:38,410
So when you calculate
absolute tolerance,

1392
01:00:38,410 --> 01:00:40,750
you can express that
tolerance as a function

1393
01:00:40,750 --> 01:00:46,210
of the variance, or
the standard deviation,

1394
01:00:46,210 --> 01:00:48,790
rather, of your mean.

1395
01:00:48,790 --> 01:00:52,750
So instead of saying 10,
you could say 2 times

1396
01:00:52,750 --> 01:00:57,010
the standard deviation of that
distribution using the equation

1397
01:00:57,010 --> 01:00:58,496
that we just calculated.

1398
01:00:58,496 --> 01:00:59,620
And that's very convenient.

1399
01:00:59,620 --> 01:01:02,100
Why would we do that?

1400
01:01:02,100 --> 01:01:04,170
Why would I want to
complicate things that way?

1401
01:01:07,068 --> 01:01:09,075
AUDIENCE: [? Outside ?]
[? of ?] a cumulative

1402
01:01:09,075 --> 01:01:10,950
GABRIEL SANCHEZ-MARTINEZ:
No, I mean, there's

1403
01:01:10,950 --> 01:01:12,690
a mathematical convenience here.

1404
01:01:12,690 --> 01:01:15,420
What is this a function of?

1405
01:01:15,420 --> 01:01:18,510
It's a function of
the standard deviation

1406
01:01:18,510 --> 01:01:22,490
of the thing you were collecting
and your sample size, right?

1407
01:01:22,490 --> 01:01:23,670
And what do we want to do?

1408
01:01:23,670 --> 01:01:25,470
We want to determine
how many things we

1409
01:01:25,470 --> 01:01:26,670
need to collect, right?

1410
01:01:26,670 --> 01:01:27,570
So here we go--

1411
01:01:27,570 --> 01:01:28,650
we have n.

1412
01:01:28,650 --> 01:01:32,280
And now we can solve for
n, we have the sample size

1413
01:01:32,280 --> 01:01:34,380
that we require for
a given tolerance.

1414
01:01:34,380 --> 01:01:37,560
So we're going to decide
what the tolerance is

1415
01:01:37,560 --> 01:01:41,100
and calculate sample size, a
minimum required sample size.

1416
01:01:41,100 --> 01:01:44,030
You can always
collect more data.

1417
01:01:44,030 --> 01:01:44,810
All right.

1418
01:01:44,810 --> 01:01:46,790
So again, to review,
this is the same equation

1419
01:01:46,790 --> 01:01:48,200
I had in the last slide.

1420
01:01:48,200 --> 01:01:51,020
You have absolute tolerance.

1421
01:01:51,020 --> 01:01:54,740
You can express
that as a multiplier

1422
01:01:54,740 --> 01:01:59,270
times the standard
deviation of the mean.

1423
01:01:59,270 --> 01:02:02,330
And then you solve for n,
and you get this equation

1424
01:02:02,330 --> 01:02:03,260
right here.

1425
01:02:03,260 --> 01:02:06,170
t is your tolerance
and you can--

1426
01:02:06,170 --> 01:02:09,980
oh, sorry. t is the number
of standard deviations

1427
01:02:09,980 --> 01:02:11,690
from the mean.

1428
01:02:11,690 --> 01:02:14,600
d is your tolerance,
which you choose.

1429
01:02:14,600 --> 01:02:17,090
And this is something
that you know, or collect,

1430
01:02:17,090 --> 01:02:18,410
or approximate.

1431
01:02:18,410 --> 01:02:20,210
So these are all given.

1432
01:02:20,210 --> 01:02:21,510
Where does t come from?

1433
01:02:21,510 --> 01:02:24,050
Well, we said that
we're going to use the t

1434
01:02:24,050 --> 01:02:24,890
distribution, right?

1435
01:02:24,890 --> 01:02:28,220
So the t distribution
has a table--

1436
01:02:28,220 --> 01:02:30,230
or it has a certain
shape, rather.

1437
01:02:30,230 --> 01:02:32,870
And using Excel or
looking up at some table,

1438
01:02:32,870 --> 01:02:38,030
you can figure out
what t is for two times

1439
01:02:38,030 --> 01:02:40,890
the standard deviation
from the center.

1440
01:02:40,890 --> 01:02:43,820
So you can just plug it
in from Excel or from--

1441
01:02:43,820 --> 01:02:46,040
it's a property of the
distribution, essentially.

1442
01:02:46,040 --> 01:02:48,890
Once you pick a confidence
interval, you know t.

1443
01:02:48,890 --> 01:02:51,470
If you want to go to 95,
it's a certain value.

1444
01:02:51,470 --> 01:02:54,320
If you want to go to 90,
it's a different value.
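
The absolute-tolerance calculation described above can be sketched in a few lines. The assumptions here are mine, not the lecture's: the function names are hypothetical, the multiplier is taken from the normal distribution (the large-sample case) rather than from a t table, and the standard deviation of 40 boardings is an invented input.

```python
import math
from statistics import NormalDist


def multiplier(confidence):
    """Two-sided large-sample multiplier, about 1.96 for 95% confidence."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def sample_size_absolute(std_dev, tolerance, confidence=0.95):
    """Smallest n for which t * std_dev / sqrt(n) is within the tolerance."""
    t = multiplier(confidence)
    return math.ceil((t * std_dev / tolerance) ** 2)


# Hypothetical inputs: standard deviation of 40 boardings per trip,
# absolute tolerance of plus/minus 10 boardings.
n_95 = sample_size_absolute(std_dev=40, tolerance=10, confidence=0.95)
n_90 = sample_size_absolute(std_dev=40, tolerance=10, confidence=0.90)
print(n_95, n_90)
```

Lowering the confidence level lowers the multiplier, so the 90% case needs fewer trips than the 95% case for the same tolerance.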

1445
01:02:54,320 --> 01:02:55,460
OK.

1446
01:02:55,460 --> 01:02:57,050
When we look at
relative tolerance,

1447
01:02:57,050 --> 01:03:00,740
relative tolerance is
just absolute tolerance

1448
01:03:00,740 --> 01:03:03,890
divided by the mean that
you are collecting, correct?

1449
01:03:03,890 --> 01:03:06,890
Because instead of saying
plus or minus 10 boardings,

1450
01:03:06,890 --> 01:03:09,380
we're saying plus or
minus 5% of the mean.

1451
01:03:09,380 --> 01:03:13,730
So we just take absolute
tolerance and divide by x bar,

1452
01:03:13,730 --> 01:03:17,240
the sampling mean,
the sample mean.

1453
01:03:17,240 --> 01:03:19,040
And we solve for n again.

1454
01:03:19,040 --> 01:03:23,690
So what we have now, it looks
very similar to the equation

1455
01:03:23,690 --> 01:03:24,570
right here.

1456
01:03:24,570 --> 01:03:27,660
But now we have the mean
in the denominator.

1457
01:03:27,660 --> 01:03:30,860
OK, this quantity,
standard deviation

1458
01:03:30,860 --> 01:03:34,040
divided by mean, sample
standard deviation divided

1459
01:03:34,040 --> 01:03:38,200
by sample mean, is called
the coefficient of variation.
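
Written out, the step from absolute to relative tolerance looks roughly like this. The symbols t, d, n, and x bar follow the lecture. Writing s for the sample standard deviation, d_r for the relative tolerance, and cv for the coefficient of variation is my own notation, so treat it as an assumption about what the slide shows.

```latex
% Absolute tolerance: d is t standard deviations of the mean.
d = t \, \frac{s}{\sqrt{n}}
\quad\Longrightarrow\quad
n = \left( \frac{t \, s}{d} \right)^{2}

% Relative tolerance: divide the absolute tolerance by the sample mean.
d_r = \frac{d}{\bar{x}}
\quad\Longrightarrow\quad
n = \left( \frac{t \, s}{d_r \, \bar{x}} \right)^{2}
  = \left( \frac{t \cdot \mathrm{cv}}{d_r} \right)^{2},
\qquad \mathrm{cv} = \frac{s}{\bar{x}}
```

The standard deviation and the mean now appear only as their ratio, so a single assumed value for the coefficient of variation is enough to size the sample.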

1460
01:03:38,200 --> 01:03:40,930
And there's a
convenience to this.

1461
01:03:40,930 --> 01:03:44,192
And there's actually
a reason why

1462
01:03:44,192 --> 01:03:45,900
sometimes relative
tolerance is preferred

1463
01:03:45,900 --> 01:03:46,820
to absolute tolerance.

1464
01:03:46,820 --> 01:03:48,361
It's because of
this, because there's

1465
01:03:48,361 --> 01:03:52,620
a mathematically convenient
characteristic or property

1466
01:03:52,620 --> 01:03:53,760
coming out of this--

1467
01:03:53,760 --> 01:03:57,270
that you don't need to know
the standard deviation of what

1468
01:03:57,270 --> 01:04:00,300
you're collecting to figure
out your sample size.

1469
01:04:00,300 --> 01:04:02,310
We're kind of running
in circles here, right?

1470
01:04:02,310 --> 01:04:04,101
We're saying that to
determine sample size,

1471
01:04:04,101 --> 01:04:05,829
you need to know the
standard deviation.

1472
01:04:05,829 --> 01:04:07,120
Well, I haven't collected data.

1473
01:04:07,120 --> 01:04:09,009
So I don't know how
variable the data is.

1474
01:04:09,009 --> 01:04:09,800
So that's an issue.

1475
01:04:09,800 --> 01:04:11,820
Now I have to
estimate what that is.

1476
01:04:11,820 --> 01:04:15,510
It tends to happen that the
coefficient of variation

1477
01:04:15,510 --> 01:04:19,230
is a more stable property
than the variation in itself,

1478
01:04:19,230 --> 01:04:22,410
than the variance or the
standard deviation itself.

1479
01:04:22,410 --> 01:04:26,670
So you're more
likely to get away

1480
01:04:26,670 --> 01:04:29,640
with using default values for
the coefficient of variation

1481
01:04:29,640 --> 01:04:34,260
than you are with assuming a
specific standard deviation.

1482
01:04:34,260 --> 01:04:37,480
AUDIENCE: It should be noted
that it's unitless, coefficient

1483
01:04:37,480 --> 01:04:38,410
of variation.

1484
01:04:38,410 --> 01:04:40,326
GABRIEL SANCHEZ-MARTINEZ:
Yes, it is unitless.

1485
01:04:40,326 --> 01:04:41,980
Thank you.

1486
01:04:41,980 --> 01:04:42,480
OK.

1487
01:04:42,480 --> 01:04:45,210
So what happens is that relative
tolerances are typically

1488
01:04:45,210 --> 01:04:46,080
used for averages.

1489
01:04:46,080 --> 01:04:47,310
So here's an example--

1490
01:04:47,310 --> 01:04:51,760
you measured 5720
boardings plus minus 5%.

1491
01:04:51,760 --> 01:04:54,494
So if you were to get
the absolute equivalent

1492
01:04:54,494 --> 01:04:55,910
of the absolute
tolerance of that,

1493
01:04:55,910 --> 01:04:58,890
that would be 5% of 5720.

1494
01:04:58,890 --> 01:05:01,180
That would be 286 passengers.

1495
01:05:01,180 --> 01:05:03,600
That's a weird thing to report.

1496
01:05:03,600 --> 01:05:06,000
5% is more
understandable, right?

1497
01:05:06,000 --> 01:05:07,630
And it kind of makes more sense.

1498
01:05:07,630 --> 01:05:11,700
So that's what we want
naturally, anyway.

1499
01:05:11,700 --> 01:05:14,360
So as I said, the
coefficient of variation

1500
01:05:14,360 --> 01:05:17,310
is typically easier to guess
than the mean and the variance

1501
01:05:17,310 --> 01:05:18,690
separately.

1502
01:05:18,690 --> 01:05:20,820
So we use that.

1503
01:05:20,820 --> 01:05:23,070
Here's an example using
the t distribution,

1504
01:05:23,070 --> 01:05:26,070
where the sample
is not large enough

1505
01:05:26,070 --> 01:05:30,280
to assume a normal distribution.

1506
01:05:30,280 --> 01:05:33,550
So we say, let's have a relative
tolerance of plus minus 5%,

1507
01:05:33,550 --> 01:05:36,120
a confidence level of
95%, and a coefficient

1508
01:05:36,120 --> 01:05:37,650
of variation of 0.3.

1509
01:05:37,650 --> 01:05:39,660
So we start out
assuming large sample,

1510
01:05:39,660 --> 01:05:42,210
and therefore degrees
of freedom is infinity.

1511
01:05:42,210 --> 01:05:44,140
We can use the
normal distribution.

1512
01:05:44,140 --> 01:05:46,860
If we look at the
normal distribution,

1513
01:05:46,860 --> 01:05:52,920
with plus minus 5%, confidence
level 95%, the t is 1.96.

1514
01:05:52,920 --> 01:05:57,030
So we look that up on a table,
or we use Excel norm dist,

1515
01:05:57,030 --> 01:05:58,440
or-- yeah.

1516
01:05:58,440 --> 01:06:02,110
t dist for t and
norm dist for normal.

1517
01:06:02,110 --> 01:06:04,860
We got 1.96.

1518
01:06:04,860 --> 01:06:06,870
We plug in the
relative tolerance,

1519
01:06:06,870 --> 01:06:08,366
the 0.3-- we get 140.

1520
01:06:08,366 --> 01:06:11,460
140 is not quite
infinity, right?

1521
01:06:11,460 --> 01:06:14,190
So if we look at 140
as a sample size,

1522
01:06:14,190 --> 01:06:16,980
that would imply that the
degrees of freedom is 139.

1523
01:06:16,980 --> 01:06:19,410
Now we go back and
look at the t dist,

1524
01:06:19,410 --> 01:06:23,730
and we change 1.96 to the
value from the t distribution

1525
01:06:23,730 --> 01:06:26,680
for those degrees of freedom.

1526
01:06:26,680 --> 01:06:28,710
And we get 140.73.

1527
01:06:28,710 --> 01:06:32,010
So you're sort of seeing
that you were almost right.

1528
01:06:32,010 --> 01:06:35,160
140 is very large.

1529
01:06:35,160 --> 01:06:37,380
In practice, you would
just round up a little bit

1530
01:06:37,380 --> 01:06:40,800
and get a nice round number, and
you would even play with this

1531
01:06:40,800 --> 01:06:43,860
once you're looking at planning
who you're going to send out

1532
01:06:43,860 --> 01:06:45,780
and how many hours
you're going to collect.

1533
01:06:45,780 --> 01:06:48,974
You want to get at
least 141, but if you're

1534
01:06:48,974 --> 01:06:51,390
going to have people in units
of eight hours, for example,

1535
01:06:51,390 --> 01:06:54,560
or units of four hours, then you
might as well finish the batch

1536
01:06:54,560 --> 01:06:56,250
for four hours, the last one.

1537
01:06:56,250 --> 01:07:00,500
Maybe you'll get
150, 160 from that.
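
The two-pass calculation in this example can be retraced with a short script. This is a sketch, not the lecture's spreadsheet: the function name is hypothetical, the first multiplier comes from Python's normal distribution, and the second multiplier (1.977) is an assumed table value for roughly 139 degrees of freedom rather than something computed here. With these inputs the first pass lands in the high 130s and the second near 140.7.

```python
import math
from statistics import NormalDist


def sample_size_relative(cv, rel_tolerance, t):
    """n = (t * cv / d)^2 for a relative tolerance d, left unrounded."""
    return (t * cv / rel_tolerance) ** 2


cv, rel_tolerance = 0.3, 0.05

# Pass 1: assume a large sample, so use the normal multiplier (about 1.96).
t_normal = NormalDist().inv_cdf(0.975)
n_first = sample_size_relative(cv, rel_tolerance, t_normal)

# Pass 2: swap in a t value for roughly 139 degrees of freedom.
# 1.977 is an assumed table lookup, not computed by this script.
t_table = 1.977
n_second = sample_size_relative(cv, rel_tolerance, t_table)

# Round up, since a fraction of a trip cannot be surveyed.
trips_needed = math.ceil(n_second)
print(round(n_first, 1), round(n_second, 1), trips_needed)
```

The second pass barely moves the answer, which is the point made above: at this sample size the normal approximation was already nearly right.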

1538
01:07:00,500 --> 01:07:02,740
Here's an example
of that equation

1539
01:07:02,740 --> 01:07:08,260
with different assumptions
of confidence and tolerance.

1540
01:07:08,260 --> 01:07:11,320
And so we're using
90% confidence,

1541
01:07:11,320 --> 01:07:15,410
and we're assuming a
certain sample size here.

1542
01:07:15,410 --> 01:07:19,000
So you can see that, as the
tolerance decreases, which

1543
01:07:19,000 --> 01:07:22,150
means that you require
a greater accuracy

1544
01:07:22,150 --> 01:07:25,060
for different
coefficients of variation,

1545
01:07:25,060 --> 01:07:26,670
the sample size can
get really large.

1546
01:07:26,670 --> 01:07:29,200
So if your data is
not very variable,

1547
01:07:29,200 --> 01:07:31,160
then you can sample
just a few trips.

1548
01:07:31,160 --> 01:07:33,490
And you know because
they don't vary

1549
01:07:33,490 --> 01:07:35,440
that much what the mean is.

1550
01:07:35,440 --> 01:07:37,540
But if there's a lot of
variability across trips,

1551
01:07:37,540 --> 01:07:38,440
then you need more.

1552
01:07:38,440 --> 01:07:43,630
So that's what you see as you
go down the rows on this table.

1553
01:07:43,630 --> 01:07:44,860
Here we have tolerance.

1554
01:07:44,860 --> 01:07:51,850
If you only have to be 50%
accurate, plus minus 50%,

1555
01:07:51,850 --> 01:07:54,160
then you don't have to
collect that much data.

1556
01:07:54,160 --> 01:07:56,410
If you want to be
more precise, and you

1557
01:07:56,410 --> 01:08:00,860
want to say plus minus 5%, then
you need a bigger sample size,

1558
01:08:00,860 --> 01:08:01,870
right?

1559
01:08:01,870 --> 01:08:03,720
OK.
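
A table of this kind can be regenerated from the same equation. The sketch below makes several assumptions: the rows and columns are values I picked, not the ones on the slide, and the multiplier is the large-sample normal value for 90% confidence rather than a t value for a particular sample size, so the numbers will not match the slide exactly.

```python
import math
from statistics import NormalDist

# Two-sided 90% confidence, large-sample multiplier (about 1.645).
t_90 = NormalDist().inv_cdf(0.95)

cvs = [0.1, 0.3, 0.5, 1.0]             # assumed rows: coefficient of variation
tolerances = [0.50, 0.20, 0.10, 0.05]  # assumed columns: relative tolerance

# One row per coefficient of variation, one entry per tolerance.
table = {
    cv: [math.ceil((t_90 * cv / d) ** 2) for d in tolerances]
    for cv in cvs
}

print("cv    " + "".join(f"{d:>8.0%}" for d in tolerances))
for cv, row in table.items():
    print(f"{cv:<6}" + "".join(f"{n:>8}" for n in row))
```

Reading across a row shows the cost of a tighter tolerance. Reading down a column shows the cost of more variable data.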

1560
01:08:03,720 --> 01:08:05,940
Proportions-- and the
homework, actually,

1561
01:08:05,940 --> 01:08:08,871
is based on proportions,
so this is important.

1562
01:08:08,871 --> 01:08:10,620
Consider something, a
group of passengers,

1563
01:08:10,620 --> 01:08:13,740
to estimate the proportion of
passengers who are students.

1564
01:08:13,740 --> 01:08:16,109
So from probability,
when you are

1565
01:08:16,109 --> 01:08:17,880
looking at an event
that can either

1566
01:08:17,880 --> 01:08:20,830
be 0 or 1, or black or white--

1567
01:08:20,830 --> 01:08:24,540
in this case, students
or non-students--

1568
01:08:24,540 --> 01:08:27,240
there's a certain probability
that that person is a student,

1569
01:08:27,240 --> 01:08:27,739
right?

1570
01:08:27,739 --> 01:08:29,850
And what you want to
estimate is that probability

1571
01:08:29,850 --> 01:08:31,920
or, in other words, what
percent of the things

1572
01:08:31,920 --> 01:08:34,290
you observe are students.

1573
01:08:37,020 --> 01:08:40,229
So from the properties of
the Bernoulli distribution,

1574
01:08:40,229 --> 01:08:43,200
the variance is p
times 1 minus p.

1575
01:08:43,200 --> 01:08:47,160
So if everybody is a student,
or nobody is a student,

1576
01:08:47,160 --> 01:08:49,800
either way there's no
variability, right?

1577
01:08:49,800 --> 01:08:55,319
So you would have 1 times 1
minus 1, 1 times 0, 0-- no

1578
01:08:55,319 --> 01:08:56,430
variability.

1579
01:08:56,430 --> 01:08:59,609
The peak variability,
the highest variance

1580
01:08:59,609 --> 01:09:02,910
of this distribution, is
when 50% of your people

1581
01:09:02,910 --> 01:09:08,340
are students, so 0.5
times 1 minus 0.5, 0.25.

1582
01:09:08,340 --> 01:09:10,859
That's the highest variance, OK?

1583
01:09:10,859 --> 01:09:12,792
So the tolerance is
typically specified

1584
01:09:12,792 --> 01:09:15,000
in absolute terms when you're
estimating proportions,

1585
01:09:15,000 --> 01:09:18,420
because the proportion
is in itself a percent.

1586
01:09:18,420 --> 01:09:22,470
So you use absolute tolerance.

1587
01:09:22,470 --> 01:09:28,859
And you just substitute,
essentially, this variance.

1588
01:09:28,859 --> 01:09:31,710
You put in the variance of
the Bernoulli distribution,

1589
01:09:31,710 --> 01:09:33,300
which is p times 1 minus p.

1590
01:09:33,300 --> 01:09:36,205
And that's how you get the
sampling equation, sample size

1591
01:09:36,205 --> 01:09:37,080
requirement equation.
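
Substituting the Bernoulli variance into the sample size equation can be sketched as follows. The function name and the list of trial proportions are hypothetical. The multiplier of 1.96 assumes 95% confidence and a large sample, and the plus/minus 5% absolute tolerance is just an example value.

```python
import math


def sample_size_proportion(p, tolerance, t=1.96):
    """n = t^2 * p * (1 - p) / d^2, where d is an absolute tolerance."""
    return math.ceil(t ** 2 * p * (1 - p) / tolerance ** 2)


# Variance p * (1 - p) and required sample size for a range of proportions.
for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    print(p, round(p * (1 - p), 4), sample_size_proportion(p, tolerance=0.05))
```

The required sample peaks at p = 0.5, where the variance is 0.25, and falls off symmetrically toward either extreme.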

1592
01:09:41,950 --> 01:09:43,180
Here's a problem.

1593
01:09:43,180 --> 01:09:47,899
We don't know in advance what
the proportion will be, right?

1594
01:09:47,899 --> 01:09:50,415
And we need that to know how
many people we need to survey

1595
01:09:50,415 --> 01:09:52,540
to figure out-- or how many
trips we need to survey

1596
01:09:52,540 --> 01:09:53,649
to figure out--

1597
01:09:53,649 --> 01:09:55,066
sorry, how many
students we need--

1598
01:09:55,066 --> 01:09:56,440
how many riders
we need to survey

1599
01:09:56,440 --> 01:09:58,760
to figure out what the average
number of students are.

1600
01:09:58,760 --> 01:09:59,963
OK, so--

1601
01:09:59,963 --> 01:10:03,414
AUDIENCE: And it's also
a [INAUDIBLE] p times 1

1602
01:10:03,414 --> 01:10:05,248
minus p [INAUDIBLE] is
a constrained number.

1603
01:10:05,248 --> 01:10:07,455
GABRIEL SANCHEZ-MARTINEZ:
It is a constrained number,

1604
01:10:07,455 --> 01:10:09,417
and that's exactly
where we're going.

1605
01:10:09,417 --> 01:10:11,750
So we use something called
absolute equivalent tolerance

1606
01:10:11,750 --> 01:10:13,340
instead of absolute tolerance.

1607
01:10:13,340 --> 01:10:15,950
We assume that p is 0.5--

1608
01:10:15,950 --> 01:10:18,090
that's where the variance is at its maximum.

1609
01:10:18,090 --> 01:10:20,750
So let's go ahead with
a worst case scenario.

1610
01:10:20,750 --> 01:10:22,830
And then what happens
with p itself?

1611
01:10:22,830 --> 01:10:27,260
Well, if your percent
is high, then you

1612
01:10:27,260 --> 01:10:29,960
can tolerate a
bigger number, right?

1613
01:10:29,960 --> 01:10:35,600
So if it's 32%, you're
probably OK with plus minus 5%.

1614
01:10:35,600 --> 01:10:39,320
If your average were
1.2%, plus minus 5%

1615
01:10:39,320 --> 01:10:40,970
is not that good, right?

1616
01:10:40,970 --> 01:10:42,170
You need a higher--

1617
01:10:42,170 --> 01:10:46,220
you need a much stricter,
tighter confidence

1618
01:10:46,220 --> 01:10:47,700
interval for that.

1619
01:10:47,700 --> 01:10:51,259
So probably not good to do
plus minus 5% in that case.

1620
01:10:51,259 --> 01:10:53,550
AUDIENCE: [? Well, do ?]
[? you mean ?] you have a plus

1621
01:10:53,550 --> 01:10:55,730
minus 5% absolute percentage?

1622
01:10:55,730 --> 01:10:56,040
GABRIEL SANCHEZ-MARTINEZ:
Absolute, yeah.

1623
01:10:56,040 --> 01:10:57,530
AUDIENCE: And you'd be
going negative [INAUDIBLE]

1624
01:10:57,530 --> 01:10:57,800
GABRIEL SANCHEZ-MARTINEZ:
Negative,

1625
01:10:57,800 --> 01:11:00,590
which is possible but
difficult to interpret.

1626
01:11:00,590 --> 01:11:04,320
AUDIENCE: Sorry, so this isn't
actually 32% plus or minus 5%

1627
01:11:04,320 --> 01:11:06,020
of 32 [INAUDIBLE]

1628
01:11:06,020 --> 01:11:06,610
GABRIEL SANCHEZ-MARTINEZ:
It is not-- yeah,

1629
01:11:06,610 --> 01:11:09,110
it's absolute tolerance, not
relative tolerance, right.

1630
01:11:09,110 --> 01:11:12,500
So what's convenient about this
is that these two factors work

1631
01:11:12,500 --> 01:11:13,490
in opposite directions.

1632
01:11:13,490 --> 01:11:19,490
So as you get bigger, as the
proportion gets closer to 50%,

1633
01:11:19,490 --> 01:11:20,690
the variance increases.

1634
01:11:20,690 --> 01:11:23,150
So oh, well, we need
a bigger sample.

1635
01:11:23,150 --> 01:11:26,630
But your tolerance
increases as well,

1636
01:11:26,630 --> 01:11:28,670
so you don't need
as big of a sample.

1637
01:11:28,670 --> 01:11:30,050
And so it's convenient.

1638
01:11:30,050 --> 01:11:33,500
And the practical solution
is assume p is 0.5

1639
01:11:33,500 --> 01:11:37,070
and work in terms of absolute
equivalent tolerance.

1640
01:11:37,070 --> 01:11:40,160
So you pick a tolerance
under the assumption

1641
01:11:40,160 --> 01:11:43,670
that our proportion is 50%.

1642
01:11:43,670 --> 01:11:46,170
And here's what happens.

1643
01:11:46,170 --> 01:11:48,800
Yeah, if the expected
proportion is 50%,

1644
01:11:48,800 --> 01:11:51,890
and you say plus minus 5
percent, what you would get

1645
01:11:51,890 --> 01:11:56,520
is this 5%, if it
turns out that p is 50%.

1646
01:11:56,520 --> 01:12:01,400
But if it were more toward the
extremes, like 5% or 95%,

1647
01:12:01,400 --> 01:12:04,970
what you would actually achieve
from having planned the survey,

1648
01:12:04,970 --> 01:12:07,580
assuming 50%, is 2.2--

1649
01:12:07,580 --> 01:12:11,690
so much better,
much more acceptable

1650
01:12:11,690 --> 01:12:14,840
to say 5% plus
minus 2.2%, right?

1651
01:12:14,840 --> 01:12:16,370
So it works out.

1652
01:12:16,370 --> 01:12:20,030
And there's a
convenient equation

1653
01:12:20,030 --> 01:12:22,640
if you assume a very large
sample, or large enough sample,

1654
01:12:22,640 --> 01:12:26,570
and you pick 95%, 0.25,
which is the variance

1655
01:12:26,570 --> 01:12:30,320
times the normal
distribution t squared

1656
01:12:30,320 --> 01:12:32,280
is 0.96, which is almost 1.

1657
01:12:32,280 --> 01:12:33,530
So then you get this equation.

1658
01:12:33,530 --> 01:12:35,654
You take 1, you divide
it by the tolerance

1659
01:12:35,654 --> 01:12:37,820
that you want, your equivalent
tolerance, and that's

1660
01:12:37,820 --> 01:12:38,940
your sample size.

1661
01:12:38,940 --> 01:12:42,950
So it doesn't depend on anything
about the data in itself.

1662
01:12:42,950 --> 01:12:46,490
You just say if I want, on
whatever I'm collecting,

1663
01:12:46,490 --> 01:12:48,270
whatever proportion
I'm collecting,

1664
01:12:48,270 --> 01:12:51,620
a 5% absolute
equivalent tolerance,

1665
01:12:51,620 --> 01:12:57,190
then I need 400
surveys to be answered.
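
The shortcut and the tolerance actually achieved can both be checked with a few lines. The function names are hypothetical. The achieved-tolerance helper uses a multiplier of 2, which is my assumption: it is the value implied by rounding 0.96 up to 1 in the shortcut.

```python
import math


def surveys_needed(equivalent_tolerance):
    """Shortcut n = 1 / d^2, which treats t^2 * 0.25 (about 0.96) as 1."""
    # Round first so floating-point noise cannot push the result up by one.
    return math.ceil(round(1 / equivalent_tolerance ** 2, 6))


def achieved_tolerance(p, n, t=2.0):
    """Absolute tolerance obtained if the true proportion turns out to be p."""
    return t * math.sqrt(p * (1 - p) / n)


n = surveys_needed(0.05)
at_half = achieved_tolerance(0.50, n)     # the planning assumption
at_extreme = achieved_tolerance(0.05, n)  # a proportion near the extreme
print(n, round(at_half, 4), round(at_extreme, 4))
```

Planning for p = 0.5 therefore gives the stated plus/minus 5% in the worst case and a tighter interval, a little over 2%, if the proportion turns out to be near 5% or 95%.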

1666
01:12:57,190 --> 01:12:58,296
Yeah?

1667
01:12:58,296 --> 01:13:03,152
AUDIENCE: So this
assumes a random--

1668
01:13:03,152 --> 01:13:05,110
GABRIEL SANCHEZ-MARTINEZ:
Simple random sample.

1669
01:13:05,110 --> 01:13:05,750
AUDIENCE: [INAUDIBLE]

1670
01:13:05,750 --> 01:13:07,190
GABRIEL SANCHEZ-MARTINEZ:
Yes, a simple random sample.

1671
01:13:07,190 --> 01:13:08,648
So you would increase
these numbers

1672
01:13:08,648 --> 01:13:11,050
if you are using
cluster sampling

1673
01:13:11,050 --> 01:13:13,330
to account for correlation.

1674
01:13:13,330 --> 01:13:16,830
You would have to increase
them if you're giving people

1675
01:13:16,830 --> 01:13:19,330
a survey, and not all of them
answer the survey, because you

1676
01:13:19,330 --> 01:13:22,240
need 400 surveys answered.

1677
01:13:22,240 --> 01:13:24,610
So if only half of the
people answer the survey,

1678
01:13:24,610 --> 01:13:27,010
then you need to
distribute 800 surveys.
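
The response-rate adjustment is simple enough to write as a helper. This is a sketch with hypothetical names. The cluster factor parameter is an assumption about how a clustering multiplier might be folded into the same calculation, and it defaults to 1 (a simple random sample).

```python
import math


def surveys_to_distribute(answered_needed, response_rate, cluster_factor=1.0):
    """Scale the required number of answered surveys up for non-response
    and, optionally, for a clustering multiplier."""
    return math.ceil(answered_needed * cluster_factor / response_rate)


# 400 answered surveys needed, with half of the recipients responding.
simple_random = surveys_to_distribute(400, response_rate=0.5)

# Same requirement with an assumed clustering multiplier of 4.
clustered = surveys_to_distribute(400, response_rate=0.5, cluster_factor=4)
print(simple_random, clustered)
```

Treating the two adjustments as independent multipliers is a simplification. It is meant only to show how quickly the fieldwork grows.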

1679
01:13:27,010 --> 01:13:28,962
AUDIENCE: Do you
recommend calculating also

1680
01:13:28,962 --> 01:13:32,070
the standard error after
this so that [INAUDIBLE]

1681
01:13:32,070 --> 01:13:32,570
make sure?

1682
01:13:32,570 --> 01:13:33,530
GABRIEL SANCHEZ-MARTINEZ:
Absolutely, yeah.

1683
01:13:33,530 --> 01:13:35,810
You want to go back and
check with the standard error

1684
01:13:35,810 --> 01:13:38,330
and what your confidence
interval is and see

1685
01:13:38,330 --> 01:13:39,830
if you meet it or
if you need to add

1686
01:13:39,830 --> 01:13:41,570
a few days of data collection.

1687
01:13:41,570 --> 01:13:42,360
AUDIENCE: Right.

1688
01:13:42,360 --> 01:13:43,651
GABRIEL SANCHEZ-MARTINEZ: Yeah.

1689
01:13:43,651 --> 01:13:48,740
OK, so with proportions, you
need a very large sample size

1690
01:13:48,740 --> 01:13:51,590
to estimate a proportion
if you want accuracy.

1691
01:13:51,590 --> 01:13:54,800
If you say absolute
equivalent tolerance of 4%,

1692
01:13:54,800 --> 01:13:56,900
then you need 600.
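
As a rough sketch of where a number like that comes from, the standard sample size formula for a proportion can be written in a few lines of Python. This assumes a simple random sample, the worst-case proportion p = 0.5, and a 95% confidence level (z = 1.96). These defaults are assumptions for illustration, and the values on the slide may rest on different ones.

```python
import math

def proportion_sample_size(tolerance, p=0.5, z=1.96):
    """Sample size so that z * sqrt(p * (1 - p) / n) <= tolerance.

    p=0.5 (worst case) and z=1.96 (95% confidence) are assumed defaults.
    """
    if not 0.0 < tolerance < 1.0:
        raise ValueError("tolerance must be between 0 and 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return math.ceil(z ** 2 * p * (1.0 - p) / tolerance ** 2)

# Tighter absolute tolerances require larger samples.
for tol in (0.10, 0.05, 0.04):
    print(tol, proportion_sample_size(tol))
```

Under these assumptions a 4% absolute tolerance gives about 600 observations, in line with the figure quoted here. Because the tolerance is squared in the denominator, halving it roughly quadruples the sample size.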

1693
01:13:56,900 --> 01:14:00,020
That's a big number, so it
just gives you an idea of that.

1694
01:14:00,020 --> 01:14:02,390
If you get greedy
with the tolerance,

1695
01:14:02,390 --> 01:14:08,580
you have to pay for the
surveyors to go out.

1696
01:14:08,580 --> 01:14:11,090
OK.

1697
01:14:11,090 --> 01:14:14,510
So the process is you determine
the needed sample size

1698
01:14:14,510 --> 01:14:18,140
just with the discussion of the
equations that we discussed.

1699
01:14:18,140 --> 01:14:19,850
Then you multiply
the sample sizes.

1700
01:14:23,675 --> 01:14:25,550
If you're using stratified
sampling or if you

1701
01:14:25,550 --> 01:14:28,280
have questions that
have multiple variables,

1702
01:14:28,280 --> 01:14:30,920
you need to then make sure
that you achieve that sample

1703
01:14:30,920 --> 01:14:34,200
size for each combination of
things that you're measuring.

1704
01:14:34,200 --> 01:14:36,830
So if you're, for example,
looking at not just

1705
01:14:36,830 --> 01:14:42,880
boardings, but proportion of
passengers that are car-owning,

1706
01:14:42,880 --> 01:14:43,810
who are pleased.

1707
01:14:43,810 --> 01:14:46,990
So you could just independently
measure pleased, independently

1708
01:14:46,990 --> 01:14:52,840
measure passengers
who own a car.

1709
01:14:52,840 --> 01:14:55,960
And you might have the
tolerance you need on each one,

1710
01:14:55,960 --> 01:14:57,910
but if you want the
combination of that,

1711
01:14:57,910 --> 01:14:59,650
now you need a higher
sample, because you

1712
01:14:59,650 --> 01:15:03,610
need that number for the
combination of those things.

1713
01:15:03,610 --> 01:15:05,230
Then there's a
clustering effect,

1714
01:15:05,230 --> 01:15:06,862
so a typical thing
if you're doing

1715
01:15:06,862 --> 01:15:08,820
the clustering of a whole
vehicle of passengers

1716
01:15:08,820 --> 01:15:12,190
is to multiply by 4.

1717
01:15:12,190 --> 01:15:15,010
And then for things like OD
matrices, the rule of thumb

1718
01:15:15,010 --> 01:15:17,927
is 20 times the number of cells.

1719
01:15:17,927 --> 01:15:18,760
What does that mean?

1720
01:15:18,760 --> 01:15:20,554
That if your OD matrix
is quite aggregate,

1721
01:15:20,554 --> 01:15:21,970
and it's at the
segment level-- so

1722
01:15:21,970 --> 01:15:24,700
say you divide a route
into two segments,

1723
01:15:24,700 --> 01:15:27,190
then your OD matrix
has four cells.

1724
01:15:27,190 --> 01:15:30,310
Four cells times 20, that's how
many people you have to survey.

1725
01:15:30,310 --> 01:15:33,269
If you do it
at the stop level,

1726
01:15:33,269 --> 01:15:35,560
then you have many more stops
and, therefore, many more

1727
01:15:35,560 --> 01:15:39,465
cells and, therefore, a
much higher sample size.

1728
01:15:39,465 --> 01:15:41,090
If you have a response
rate that is not

1729
01:15:41,090 --> 01:15:43,360
100%, which is always
the case, then you

1730
01:15:43,360 --> 01:15:46,480
have to expand by 1 minus that
in the reciprocal-- sorry, 1

1731
01:15:46,480 --> 01:15:48,080
over that in the reciprocal.

1732
01:15:48,080 --> 01:15:49,760
And then you get a
very large number,

1733
01:15:49,760 --> 01:15:51,860
and you say I don't have
the budget for that.

1734
01:15:51,860 --> 01:15:57,460
And you have to make tradeoffs
and figure out what you can do.

1735
01:15:57,460 --> 01:15:59,590
And maybe you have to--

1736
01:15:59,590 --> 01:16:01,750
maybe you can't collect
this combination

1737
01:16:01,750 --> 01:16:03,160
and know that accurately, right?

1738
01:16:03,160 --> 01:16:07,500
So you revise your expectations.
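
The expansion steps just described can be sketched in Python. This is a simplified illustration, not the course's official procedure: the function names are hypothetical, the clustering factor of 4 and the 20-per-cell rule are the rules of thumb quoted above, and multiplying by the number of combinations assumes every combination needs the full base sample.

```python
import math

def surveys_to_distribute(base_n, response_rate, cluster_factor=1.0, n_combinations=1):
    """Expand a base sample size into the number of surveys to hand out.

    base_n:         sample size from the tolerance equations
    n_combinations: combinations of variables that each need base_n (assumption)
    cluster_factor: e.g. 4 when sampling whole vehicles of passengers
    response_rate:  expected fraction of surveys answered, in (0, 1]
    """
    if not 0.0 < response_rate <= 1.0:
        raise ValueError("response_rate must be in (0, 1]")
    needed_responses = base_n * n_combinations * cluster_factor
    # Expand by the reciprocal of the response rate.
    return math.ceil(needed_responses / response_rate)

def od_matrix_sample(n_segments, per_cell=20):
    """Rule of thumb: 20 responses per cell of an n_segments x n_segments OD matrix."""
    return per_cell * n_segments ** 2

# 400 answered surveys needed, half of the people respond.
print(surveys_to_distribute(400, response_rate=0.5))
# Same, but clustering by whole vehicle.
print(surveys_to_distribute(400, response_rate=0.5, cluster_factor=4))
# A route divided into two segments has four OD cells.
print(od_matrix_sample(2))
```

The first call reproduces the earlier example of distributing 800 surveys to get 400 answered. The later calls show how quickly the total grows once clustering and finer OD matrices are included, which is where the budget tradeoff comes from.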

1739
01:16:07,500 --> 01:16:11,265
OK, with response
rates, you are concerned

1740
01:16:11,265 --> 01:16:12,640
with getting the
correct answers.

1741
01:16:12,640 --> 01:16:14,740
You also want to be getting
a high response rate.

1742
01:16:14,740 --> 01:16:17,281
If you don't get a high response
rate, there might be a bias.

1743
01:16:17,281 --> 01:16:19,730
So you have to worry about that.

1744
01:16:19,730 --> 01:16:21,250
If you have low
response rates, that

1745
01:16:21,250 --> 01:16:23,000
means you need to
distribute more surveys,

1746
01:16:23,000 --> 01:16:24,190
and that costs money.

1747
01:16:24,190 --> 01:16:26,240
And there's the bias
that I just mentioned,

1748
01:16:26,240 --> 01:16:29,980
so people who don't respond may
not be responding for a reason.

1749
01:16:29,980 --> 01:16:32,560
And then that
might bias your results.

1750
01:16:32,560 --> 01:16:34,990
And that might make you
decide something in planning

1751
01:16:34,990 --> 01:16:38,570
that is not the right decision
based on what actually happens.

1752
01:16:38,570 --> 01:16:41,410
So we call that the
non-response bias.

1753
01:16:41,410 --> 01:16:43,099
OK, so what happens?

1754
01:16:43,099 --> 01:16:44,890
People who don't respond
might be different

1755
01:16:44,890 --> 01:16:47,015
or might have responded
differently to the question

1756
01:16:47,015 --> 01:16:47,950
had they responded.

1757
01:16:47,950 --> 01:16:50,120
So here's some examples.

1758
01:16:50,120 --> 01:16:52,660
If you're surveying
people who are standing,

1759
01:16:52,660 --> 01:16:54,220
they are less comfortable.

1760
01:16:54,220 --> 01:16:57,280
And maybe it's a crowded bus--
they are less comfortable.

1761
01:16:57,280 --> 01:17:00,100
Or maybe they're getting
off one of those stops that

1762
01:17:00,100 --> 01:17:03,250
is coming up, so they are
less likely to have the time

1763
01:17:03,250 --> 01:17:05,020
to respond to your survey.

1764
01:17:05,020 --> 01:17:07,570
People with low
literacy, teenagers,

1765
01:17:07,570 --> 01:17:10,420
people who don't
speak the language,

1766
01:17:10,420 --> 01:17:11,870
are less likely to respond.

1767
01:17:11,870 --> 01:17:14,720
And they might have
different travel patterns.

1768
01:17:14,720 --> 01:17:16,710
So if you understand
those things,

1769
01:17:16,710 --> 01:17:18,392
and you get lower
samples for them,

1770
01:17:18,392 --> 01:17:20,350
you might be able to do
some sort of correction

1771
01:17:20,350 --> 01:17:21,670
to those biases.

1772
01:17:21,670 --> 01:17:23,740
But you have to pay attention.

1773
01:17:23,740 --> 01:17:25,390
How do you improve
your response rate?

1774
01:17:25,390 --> 01:17:28,150
Well you can make your
questions shorter.

1775
01:17:28,150 --> 01:17:29,950
You can do a quick oral survey.

1776
01:17:29,950 --> 01:17:32,890
That's what we're going
to do for this homework.

1777
01:17:32,890 --> 01:17:36,550
You can try to get information
from automatic sources whenever

1778
01:17:36,550 --> 01:17:37,100
possible.

1779
01:17:37,100 --> 01:17:42,100
So if you have an AFC system,
let's not collect boardings,

1780
01:17:42,100 --> 01:17:45,100
because we know that.

1781
01:17:45,100 --> 01:17:47,650
And then of course some
training, and just being kind,

1782
01:17:47,650 --> 01:17:51,610
and having supervision
helps a lot.

1783
01:17:51,610 --> 01:17:53,650
OK, here's some
suggested tolerances

1784
01:17:53,650 --> 01:17:55,340
for different things.

1785
01:17:55,340 --> 01:17:58,420
So we're looking here at
boardings or the peak load.

1786
01:17:58,420 --> 01:18:00,580
And you see here that
the suggested tolerance

1787
01:18:00,580 --> 01:18:05,290
is 30%, plus minus 30%, when
you have a route with one

1788
01:18:05,290 --> 01:18:05,980
to three buses.

1789
01:18:05,980 --> 01:18:07,270
And then as you have
more and more buses,

1790
01:18:07,270 --> 01:18:08,380
the tolerance decreases.

1791
01:18:08,380 --> 01:18:11,920
That means you require
a larger sample.

1792
01:18:11,920 --> 01:18:14,670
Why is that?

1793
01:18:14,670 --> 01:18:16,530
Why do you need a
bigger sample if you

1794
01:18:16,530 --> 01:18:20,238
have a route with more buses?

1795
01:18:20,238 --> 01:18:23,560
AUDIENCE: You're less likely
to sample a different bus.

1796
01:18:23,560 --> 01:18:26,830
GABRIEL SANCHEZ-MARTINEZ: Yes,
and when you have higher--

1797
01:18:26,830 --> 01:18:30,112
when you have more buses, you
tend to have higher frequency.

1798
01:18:30,112 --> 01:18:30,820
There's bunching.

1799
01:18:30,820 --> 01:18:35,260
OK, so if you then survey
loads, for example,

1800
01:18:35,260 --> 01:18:38,590
and you only get a few
because of the bunching effect

1801
01:18:38,590 --> 01:18:40,380
and because there
are more buses,

1802
01:18:40,380 --> 01:18:42,910
and you're observing a
smaller percentage of them

1803
01:18:42,910 --> 01:18:45,490
for a given time
period, say, you're

1804
01:18:45,490 --> 01:18:48,760
less likely to have observed
the bus that was really crowded,

1805
01:18:48,760 --> 01:18:49,330
right?

1806
01:18:49,330 --> 01:18:52,390
So that means that you need
to decrease your tolerance.

1807
01:18:52,390 --> 01:18:55,490
And therefore, it's more
expensive to survey that.

1808
01:18:55,490 --> 01:18:56,350
OK, good.

1809
01:18:56,350 --> 01:19:00,220
Trip time-- 10% for routes
with less than 20 minutes,

1810
01:19:00,220 --> 01:19:03,490
5% with routes of
greater than 20 minutes.

1811
01:19:03,490 --> 01:19:06,100
Similar concept if you have
greater than 20 minutes--

1812
01:19:09,540 --> 01:19:11,170
there can be just
more variability,

1813
01:19:11,170 --> 01:19:14,860
and you really want
to get that right.

1814
01:19:14,860 --> 01:19:16,900
When you have less
than 20 minutes,

1815
01:19:16,900 --> 01:19:20,080
your decision on
cycle times and things

1816
01:19:20,080 --> 01:19:22,960
like this are not going to have
as much impact on the fleet

1817
01:19:22,960 --> 01:19:24,550
size that you require.

1818
01:19:24,550 --> 01:19:31,720
As you get bigger running
times, a small percentage change

1819
01:19:31,720 --> 01:19:34,600
in the mean could
influence how many buses

1820
01:19:34,600 --> 01:19:37,750
you need to dedicate to
that and the cost of running

1821
01:19:37,750 --> 01:19:39,990
that service.

1822
01:19:39,990 --> 01:19:43,610
On-time performance-- 10%
absolute equivalent tolerance.

1823
01:19:43,610 --> 01:19:47,260
These are typical values-- don't
take them as gospel, please.

1824
01:19:47,260 --> 01:19:49,530
And these are for reporting,
not for anything that's

1825
01:19:49,530 --> 01:19:51,750
very critical for operations.

1826
01:19:51,750 --> 01:19:53,560
Some of them are.

1827
01:19:53,560 --> 01:19:56,820
Yeah, 30% at least, I would
say, is for reporting.

1828
01:19:56,820 --> 01:20:00,840
I wouldn't make any
critical decisions with 30%.

1829
01:20:00,840 --> 01:20:04,410
On-time performance-- we're
talking here about whether

1830
01:20:04,410 --> 01:20:07,060
a trip is on time
or not on time--

1831
01:20:07,060 --> 01:20:08,970
so Bernoulli trials, right?

1832
01:20:08,970 --> 01:20:10,650
And there's a
proportion of trips

1833
01:20:10,650 --> 01:20:15,120
that are on time, and what
we do is that, we essentially

1834
01:20:15,120 --> 01:20:19,710
say plus-- if we say plus
minus 10%, then we're saying

1835
01:20:19,710 --> 01:20:22,980
that the sample size
should be 1 over 0.1.

1836
01:20:22,980 --> 01:20:23,480
Yeah.

1837
01:20:26,330 --> 01:20:28,160
All right, default
coefficient-- these

1838
01:20:28,160 --> 01:20:29,576
are default values
for coefficient

1839
01:20:29,576 --> 01:20:30,950
of variation of key data items.

1840
01:20:30,950 --> 01:20:33,170
Ideally, you have your
own data that you look at,

1841
01:20:33,170 --> 01:20:34,790
and you don't resort to this.

1842
01:20:34,790 --> 01:20:37,910
But if you ever find
yourself in a situation

1843
01:20:37,910 --> 01:20:40,470
where you need to start
out with something.

1844
01:20:40,470 --> 01:20:45,440
Here are some based on studies
that previous [AUDIO OUT] They

1845
01:20:45,440 --> 01:20:48,590
took different routes
and looked at loads

1846
01:20:48,590 --> 01:20:51,440
and running times for
different time periods

1847
01:20:51,440 --> 01:20:53,570
and found what the coefficients
of variations were.

1848
01:20:53,570 --> 01:20:56,420
And here they are on a
table for you to use.

1849
01:21:00,947 --> 01:21:03,530
In the interest of time, since
I want to discuss the homework,

1850
01:21:03,530 --> 01:21:05,750
I'm going to stop
here with slide 25.

1851
01:21:05,750 --> 01:21:11,480
And I'm going to not cover
the whole process, which

1852
01:21:11,480 --> 01:21:14,880
includes the monitoring phase.

1853
01:21:14,880 --> 01:21:17,790
And in this slide
here, we have how you

1854
01:21:17,790 --> 01:21:20,010
establish a conversion factor.

1855
01:21:20,010 --> 01:21:23,870
The conversion factor in
itself has a variance.

1856
01:21:23,870 --> 01:21:26,130
So there's some uncertainty
about the relationship

1857
01:21:26,130 --> 01:21:31,380
that you estimate between
your baseline data item

1858
01:21:31,380 --> 01:21:33,550
and your auxiliary data item.

1859
01:21:33,550 --> 01:21:37,210
So you need to consider
that in your sample size.

1860
01:21:37,210 --> 01:21:39,720
And here are some tables
with some examples of what

1861
01:21:39,720 --> 01:21:44,130
happens when you require
different-- well, when your

1862
01:21:44,130 --> 01:21:47,940
variability or
your coefficient

1863
01:21:47,940 --> 01:21:52,650
of variation of your
relationship increases

1864
01:21:52,650 --> 01:21:54,390
or decreases.

1865
01:21:54,390 --> 01:21:56,430
OK, let's look at the homework.

1866
01:21:56,430 --> 01:21:59,893
I really want to use these
last five minutes for that.

1867
01:21:59,893 --> 01:22:07,560
So please take one and pass.

1868
01:22:07,560 --> 01:22:12,300
OK, so the MBTA, there's
a proposal here in Boston

1869
01:22:12,300 --> 01:22:14,850
of taking Route 70 and 70A--

1870
01:22:14,850 --> 01:22:17,370
they run through
Waltham, and they

1871
01:22:17,370 --> 01:22:20,130
go into around Central Square.

1872
01:22:20,130 --> 01:22:23,690
And some people are saying those
two routes should be extended

1873
01:22:23,690 --> 01:22:28,430
to Kendall Square,
because a lot of people

1874
01:22:28,430 --> 01:22:31,700
are actually going to MIT, or
Kendall Square, or the Kendall

1875
01:22:31,700 --> 01:22:33,890
Square area--

1876
01:22:33,890 --> 01:22:37,280
not just Kendall Square Station,
but the whole area around.

1877
01:22:37,280 --> 01:22:39,350
So if it's true, a
lot of people could

1878
01:22:39,350 --> 01:22:40,670
benefit from that extension.

1879
01:22:40,670 --> 01:22:41,670
And we don't know.

1880
01:22:41,670 --> 01:22:43,140
So what are you going to do?

1881
01:22:43,140 --> 01:22:45,620
You're going to go
to a specific stop

1882
01:22:45,620 --> 01:22:48,620
where it is very likely that
the people who would be going

1883
01:22:48,620 --> 01:22:52,430
to MIT or those areas of Kendall
Square that would benefit

1884
01:22:52,430 --> 01:22:55,040
from this extension
would alight,

1885
01:22:55,040 --> 01:22:57,140
and you're going to
ask people, would you

1886
01:22:57,140 --> 01:23:01,250
have stayed on your bus
if this bus had continued

1887
01:23:01,250 --> 01:23:02,960
to MIT and Kendall Square?

1888
01:23:02,960 --> 01:23:06,635
It's a simple oral survey, yes
or no question, one question.

1889
01:23:06,635 --> 01:23:08,510
You're going to work in
teams of four people.

1890
01:23:13,130 --> 01:23:16,670
The stop that you're going
to station yourself in

1891
01:23:16,670 --> 01:23:18,296
is shown in figure 3.

1892
01:23:21,230 --> 01:23:23,360
And you're going to collect
data for the AM peak,

1893
01:23:23,360 --> 01:23:25,760
from 7:30 to 9:30.

1894
01:23:25,760 --> 01:23:27,320
You pick the day.

1895
01:23:27,320 --> 01:23:29,090
The teams are
assigned on Stellar,

1896
01:23:29,090 --> 01:23:32,150
so please log into Stellar
and see what your team is

1897
01:23:32,150 --> 01:23:34,910
and coordinate with
them to pick a day.

1898
01:23:34,910 --> 01:23:37,580
And tell me what that
day is, because--

1899
01:23:37,580 --> 01:23:39,410
actually, right after
class, I'm going

1900
01:23:39,410 --> 01:23:43,370
to set up a shared spreadsheet
that you can all access.

1901
01:23:43,370 --> 01:23:46,519
And just go into that
spreadsheet and pick a day.

1902
01:23:46,519 --> 01:23:48,560
I'm going to put all the
days that are available,

1903
01:23:48,560 --> 01:23:51,170
and you can say team
1, team 2, et cetera.

1904
01:23:51,170 --> 01:23:54,410
Make sure that two teams
don't go on the same day.

1905
01:23:54,410 --> 01:23:56,400
We want data from
different days.

1906
01:23:56,400 --> 01:23:58,400
And you're going to all
bring that data together

1907
01:23:58,400 --> 01:24:00,650
in that same
spreadsheet, and there

1908
01:24:00,650 --> 01:24:03,650
are some questions
for you to analyze

1909
01:24:03,650 --> 01:24:06,230
the data that you collected,
all of the class collected

1910
01:24:06,230 --> 01:24:08,540
together.

1911
01:24:08,540 --> 01:24:12,980
You're measuring the
percent of people who would

1912
01:24:12,980 --> 01:24:14,450
have stayed on the bus, right?

1913
01:24:14,450 --> 01:24:18,730
So it's a proportion.

1914
01:24:18,730 --> 01:24:22,960
And one submission per team
in PDF format to Stellar.

1915
01:24:22,960 --> 01:24:26,410
This is due March
7, but in order

1916
01:24:26,410 --> 01:24:28,830
to leave you enough
time to do the analysis,

1917
01:24:28,830 --> 01:24:31,590
the data collection efforts
should be done by February 28.

1918
01:24:31,590 --> 01:24:37,150
So please submit your data by
the end of Tuesday, February 28

1919
01:24:37,150 --> 01:24:41,230
at midnight, say, or sometime
before the beginning of March

1920
01:24:41,230 --> 01:24:43,300
in the morning,
where a person would

1921
01:24:43,300 --> 01:24:45,100
be trying to analyze your data.

1922
01:24:48,510 --> 01:24:51,280
OK, if you have
questions, let me know.

1923
01:24:51,280 --> 01:24:56,250
And if not, have fun.

1924
01:24:56,250 --> 01:24:58,910
Remember that assignment
1 is due Thursday.

1925
01:25:01,531 --> 01:25:02,030
Eric?

1926
01:25:02,030 --> 01:25:03,822
AUDIENCE: Just the one question:
[? is that ?] [? this is ?]

1927
01:25:03,822 --> 01:25:06,140
going to miss anyone who is
transferred to the Red Line

1928
01:25:06,140 --> 01:25:07,746
to then go to Kendall Square.

1929
01:25:07,746 --> 01:25:09,620
GABRIEL SANCHEZ-MARTINEZ:
And going back to--

1930
01:25:09,620 --> 01:25:10,120
let's see.

1931
01:25:14,430 --> 01:25:17,660
I forget where I had it.

1932
01:25:17,660 --> 01:25:20,630
Well, I guess what I-- there
was a point I made earlier

1933
01:25:20,630 --> 01:25:23,730
where we can measure that from
automatically collected data,

1934
01:25:23,730 --> 01:25:24,230
right?

1935
01:25:24,230 --> 01:25:24,950
AUDIENCE: OK.

1936
01:25:24,950 --> 01:25:25,610
GABRIEL SANCHEZ-MARTINEZ:
Does that make sense?

1937
01:25:25,610 --> 01:25:27,860
AUDIENCE: Yeah, people who
[? car up ?] come from 70.

1938
01:25:27,860 --> 01:25:29,235
GABRIEL
SANCHEZ-MARTINEZ: So if I

1939
01:25:29,235 --> 01:25:32,410
see you tapping on
the 70 or the 70A,

1940
01:25:32,410 --> 01:25:35,480
and then I see you
tapping at Central Square,

1941
01:25:35,480 --> 01:25:37,790
I can infer that you
were using the service

1942
01:25:37,790 --> 01:25:40,790
to transfer to Central Square.

1943
01:25:40,790 --> 01:25:42,950
And then we'll
cover ODX, which is

1944
01:25:42,950 --> 01:25:44,480
an inference model
for destinations

1945
01:25:44,480 --> 01:25:46,210
later in this course.

1946
01:25:46,210 --> 01:25:51,170
But looking at the sequence
of taps, I can infer--

1947
01:25:51,170 --> 01:25:53,420
we can infer-- what the
destination of that bus trip

1948
01:25:53,420 --> 01:25:53,919
was.

1949
01:25:53,919 --> 01:25:55,720
We can infer that
it was the stop that

1950
01:25:55,720 --> 01:25:57,440
was closest to Central.

1951
01:25:57,440 --> 01:26:00,710
And later that day,
presumably the person

1952
01:26:00,710 --> 01:26:04,310
who might be going to Kendall
Square Station after work taps

1953
01:26:04,310 --> 01:26:05,199
at Kendall Square.

1954
01:26:05,199 --> 01:26:07,490
So I might think, oh, he took
the Red Line from Central

1955
01:26:07,490 --> 01:26:09,500
to Kendall.

1956
01:26:09,500 --> 01:26:12,320
So I don't need to ask those
people where they're going.

1957
01:26:12,320 --> 01:26:14,880
And anyway, they might not
care about this extension.

1958
01:26:14,880 --> 01:26:17,570
So we're going to stand
on the bus stop that

1959
01:26:17,570 --> 01:26:21,590
is after Central Square and see
where those people are going

1960
01:26:21,590 --> 01:26:25,350
and whether they would
have stayed on that bus.

1961
01:26:25,350 --> 01:26:28,119
AUDIENCE: Is this an
actual [INAUDIBLE]

1962
01:26:28,119 --> 01:26:30,410
GABRIEL SANCHEZ-MARTINEZ:
Some people are proposing it.

1963
01:26:30,410 --> 01:26:33,060
It is a real proposal.

1964
01:26:33,060 --> 01:26:35,230
The MBTA is a big organization.

1965
01:26:35,230 --> 01:26:41,490
So I can't say that the
MBTA wants to do this

1966
01:26:41,490 --> 01:26:43,210
or doesn't want to do this.

1967
01:26:43,210 --> 01:26:45,090
But some people are interested.

1968
01:26:45,090 --> 01:26:48,630
And it will get looked into.

1969
01:26:48,630 --> 01:26:50,770
So it's useful.

1970
01:26:50,770 --> 01:26:53,195
AUDIENCE: [? Can ?] [? we ?]
[? share ?] [INAUDIBLE]

1971
01:26:53,195 --> 01:26:56,105
GABRIEL SANCHEZ-MARTINEZ:
Yeah, why not?

1972
01:26:56,105 --> 01:26:58,045
AUDIENCE: [INAUDIBLE]

1973
01:26:58,045 --> 01:27:00,470
GABRIEL SANCHEZ-MARTINEZ: Yeah.

1974
01:27:00,470 --> 01:27:03,320
And I guess one other
thing that I-- yeah,

1975
01:27:03,320 --> 01:27:06,250
so we're going to
probably make of this

1976
01:27:06,250 --> 01:27:08,594
like a theme of assignments.

1977
01:27:08,594 --> 01:27:10,760
So there's going to be
another assignment on service

1978
01:27:10,760 --> 01:27:12,510
planning, operations planning.

1979
01:27:12,510 --> 01:27:15,160
So we're going to start looking
at this combination of Route 70

1980
01:27:15,160 --> 01:27:19,760
and 70A, and we're going
to essentially make

1981
01:27:19,760 --> 01:27:22,520
a thread of this and do
some serious planning

1982
01:27:22,520 --> 01:27:26,300
on some scenarios where the 70
and the 70A could be merged.

1983
01:27:26,300 --> 01:27:29,860
And they could maybe be
terminated a little--

1984
01:27:29,860 --> 01:27:32,810
yeah, we'll make some
changes to the service plan

1985
01:27:32,810 --> 01:27:34,440
under some
hypothetical scenarios.

1986
01:27:34,440 --> 01:27:38,840
And you'll get a chance to do
an operations plan on these.

1987
01:27:38,840 --> 01:27:41,550
And then the last homework
will be on policy,

1988
01:27:41,550 --> 01:27:44,170
so there might be
some policy questions

1989
01:27:44,170 --> 01:27:47,930
that I have in mind about
what we could do about

1990
01:27:47,930 --> 01:27:52,640
service outside, on the outer
parts of the 70 and 70A.

1991
01:27:57,440 --> 01:27:59,290
All right?