1
00:00:00,060 --> 00:00:01,740
The mathematics of
the correlation

2
00:00:01,740 --> 00:00:03,810
coefficient are important.

3
00:00:03,810 --> 00:00:06,410
But it is perhaps more important
to be able to

4
00:00:06,410 --> 00:00:08,370
interpret it correctly.

5
00:00:08,370 --> 00:00:12,400
A correlation coefficient of
let's say 0.5, tells us that

6
00:00:12,400 --> 00:00:15,790
something interesting is going
on as far as the relation of X

7
00:00:15,790 --> 00:00:17,910
and Y is concerned.

8
00:00:17,910 --> 00:00:19,300
But what exactly?

9
00:00:19,300 --> 00:00:21,570
It tells us that the two
random variables are

10
00:00:21,570 --> 00:00:23,770
associated in some sense.

11
00:00:23,770 --> 00:00:26,740
But this is often misinterpreted
to mean that

12
00:00:26,740 --> 00:00:29,950
there is a causal relation
between the two.

13
00:00:29,950 --> 00:00:32,155
But this is wrong.

14
00:00:32,155 --> 00:00:35,950
A large correlation coefficient
in general does

15
00:00:35,950 --> 00:00:39,580
not indicate that there is a
causal relation between the

16
00:00:39,580 --> 00:00:41,450
random variables.

17
00:00:41,450 --> 00:00:45,220
As an example, suppose that
X somehow quantifies the

18
00:00:45,220 --> 00:00:47,310
mathematical aptitude
of a person.

19
00:00:47,310 --> 00:00:52,290
And Y somehow quantifies the
musical ability of a person.

20
00:00:52,290 --> 00:00:55,450
In general, it has been found
that mathematical aptitude and

21
00:00:55,450 --> 00:00:57,780
musical ability are
correlated.

22
00:00:57,780 --> 00:01:00,800
People who score high on
one tend to score high on

23
00:01:00,800 --> 00:01:02,420
the other as well.

24
00:01:02,420 --> 00:01:04,720
Is there a causal relation?

25
00:01:04,720 --> 00:01:09,789
If you study math a lot and you
become very good at math,

26
00:01:09,789 --> 00:01:12,890
does it mean that you would
become a better musician?

27
00:01:12,890 --> 00:01:14,140
Not necessarily.

28
00:01:14,140 --> 00:01:18,210
Or if you practice the violin
day in and day out, does it

29
00:01:18,210 --> 00:01:21,510
mean that you will score better
in the math exam?

30
00:01:21,510 --> 00:01:23,360
Again, not necessarily.

31
00:01:23,360 --> 00:01:26,760
Perhaps what is going on is that
there's a certain feature

32
00:01:26,760 --> 00:01:29,630
of the human brain and when
that feature is well

33
00:01:29,630 --> 00:01:35,289
developed, then that feature
helps both in math and in

34
00:01:35,289 --> 00:01:36,750
musical ability.

35
00:01:36,750 --> 00:01:39,920
And this is a typical situation
of how a correlation

36
00:01:39,920 --> 00:01:41,940
coefficient may arise.

37
00:01:41,940 --> 00:01:44,340
That is, often a correlation
coefficient that's

38
00:01:44,340 --> 00:01:48,570
significant reflects that there
is an underlying common

39
00:01:48,570 --> 00:01:52,810
but perhaps hidden factor that
affects both of the random

40
00:01:52,810 --> 00:01:54,710
variables X and Y.

41
00:01:54,710 --> 00:01:58,640
Let us go through a simple
numerical example that models

42
00:01:58,640 --> 00:02:00,980
a situation of this kind.

43
00:02:00,980 --> 00:02:04,440
Suppose that Z, V, and W are
independent random variables.

44
00:02:04,440 --> 00:02:07,120
And that we have two more random
variables defined by

45
00:02:07,120 --> 00:02:08,490
these relations: X = Z + V
and Y = Z + W.

46
00:02:08,490 --> 00:02:13,110
Note that there's no direct
influence from X to Y or from

47
00:02:13,110 --> 00:02:14,520
Y to X.

48
00:02:14,520 --> 00:02:16,990
But on the other hand, there's
a common underlying factor,

49
00:02:16,990 --> 00:02:20,880
this random variable Z that
affects both X and Y. Because

50
00:02:20,880 --> 00:02:25,270
of this, we expect that X and Y
will somehow have some kind

51
00:02:25,270 --> 00:02:28,000
of relation or association
between them.

52
00:02:28,000 --> 00:02:30,210
And we would like to measure
the strength of that

53
00:02:30,210 --> 00:02:31,460
association.

54
00:02:31,460 --> 00:02:34,329
The way to measure it will be
in terms of the correlation

55
00:02:34,329 --> 00:02:37,800
coefficient, which we will
now proceed to compute.

56
00:02:37,800 --> 00:02:41,890
To have a complete example in
our hands and in order to also

57
00:02:41,890 --> 00:02:44,630
keep things simple, let
us assume that the basic

58
00:02:44,630 --> 00:02:48,540
underlying random variables Z, V,
and W all have 0 means and

59
00:02:48,540 --> 00:02:50,240
unit variances.

60
00:02:50,240 --> 00:02:52,880
And now let us take the
definition of the correlation

61
00:02:52,880 --> 00:02:56,090
coefficient and start
calculating.

62
00:02:56,090 --> 00:03:01,600
Let us look at the variance of
X. Because X is the sum of two

63
00:03:01,600 --> 00:03:05,300
independent random variables,
its variance is going to be

64
00:03:05,300 --> 00:03:08,290
the sum of those variances.

65
00:03:08,290 --> 00:03:11,370
And we have assumed that each
one of those variances is

66
00:03:11,370 --> 00:03:12,100
equal to 1.

67
00:03:12,100 --> 00:03:15,050
So the variance of
X is equal to 2.

68
00:03:15,050 --> 00:03:18,000
And that implies that the
standard deviation of X is

69
00:03:18,000 --> 00:03:20,270
equal to the square root of 2.

70
00:03:20,270 --> 00:03:23,650
By a similar argument, the
standard deviation of Y is

71
00:03:23,650 --> 00:03:26,440
also equal to the square
root of 2.

72
00:03:26,440 --> 00:03:31,350
Now, let us look at the
covariance between X and Y.

73
00:03:31,350 --> 00:03:36,660
Because X and Y have 0 means,
the covariance is just the

74
00:03:36,660 --> 00:03:40,160
expected value of the product
of the two random variables.

75
00:03:40,160 --> 00:03:42,990
And using the definition of
what these two random

76
00:03:42,990 --> 00:03:47,450
variables are, it's this
particular product here.

77
00:03:47,450 --> 00:03:51,040
We expand the product into
a sum of four terms.

78
00:03:51,040 --> 00:03:54,590
And take the expected value of
each one of the four terms.

79
00:04:04,170 --> 00:04:11,580
Which leaves us with this
particular expression here.

80
00:04:11,580 --> 00:04:15,240
Now, Z has 0 mean and
unit variance.

81
00:04:15,240 --> 00:04:18,940
Therefore, the expected value
of Z squared is equal to 1.

82
00:04:18,940 --> 00:04:20,680
How about the next term?

83
00:04:20,680 --> 00:04:22,240
V and Z are independent.

84
00:04:22,240 --> 00:04:25,140
So the expected value of the
product is the product of the

85
00:04:25,140 --> 00:04:26,210
expected values.

86
00:04:26,210 --> 00:04:30,410
But the expected values are
zero, so this term is zero.

87
00:04:30,410 --> 00:04:32,500
And with a similar argument,
the other

88
00:04:32,500 --> 00:04:35,150
terms are zero as well.

89
00:04:35,150 --> 00:04:37,970
So the covariance
is equal to 1.

90
00:04:37,970 --> 00:04:43,350
And from this, we can conclude
our calculation and write that

91
00:04:43,350 --> 00:04:48,340
the correlation coefficient
between X and Y is equal to 1

92
00:04:48,340 --> 00:04:53,270
divided by the square root of
2 times square root of 2,

93
00:04:53,270 --> 00:04:54,520
which is 1/2.
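
The calculation above can be checked numerically. What follows is only a minimal sketch, under some assumptions: it takes the relations to be X = Z + V and Y = Z + W, which is what the variance and covariance steps imply, and it draws Z, V, and W from a standard normal distribution, even though the argument only needs zero means and unit variances. The function name, sample size, and seed are hypothetical choices made for illustration.

```python
import math
import random


def simulate_correlation(n_samples=100_000, seed=0):
    """Estimate the correlation coefficient of X = Z + V and Y = Z + W."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n_samples):
        # Assumption: standard normal draws (zero mean, unit variance).
        z = rng.gauss(0.0, 1.0)  # common hidden factor
        v = rng.gauss(0.0, 1.0)  # idiosyncratic part of X
        w = rng.gauss(0.0, 1.0)  # idiosyncratic part of Y
        xs.append(z + v)
        ys.append(z + w)

    mean_x = sum(xs) / n_samples
    mean_y = sum(ys) / n_samples
    # Sample covariance and variances (dividing by n; the factor cancels in the ratio).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n_samples
    var_x = sum((x - mean_x) ** 2 for x in xs) / n_samples
    var_y = sum((y - mean_y) ** 2 for y in ys) / n_samples
    return cov_xy / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    print(simulate_correlation())
```

With this many samples the estimate should land close to 1/2, although the exact digits depend on the seed.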

94
00:04:56,940 --> 00:05:00,610
This example also serves to give
you a rough idea of what

95
00:05:00,610 --> 00:05:02,790
it may mean to have
a correlation

96
00:05:02,790 --> 00:05:05,320
coefficient of 1/2.

97
00:05:05,320 --> 00:05:07,510
It means that the two
random variables

98
00:05:07,510 --> 00:05:10,310
have some common elements.

99
00:05:10,310 --> 00:05:13,690
And they also have some
idiosyncratic elements.

100
00:05:13,690 --> 00:05:18,390
And these two elements are
roughly equal in weight.

101
00:05:18,390 --> 00:05:22,220
If V and W were completely
absent, the correlation

102
00:05:22,220 --> 00:05:24,405
coefficient would have been 1.

103
00:05:24,405 --> 00:05:29,900
If, on the other hand, V and W had
a huge variance, so as to

104
00:05:29,900 --> 00:05:33,810
completely hide the effect of
Z, then the value of the

105
00:05:33,810 --> 00:05:37,490
correlation coefficient would
have been much, much smaller,

106
00:05:37,490 --> 00:05:39,480
perhaps closer to 0.

107
00:05:39,480 --> 00:05:42,260
And in the extreme case of
course where Z is completely

108
00:05:42,260 --> 00:05:45,610
absent, then X and Y are
independent, and we get a

109
00:05:45,610 --> 00:05:47,300
correlation coefficient of 0.
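
The limiting cases just described can be sketched with a small formula. Assuming again that X = Z + V and Y = Z + W with Z, V, and W independent, the covariance of X and Y is the variance of Z, and the variances of X and Y are the sums of the variances of their two components. The function name and its parameters below are hypothetical, introduced only for this illustration.

```python
import math


def correlation(var_z, var_v, var_w):
    """Correlation coefficient of X = Z + V and Y = Z + W for independent Z, V, W.

    cov(X, Y) = var(Z), var(X) = var(Z) + var(V), var(Y) = var(Z) + var(W).
    """
    return var_z / math.sqrt((var_z + var_v) * (var_z + var_w))


if __name__ == "__main__":
    print(correlation(1.0, 1.0, 1.0))      # the case worked out in the lecture: 0.5
    print(correlation(1.0, 0.0, 0.0))      # V and W absent: 1.0
    print(correlation(1.0, 100.0, 100.0))  # V and W with huge variance: roughly 0.0099
    print(correlation(0.0, 1.0, 1.0))      # Z absent: 0.0
```

The common factor Z enters both the numerator and the denominator, while V and W only enlarge the denominator, which is why more idiosyncratic variance pushes the coefficient toward 0.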