1
00:00:04,490 --> 00:00:07,240
Claims data offers an expansive
view of the patient's health

2
00:00:07,240 --> 00:00:08,230
history.

3
00:00:08,230 --> 00:00:10,850
Specifically, claims
data include information

4
00:00:10,850 --> 00:00:14,710
on demographics, medical
history, and medications.

5
00:00:14,710 --> 00:00:17,240
They offer insights
regarding a patient's risk.

6
00:00:17,240 --> 00:00:21,710
And, as I will demonstrate,
they may reveal indicative signals

7
00:00:21,710 --> 00:00:23,050
and patterns.

8
00:00:23,050 --> 00:00:25,980
We'll use health
insurance claims

9
00:00:25,980 --> 00:00:29,690
filed for about 7,000
members from January, 2000

10
00:00:29,690 --> 00:00:34,620
until November, 2007.

11
00:00:34,620 --> 00:00:37,980
We concentrated on members
with the following attributes.

12
00:00:37,980 --> 00:00:41,200
At least five claims with
coronary artery disease

13
00:00:41,200 --> 00:00:44,720
diagnosis, at least five
claims with hypertension

14
00:00:44,720 --> 00:00:48,480
diagnostic codes, at least
100 total medical claims,

15
00:00:48,480 --> 00:00:51,420
at least five pharmacy
claims, and data

16
00:00:51,420 --> 00:00:53,740
from at least five years.

17
00:00:53,740 --> 00:00:57,020
These selections yield
patients with a high risk

18
00:00:57,020 --> 00:00:58,030
of heart attack

19
00:00:58,030 --> 00:01:00,740
and a reasonably rich medical
history with continuous

20
00:01:00,740 --> 00:01:01,240
coverage.

21
00:01:04,819 --> 00:01:08,360
Let us discuss how we've
aggregated this data.

22
00:01:08,360 --> 00:01:11,930
The resulting data set
includes about 20 million health

23
00:01:11,930 --> 00:01:14,520
insurance entries, including
individual medical

24
00:01:14,520 --> 00:01:16,150
and pharmaceutical records.

25
00:01:16,150 --> 00:01:19,600
Diagnosis, procedure,
and drug codes in the data

26
00:01:19,600 --> 00:01:22,640
set comprised tens of
thousands of attributes.

27
00:01:22,640 --> 00:01:25,460
The codes were
aggregated into groups.

28
00:01:25,460 --> 00:01:29,500
218 diagnosis groups,
180 procedure groups,

29
00:01:29,500 --> 00:01:33,150
538 drug groups.

30
00:01:33,150 --> 00:01:35,350
46 diagnosis groups
were considered

31
00:01:35,350 --> 00:01:38,720
by clinicians as possible risk
factors for heart attacks.

32
00:01:43,240 --> 00:01:46,759
Let us discuss how we
view the data over time.

33
00:01:46,759 --> 00:01:49,940
It is important in this study
to view the medical records

34
00:01:49,940 --> 00:01:53,100
chronologically, and to
represent a patient's diagnosis

35
00:01:53,100 --> 00:01:54,770
profile over time.

36
00:01:54,770 --> 00:01:59,970
So we record the cost and number
of medical claims and hospital

37
00:01:59,970 --> 00:02:02,390
visits by diagnosis.

38
00:02:02,390 --> 00:02:06,440
All the observations we have
span over five years of data.

39
00:02:06,440 --> 00:02:11,890
They were split into 21
periods, each 90 days in length.

40
00:02:11,890 --> 00:02:15,110
We examine nine months
of diagnostic history,

41
00:02:15,110 --> 00:02:18,440
leading up to a heart attack
or no-heart-attack event,

42
00:02:18,440 --> 00:02:23,980
and align the data to make
observations date-independent,

43
00:02:23,980 --> 00:02:26,910
while preserving
the order of events.

44
00:02:26,910 --> 00:02:29,950
We recorded the diagnostic
history in three periods.

45
00:02:29,950 --> 00:02:33,230
Zero to three months
before the event,

46
00:02:33,230 --> 00:02:35,090
three to six months
before the event,

47
00:02:35,090 --> 00:02:37,600
and six to nine months
before the event.

48
00:02:40,280 --> 00:02:42,940
What was the target variable
we're trying to predict?

49
00:02:42,940 --> 00:02:44,940
The target prediction
variable is the occurrence

50
00:02:44,940 --> 00:02:46,350
of a heart attack.

51
00:02:46,350 --> 00:02:49,730
We define this from a
combination of several claims.

52
00:02:49,730 --> 00:02:52,060
Namely, diagnosis
of a heart attack,

53
00:02:52,060 --> 00:02:54,700
alongside a trip to
the emergency room,

54
00:02:54,700 --> 00:02:58,000
followed by subsequent
hospitalization.

55
00:02:58,000 --> 00:03:00,510
Only considering
heart attack diagnoses

56
00:03:00,510 --> 00:03:03,340
that are associated with a
visit to an emergency room

57
00:03:03,340 --> 00:03:05,810
and following
hospitalization helps

58
00:03:05,810 --> 00:03:09,050
ensure that the target outcome
is in fact a heart attack

59
00:03:09,050 --> 00:03:10,100
event.

60
00:03:10,100 --> 00:03:12,190
The target variable is binary.

61
00:03:12,190 --> 00:03:14,670
It is denoted by
plus 1 or minus 1

62
00:03:14,670 --> 00:03:16,790
for the occurrence or
non-occurrence of a heart

63
00:03:16,790 --> 00:03:19,790
attack in the targeted
period of 90 days.

64
00:03:22,720 --> 00:03:24,090
How's the data organized?

65
00:03:24,090 --> 00:03:26,690
There were 147 variables.

66
00:03:26,690 --> 00:03:29,650
Variable one is the patient's
identification number,

67
00:03:29,650 --> 00:03:32,030
and variable two is
the patient's gender.

68
00:03:32,030 --> 00:03:35,160
There were variables related
to the diagnosis group

69
00:03:35,160 --> 00:03:38,410
counts nine, six, and three
months before the heart attack

70
00:03:38,410 --> 00:03:39,660
target period.

71
00:03:39,660 --> 00:03:44,950
There were variables related
to the total cost nine, six,

72
00:03:44,950 --> 00:03:46,910
and three months before
the heart attack target

73
00:03:46,910 --> 00:03:51,630
period, and the final
variable, 147,

74
00:03:51,630 --> 00:03:54,870
includes the classification of
whether the event was a heart

75
00:03:54,870 --> 00:03:56,020
attack or not.

76
00:03:58,860 --> 00:04:02,940
Cost of medical care is a good
summary of a person's health.

77
00:04:02,940 --> 00:04:08,110
In our database, the total cost
of medical care in the three

78
00:04:08,110 --> 00:04:15,160
90-day periods preceding the heart
attack target event ranged from

79
00:04:15,160 --> 00:04:20,760
$0.00 to $636,000
and approximately 70%

80
00:04:20,760 --> 00:04:24,880
of the overall costs were
generated by only 11%

81
00:04:24,880 --> 00:04:26,360
of the population.

82
00:04:26,360 --> 00:04:28,920
This means that
the patients

83
00:04:28,920 --> 00:04:32,460
with high medical expenses
are a very small proportion

84
00:04:32,460 --> 00:04:35,810
of the data, and could
skew our final results.

85
00:04:35,810 --> 00:04:39,700
According to the American
Medical Association, only 10%

86
00:04:39,700 --> 00:04:42,640
of individuals have
projected medical expenses

87
00:04:42,640 --> 00:04:45,790
of approximately
$10,000 or greater

88
00:04:45,790 --> 00:04:48,960
per year, which is more
than four times greater

89
00:04:48,960 --> 00:04:52,470
than the average projected
medical expenses of $2,400

90
00:04:52,470 --> 00:04:53,880
per year.

91
00:04:53,880 --> 00:04:57,090
To lessen the effects of
these high-cost outliers,

92
00:04:57,090 --> 00:04:59,960
we divided the data into
different cost buckets,

93
00:04:59,960 --> 00:05:03,770
based on the findings of the
American Medical Association.

94
00:05:03,770 --> 00:05:06,140
We did not want to
have too many cost bins

95
00:05:06,140 --> 00:05:08,840
because of the size
of the data set.

96
00:05:08,840 --> 00:05:12,140
The table in the
slide gives a summary

97
00:05:12,140 --> 00:05:13,990
of the cost bucket partitions.

98
00:05:13,990 --> 00:05:17,550
Patients with expenses over
$10,000 in the nine-month

99
00:05:17,550 --> 00:05:20,980
period were allocated
to cost bucket 3.

100
00:05:20,980 --> 00:05:23,840
Patients with less
than $2,000 in expenses

101
00:05:23,840 --> 00:05:25,910
were allocated to cost bucket 1.

102
00:05:25,910 --> 00:05:30,170
And the remaining patients with
costs between $2,000 and $10,000

103
00:05:30,170 --> 00:05:31,850
to cost bucket 2.

104
00:05:31,850 --> 00:05:40,360
Please note that the majority
of patients, 4,400 out of 6,500,

105
00:05:40,360 --> 00:05:44,590
or 67.5% of all patients,
fell into the first bucket

106
00:05:44,590 --> 00:05:46,470
of low expenses.