PETER SZOLOVITS: OK. So today and next Tuesday, we're talking about the role of natural language processing in machine learning in health care. And this is going to be a heterogeneous kind of presentation. Today I'm mainly going to talk about work that takes advantage of methods that are not based on neural network representations, and on Tuesday I'll speak mostly about work that does depend on neural network representations, though I'm not sure exactly where the boundary will fall.

I've also invited Dr. Katherine Liao, who will join me in a question-and-answer session and interview like the one we did a couple of weeks ago with David. Kat is a rheumatologist in the Partners HealthCare system, and before we get to the interview you'll actually hear about some of the work we've done together in the past.

So roughly, the outline of these two lectures is this. I want to talk a little bit about why we care about clinical text. Then I'll cover some conceptually very appealing, but practically not very feasible, methods that involve analyzing these narrative texts as linguistic objects, the way a linguist might approach them. Then we'll talk about what is very often done in practice, which is a kind of term-spotting approach. It says: we may not be able to understand exactly everything that goes on in the narratives, but we can identify certain words and certain phrases that are very highly indicative that the patient has a certain disease or a certain symptom, or that some particular thing was done to them. That's a lot of the bread and butter of how clinical research is done nowadays. And then I'll go on to some other techniques.

So here's an example, a discharge summary from MIMIC. When you played with MIMIC, you noticed that it's de-identified, so names and the like are replaced with square-bracket, star-star-star placeholders. Here we replaced those with synthetic names.
So Mr. Blind isn't really Mr. Blind, and November 15 probably really isn't November 15, et cetera. But I wanted something that read like real text. If you look at something like this, you see that Mr. Blind is a 79-year-old white white male -- so somebody repeated a word -- with a history of diabetes mellitus and inferior MI, who underwent open repair of his increased diverticulum on November 13 at some medical center -- again, that's not the name of the actual place. He then developed hematemesis, so he was spitting up blood, and was intubated for respiratory distress, so he wasn't breathing well. These are all really important things about what happened to Mr. Blind, and we'd like to be able to take advantage of them.

In fact, to give you a slightly more quantitative version of this: Kat and I worked on a project back around 2010 where we were trying to understand the genetic correlates of rheumatoid arthritis. We went to the Research Patient Data Repository of Mass General and the Brigham, in the Partners HealthCare system, and asked: who are the patients who have been billed for a rheumatoid arthritis visit? There are many thousands of those people. We then selected a random set of, I think, 400 of those patients, gave them to rheumatologists, and asked: which of these people actually have rheumatoid arthritis? The selection was based on billing codes. So what would you guess is the positive predictive value of having a billing code for rheumatoid arthritis in this data set?

How many people think it's more than 50%? OK, that would be nice, but it's not. How many people think it's more than 25%? God, you guys are getting really pessimistic. Well, it also isn't. It turned out to be something like 19% in this cohort. Now, before you start calling the fraud investigators, you have to ask yourself why this data is so lousy.
And there's a systematic reason: those billing codes were not created to specify what's wrong with the patient. They were created to tell an insurance company or Medicare or somebody how much of a payment is deserved by the doctors taking care of them. What this means is that, for example, if I clutch my chest and go "uh," and an ambulance rushes me over to Mass General and they do a whole bunch of tests and decide that I'm not having a heart attack, the correct billing code for that visit is myocardial infarction. Because of course the work they have to do to figure out that I'm not having a heart attack is the same as the work they would have had to do to figure out that I was having one. So the billing codes -- we've talked about this a little bit before -- are a very imperfect representation of reality.

So we said, well, OK. What if we insisted on three billing codes for rheumatoid arthritis rather than just one? That turned out to raise the positive predictive value all the way up to 27%. So we go, really? How could you get billed three times? Well, the answer is that you get billed for every aspirin you take at the hospital. It's very easy to accumulate three billing codes for the same thing: you go see a doctor, and the doctor bills you for a rheumatoid arthritis visit. He or she sends you to a radiologist to take an X-ray of your fingers and your joints; that bill is another billing code for RA. The doctor also sends you to the lab for a blood draw so they can check your anti-CCP titer; that's another billing code for rheumatoid arthritis. And it may be that all of this is negative and you don't actually have the disease. So this is something that's really important to think about and to remember when you're analyzing these data.
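As a concrete illustration of the check being described here -- only a sketch, with a made-up toy cohort and a hypothetical data layout, not the study's actual code -- this is how you might compute the positive predictive value of having at least k RA billing codes against chart-review labels:

```python
def ppv_of_code_count(patients, k):
    """PPV of 'has >= k RA billing codes' against chart-review labels.

    `patients` is a list of (n_ra_codes, truly_has_ra) pairs -- a
    hypothetical structure; real data would come from the billing system
    and the rheumatologists' gold-standard review.
    """
    flagged = [(n, label) for n, label in patients if n >= k]
    if not flagged:
        return float("nan")
    true_pos = sum(1 for _, label in flagged if label)
    return true_pos / len(flagged)

# Toy numbers only -- the lecture's cohort showed roughly 19% PPV at k=1
# and 27% at k=3.
cohort = [(1, False)] * 60 + [(1, True)] * 14 + [(3, False)] * 19 + [(3, True)] * 7
print(ppv_of_code_count(cohort, 1))  # 0.21 on this toy data
print(ppv_of_code_count(cohort, 3))  # ~0.27
```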
And so we started off in this project saying, well, we need a positive predictive value more on the order of 95%, because we wanted a very pure sample of people who really did have the disease. We were going to take blood samples from those patients, pay a bunch of money to the Broad to analyze them, and then hopefully come up with a better understanding of the relationship between their genetics and their disease. And of course, when we talked to a biostatistician, as we did, they told us that if we have more than about 5% corruption of that database, we're going to get meaningless results from it. So that was the goal.

So what we did was this. If you train a model that tries to tell you whether somebody really has rheumatoid arthritis or not based on just codified data -- codified data being things like lab values, prescriptions, demographics, stuff that is in tabular form -- we were getting a positive predictive value of about 88%. Then we asked: how well could we do by instead looking at the narrative text in nursing notes, doctors' notes, discharge summaries, and various other sources? Could we do as well or better? The answer turned out to be about 89% using only natural language processing on those notes. And not surprisingly, when you put them together, the joint model gave us about 94%. So that was definitely an improvement.

This was published in 2010, so it's not the latest hot-off-the-bench result. But to me it's a very compelling story that says there is real value in these clinical narratives.

OK, so how did we do this? Well, we took about four million patients in the EMR. We selected about 29,000 of them by requiring that they have at least one ICD-9 code for rheumatoid arthritis, or that they'd had an anti-CCP titer done in the lab. And then we -- oh, it was 500, not 400. So we looked at 500 cases, which we got gold-standard readings on.
And then we trained an algorithm that predicted whether each patient really had RA or not, and that flagged about 3,585 cases. We then sampled a validation set of 400 of those, and we threatened our rheumatologists with bodily harm if they didn't read all those cases and give us a gold-standard judgment. No, I'm kidding -- they were actually really cooperative.

There are some details here that you can look at on the slide, and I have a pointer to the original paper if you're interested. But we were looking at ICD-9 codes for rheumatoid arthritis and related diseases. We excluded some ICD-9 codes that fall under the general category of rheumatoid diseases because they're not correct for the sample we were interested in. We dealt with the multiple-coding problem by ignoring codes that happened within a week of each other, so that we didn't get multiple bills from the same visit. Then we looked for electronic prescriptions of various sorts, and for lab tests, mainly RF, rheumatoid factor, and anti-cyclic citrullinated peptide, if I pronounced that correctly. And another thing we found, not only in this study but in a number of others, is that it's very helpful just to count up how many facts the database holds about a particular patient. That's not a bad proxy for how sick they are: if you're not very sick, you tend to have a little bit of data, and if you're sicker, you tend to have more data. So that was the cohort selection.
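Here is a sketch of the two bookkeeping ideas just mentioned -- collapsing billing codes that fall within a week of each other into one coding event, and counting total facts as a crude utilization feature. The record layout is hypothetical:

```python
from datetime import date, timedelta

def distinct_coding_events(code_dates, window_days=7):
    """Collapse billing codes occurring within `window_days` of each other,
    so three bills generated by one visit count as a single event."""
    events = 0
    last_counted = None
    for d in sorted(code_dates):
        if last_counted is None or (d - last_counted) > timedelta(days=window_days):
            events += 1
            last_counted = d
    return events

dates = [date(2009, 3, 2), date(2009, 3, 4), date(2009, 3, 5), date(2009, 6, 20)]
print(distinct_coding_events(dates))  # 2: the three March bills collapse into one

# A crude "how sick is this patient" proxy: count facts of any kind.
record = {"codes": dates, "labs": ["RF", "anti-CCP"], "notes": ["..."] * 12}
n_facts = sum(len(v) for v in record.values())
```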
Then, for the narrative text, we used a system called HITex, built by Qing Zeng and her colleagues at the time. It's definitely not state of the art today, but it extracted entities from narrative text and did a capable job for its era. We ran it over health care provider notes, radiology and pathology reports, discharge summaries, and operative reports. From the same data we also extracted disease diagnosis mentions, medications, lab data, radiology findings, et cetera. And we augmented the list that came with the tool with a hand-curated list of alternative ways of saying the same thing, in order to expand our coverage. We also played with negation detection because, of course, if a note says the patient does not have x, you don't want to conclude the patient has x just because x was mentioned. I'll say a few more words about that in a minute.

So if you look at the model we built using logistic regression, which is a very common method, what you find is that there are positive and negative predictors, and the predictors are actually an interesting mix of ones based on natural language processing and ones that are codified. For example: if a note says the patient has rheumatoid arthritis, that's pretty good evidence that they do. If somebody is characterized as being seropositive, that's again good evidence. And then erosions and so on. But there are also codified things, like if you see that the rheumatoid factor in a lab test was negative, then -- actually, I don't know why that's -- oh, no, that counts against. OK. And then various exclusions. So these were the things selected by our regularized logistic regression algorithm. And I showed you the results before: we were able to get a positive predictive value of about 0.94. Yeah?

AUDIENCE: On the previous slide, you said standardized regression coefficients. Why did you standardize? Maybe I got the words wrong. Just on the previous slide, the--

PETER SZOLOVITS: I think -- so the regression coefficients in a logistic regression are typically just odds ratios, right? They tell you whether something makes a diagnosis more or less likely. And where does it say standardized?

AUDIENCE: [INAUDIBLE].
PETER SZOLOVITS: Oh, regression standardized. I don't know why it says standardized. Do you know why it says standardized?

KATHERINE LIAO: A couple of things. One is, when you run an algorithm on your own data set, you can't port it using the same coefficients, because they're going to be different for each one. So we didn't want people to feel like they could just apply it directly. The other thing is that when you standardize, you can see the relative weight of each coefficient. It's a kind of measure -- not exactly -- of how important each coefficient was. As you can see, we ranked them by the standardized regression coefficient. So NLP RA is up top at 1.11; that has the highest weight. Whereas the other DMARDs add only a little bit.

PETER SZOLOVITS: OK. Yes?

AUDIENCE: The variables like NLP RA, where it says rheumatoid arthritis in the text -- were these presence-of features, or were they counts?

PETER SZOLOVITS: Yeah, presence. So the negation algorithm hopefully would have picked it up if the note said it was absent, and then you wouldn't get that feature. All right?
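To make the standardization point concrete, here is a minimal sketch, on synthetic data, of z-scoring features before a regularized logistic regression so the coefficients land on a common scale and can be ranked, roughly as on the slide. The feature names are hypothetical, and this is not the paper's actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: one row per patient; columns might be NLP mention counts and codified
# facts (hypothetical feature set). y: gold-standard RA labels.
rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(500, 4)).astype(float)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 3).astype(int)

model = make_pipeline(
    StandardScaler(),  # z-score each feature -> comparable coefficients
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
)
model.fit(X, y)
coefs = model.named_steps["logisticregression"].coef_.ravel()
names = ["nlp_ra", "seropositive", "erosions", "rf_negative"]
for name, c in sorted(zip(names, coefs), key=lambda t: -abs(t[1])):
    print(f"{name:>12s}  {c:+.2f}")  # ranked by standardized weight
```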
So here's an interesting thing. This group -- I was not involved in this particular project -- asked: could we replicate the study at Vanderbilt and at Northwestern University? We have colleagues in those places; they also have electronic medical record systems, and they're also interested in identifying people with rheumatoid arthritis. Partners had about 4 million patients, Northwestern had 2.2 million, Vanderbilt had 1.7 million. And we couldn't run exactly the same stuff because, of course, these are different systems. The medications, for example, were extracted from the local EMRs in very different ways. The natural language queries were also extracted in different ways, because Vanderbilt, for example, already had a tool in place that would try to translate any text in their notes into UMLS concepts, which we'll talk about again in a little while.

My expectation, when I heard about this study, was that it would be a disaster -- that it simply would not work, because there are local effects, local factors, local ways that people have of describing patients that I thought would be very different between Nashville, Chicago, and Boston. And much to my surprise, what they found was that, in fact, it kind of worked. The model performance, even taking into account that the way the data was extracted from the notes and clinical systems was different, was fairly similar.

Now, one thing that is worrisome is that the PPV of our algorithm on our data, as they calculated it in this study, came in lower than what we had found. There is a technical reason for it, but it's still disturbing that we got a different result. The technical reason is this: here, the PPV is estimated from a five-fold cross-validation of the data, whereas in our study we had a held-out data set from which we calculated the positive predictive value. So it's a different analysis; it's not that we made some arithmetic mistake.

But this is interesting. If you plot the ROC curves, what you see is that training on Northwestern data and testing on either Partners or Vanderbilt data was not so good, but training on either Partners or Vanderbilt data and testing on any of the others turned out to be quite decent. So there is some generality to the algorithm.
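The two evaluation styles just contrasted, sketched on synthetic data -- this is illustrative only, not the replication study's code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

def ppv(y_true, y_pred):
    """Fraction of predicted positives that are truly positive."""
    pred_pos = y_pred == 1
    return y_true[pred_pos].mean() if pred_pos.any() else float("nan")

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(size=400) > 0).astype(int)
clf = LogisticRegression()

# Style 1: PPV on a held-out set (as in the original Partners study).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
print(ppv(y_te, clf.fit(X_tr, y_tr).predict(X_te)))

# Style 2: PPV from five-fold cross-validation (as in the replication paper).
print(ppv(y, cross_val_predict(clf, X, y, cv=5)))
```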
All right, I'm going to switch gears for a minute. This is from an old paper by Barrows from 19 years ago. He was reading nursing notes in an electronic medical records system, and he came across a note with exactly the text on the left-hand side -- except it wasn't nicely separated into separate lines; it was all run together. So what does that mean? Anybody have a clue? I didn't when I was looking at it.

So here's the interpretation. That's a date. IPN stands for intern progress note. SOB -- that's not what you think it means; it's shortness of breath. And DOE is dyspnea on exertion, difficulty breathing when you're exerting yourself, which has decreased, presumably from some previous assessment. The patient's vital signs are stable, so VSS. And the patient is afebrile, AF. Et cetera.

So this is harder than reading the Wall Street Journal, because the Wall Street Journal is meant to be readable by anybody who speaks English, and this is probably not meant to be readable by anybody except the person who wrote it, or maybe their immediate friends and colleagues. So this is a real issue, and one that we don't have a very good solution for yet.
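Here is a toy dictionary-based expander for abbreviations like the ones on the slide. A lookup table alone isn't enough in practice -- many abbreviations are ambiguous and need context, and the punctuation handling here is crude -- but it shows the basic idea:

```python
# Toy expander for note abbreviations like those on the slide.
ABBREV = {
    "IPN": "intern progress note",
    "SOB": "shortness of breath",
    "DOE": "dyspnea on exertion",
    "VSS": "vital signs stable",
    "AF": "afebrile",  # elsewhere AF often means atrial fibrillation!
}

def expand(note: str) -> str:
    # Strips trailing punctuation before lookup, so expansion is lossy.
    return " ".join(ABBREV.get(tok.strip(",.;"), tok) for tok in note.split())

print(expand("IPN - SOB, DOE decreased. VSS AF."))
```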
Now, what do you use NLP for? Well, I mentioned that one of the things we want to do is to codify things that appear in a note. If it says rheumatoid arthritis, we want to say that's equivalent to a particular ICD-9 code. We might want to use natural language processing for de-identification of data. I mentioned that before: with MIMIC, the only way that Roger Mark's group got permission to release that data and make it available for people like you to use was by persuading the IRB that we had done a good enough job of getting rid of all the identifying information in all of those records, so that it's probably not technically impossible, but very difficult, to figure out who the patients in that cohort, in that database, actually were. And the reason we ask you to sign a data use agreement is to deal with that residual risk -- re-identification that is difficult but not necessarily impossible, because of correlations with other data. Then you have little problems like "Mr. Huntington suffers from Huntington's disease," in which the first Huntington is protected health information, because it's a patient's name, while the second Huntington is actually an important medical fact. So you wouldn't want to get rid of that one.

You want to determine aspects of each entity: its time, its location, its degree of certainty. You want to look for relationships between different entities identified in the text. For example, does one precede another? Does it cause it, treat it, prevent it, indicate it, et cetera? There's a whole bunch of relationships like that that we're interested in. Also, for certain kinds of applications, what you'd really like to do is identify what part of a textual record addresses a certain question. Even if you can't tell what the answer is, you should be able to point to a piece of the record and say: this tells me about, in this case, the patient's exercise regimen. And then summarization is a very real challenge as well, especially because of the cut-and-paste that has come about with these electronic medical record systems: when a nurse is writing a new note, it's tempting, and supported by the system, to just take the old note, copy it over, and maybe make a few changes. That means the notes are very repetitive -- the same stuff is recorded over and over again -- and sometimes that's not even appropriate, because they may not have changed everything that needed to be changed.

The other thing to keep in mind is that there are two very different kinds of tasks. If I'm doing de-identification, I essentially have to look at every word in a narrative to see whether it's protected health information.
But there are often aggregate judgments to make, where many of the words don't make any difference. For example, one of the first challenges that we ran, back in 2006, gave people narrative text records from a bunch of patients and asked: is this person a smoker? Well, you can imagine that certain words are very helpful, like "smoker" or "tobacco user" or something like that. But even those are sometimes misleading. For example, we saw somebody who happened to be a researcher working on tobacco mosaic virus who was not a smoker. And then you have interesting cases like "the patient quit smoking two days ago." Really? Are they a smoker or not? Aggregate judgment also covers things like cohort selection, where you don't need to know every single thing about the patient; you just need to know whether they fit a certain pattern.

So let me give you a little historical note. This happens to be work that was done by my PhD thesis advisor, the gentleman whose picture is on the slide there. He published a paper in 1966 called English for the Computer in the Proceedings of the Fall Joint Computer Conference -- the big computer conference of the 1960s. His idea was that the way to process English is to assume that there is a grammar; any English text that you run across, you parse according to this grammar, and each parsing rule corresponds to some semantic function. So the picture that emerges is one like this: if you have two phrases with some syntactic relationship between them, you can map each phrase to its meaning, and the semantic relationship between those two meanings is determined by the syntactic relationship in the language. This seems like a fairly obvious idea, but apparently nobody had tried it on a computer before. And so Fred built, over the next 20 years, computer systems -- some of which I worked on -- that tried to follow this method.
He was, in fact, able to build systems that were used by researchers in areas like anthropology, where you don't have nice coded data and a lot of material is narrative text. He helped one anthropologist I worked with at Caltech analyze a database of about 80,000 interviews he had done with members of the Gwembe Tonga tribe, who lived in the valley that is now flooded by the reservoir on the Zambezi River, on the border of Zambia and Zimbabwe. That was fascinating, and he became very well known for some of that research.

In the 1980s, I was amused to see that SRI -- which doesn't stand for anything now, but used to stand for Stanford Research Institute -- built a system called Diamond Diagram, which was intended to help people interact with a computer system when they didn't know its command language. They could express what they wanted to do in English, the English would be translated into some semantic representation, and from that the right thing was triggered in the computer. So these guys, Walker and Hobbs, said, well, why don't we apply this idea to natural language access to medical text? They built a system that didn't work very well, but it tried to do this by essentially translating the English it was reading into a formal predicate calculus representation of what it saw, and then processing that.

The original Diamond Diagram system, built for naive computer users who didn't know command languages, actually had a very rigid syntax. And what they discovered is that people are more adaptable than computers, and that people could adapt to this rigid syntax. How many of you have Google Home or Amazon Echo or Apple something-or-other that you deal with? Well, it's training you, right? Because it's not very good at letting you train it, but you're more adaptable.
And so you quickly learn that if you phrase things one way, it understands you, and if you phrase things a different way, it doesn't, and you learn how to phrase it. That's what these guys were relying on: that they could get people to adopt the conventions the computer is able to understand.

The most radical version of this was a guy named de Heaulme, whom I met in 1983 in Paris. He was a doctor at La Pitié-Salpêtrière, one of these medieval hospitals in Paris. It's a wonderful place, although when they built it, it was just a place to die, because they really couldn't do much for you. So de Heaulme convinced the chief of cardiology at that hospital that he would develop an artificial language for taking notes about cardiac patients. He would teach it to all of the fellows and junior doctors in the cardiology department, and they would be required by the chief -- who is very powerful in France -- to use this artificial language to write notes instead of French. And they actually did this for a month. When I met de Heaulme, he was in the middle of analyzing the data he had collected. And what he found was that the language was not expressive enough: there were things people wanted to say that they couldn't say in the artificial language he had created. So he went back to create version two, and then he went back to the cardiologists and said, well, let's do this again. And then they threatened to kill him. So the experiment was not repeated.

OK, so back to term spotting. Traditionally, if you were trying to do this, you would sit down with a bunch of medical experts and say: tell me all the words you think might appear in a note that are indicative of some condition I'm interested in. They would give you a long list, and then you'd do grep -- you'd search through the notes for those terms.
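The grep step in miniature, with a hypothetical expert-supplied term list:

```python
import re

# Hand-curated term list from the experts (hypothetical examples).
TERMS = ["rheumatoid arthritis", "joint erosions", "seropositive", "anti-CCP"]
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, TERMS)) + r")\b", re.I)

def spot(note: str):
    """Return every expert term that appears in the note."""
    return [m.group(0) for m in PATTERN.finditer(note)]

print(spot("Longstanding seropositive rheumatoid arthritis; anti-CCP positive."))
```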
And if you want to be really sophisticated, you would use an algorithm like NegEx, which is a negation expression detector that helps get rid of things that are not true. And then, as people did this, they said, well, there must be more sophisticated ways of doing this. And so a whole industry developed around the idea that we shouldn't only use the terms we originally got from the doctors interested in these queries; we can define a machine learning problem, namely: how do we learn the set of terms we should actually use that will give us better results than just the terms we started with? I'm going to talk about a little bit of that approach.

First of all, for negation: Wendy Chapman, now at Utah but at the time at Pittsburgh, published a paper in 2001 called A Simple Algorithm for Identifying Negated Findings and Diseases in Discharge Summaries. And it is indeed a very simple algorithm. Here's how it works. You find all the UMLS terms in each sentence of a discharge summary -- I'll talk a little bit about that, but basically it's a dictionary lookup: you look up phrases in this very large database of medical terms and translate them into some kind of expression that represents what each term means. And then you find two kinds of patterns. One pattern is a negation phrase followed within five words by one of these UMLS terms. The other is a UMLS term followed within five words by a negation phrase, from a different set of negation phrases. So if you see "no sign of" something, that means it's not present. If you see "ruled out" or "unlikely" something, it's not present. "Absence of," "not demonstrated," "denies," et cetera. And there are post-modifiers: if you say something "declined" or something "unlikely," that also indicates that it's not present.
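A minimal sketch of those two patterns. The phrase lists here are heavily abbreviated stand-ins for the paper's lists, and the exception handling discussed next is omitted:

```python
import re

PRE_NEG = {"no", "denies", "ruled", "absence", "without", "not"}  # abbreviated
POST_NEG = {"unlikely", "declined", "ruled"}                      # abbreviated
FINDINGS = {"hematemesis", "fever", "pneumonia"}  # stand-ins for UMLS terms

def negated_findings(sentence: str, window: int = 5):
    """Flag findings with a negation phrase within `window` words before
    or after them -- the two NegEx patterns, minus the exception list."""
    words = re.findall(r"[a-z\-']+", sentence.lower())
    hits = set()
    for i, w in enumerate(words):
        if w in FINDINGS:
            before = words[max(0, i - window):i]
            after = words[i + 1:i + 1 + window]
            if set(before) & PRE_NEG or set(after) & POST_NEG:
                hits.add(w)
    return hits

print(negated_findings("The patient denies fever and has no hematemesis."))
# -> {'fever', 'hematemesis'}
```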
And then they hacked up a bunch of exceptions. For example, if you say "gram negative," that doesn't mean it's negative for whatever follows or precedes it. Et cetera -- there are a bunch of exceptions. And what they found is that this, considering how incredibly simple it is, actually does reasonably well. They looked at 500 sentences: for those that don't contain one of these negation phrases, you get a sensitivity and specificity of 88% and 52%; of course, the baseline there is a sensitivity of 0 and a specificity of 100%. And if you use NegEx, what you find is that you can significantly improve the specificity over the baseline, and you wind up with a better result, although not in every case. So what this means is that very simplistic techniques can actually work reasonably well at times.

So how do we do this generalization? One way is to take advantage of related terms like hyponyms or hypernyms -- things that are subcategories or supercategories of a word. You might also look for other associated terms. For example, if you're looking to see whether a patient has a certain disease, you can do a little bit of diagnostic reasoning and say: if I see a lot of symptoms of that disease mentioned, then maybe the disease is present as well. So the recursive machine learning problem is how best to identify the things associated with the term, and this is generally known as phenotyping.

Now, how many of you have used the UMLS? Just a few.
So in 1985 or '84, the newly appointed director of the National Library of Medicine, which is one of the NIH institutes, decided to make a big investment in creating this Unified Medical Language System. It was an attempt to take all of the terminologies that various medical professional societies had developed and unify them into a single, what they called a meta-thesaurus. It's not really a thesaurus, because it's not completely well integrated, but it does include all of this terminology. They then spent a lot of both human and machine resources to identify cases in which two different expressions from different terminologies really meant the same thing. For example, myocardial infarction and heart attack mean exactly the same thing, and in some terminologies it's called acute myocardial infarction or acute infarct or acute whatever. They paid people, and they paid machines, to scour those entire databases and come up with a mapping that said: OK, we're going to have some concept, say C398752 -- I just made that up -- which corresponds to that particular concept. And then they mapped all of those expressions together.

That's an enormous help in two ways. It helps you normalize databases that come from different places and are described differently. And for natural language processing, it gives you a treasure trove of ways of expressing the same conceptual idea, which you can then use to expand the kinds of phrases you're looking for. As of the current moment, there are about 3.7 million distinct concepts in this concept base. There are also hierarchies and relationships imported from all these different sources of terminology, but those are a pretty jumbled mess.
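The payoff of the meta-thesaurus, in miniature: many surface strings map to one concept unique identifier (CUI). C0027051 and C0003873 are, to the best of my knowledge, the real CUIs for myocardial infarction and rheumatoid arthritis; treat the rest of this toy table as illustrative:

```python
# Miniature version of what the meta-thesaurus buys you: many surface
# forms, one concept identifier.
CUI = {
    "myocardial infarction": "C0027051",
    "acute myocardial infarction": "C0027051",
    "heart attack": "C0027051",
    "rheumatoid arthritis": "C0003873",
}

def to_concept(term: str) -> str:
    """Map a surface form to its concept identifier, if we know one."""
    return CUI.get(term.lower(), "UNMAPPED")

assert to_concept("Heart attack") == to_concept("Myocardial Infarction")
```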
And then, over the whole thing, they created a semantic network: there are 54 relations and 127 semantic types, and every concept unique identifier is assigned at least one semantic type. This is very useful for looking through this stuff. Here are the UMLS semantic types. You see that the most common semantic type is T061, which stands for therapeutic or preventive procedure; there are 260,000 of those concepts in the meta-thesaurus. There are 233,000 findings, 172,000 drugs, organic chemicals, pharmacological substances, amino acid/peptide/protein, invertebrate. So the data does not come only from human medicine but also from veterinary medicine and bioinformatics research and all over the place. But you can see that these are a useful listing of semantic types that you can then look for in such a database.

And the types are hierarchically organized. For example, the relations are organized so that there's an "affects" relation with sub-relations: manages, treats, disrupts, complicates, interacts with, or prevents. Something like biological function can be a physiologic function or a pathologic function, and again, each of these has subcategories. So the idea is that each unique concept is labeled with at least one of these semantic types, and that helps to identify things when you're looking through the data.

There are also some tools that deal with the typical linguistic problems: if I want to say bleeds or bleed or bleeding, those are really all the same concept, and there's a lexical variant generator that helps normalize that. And then there's a normalization function that takes a statement like "Mr. Huntington was admitted," blah, blah, blah, and normalizes it into lowercase, alphabetized versions of the text, in which words are translated into their other potential linguistic meanings. For example, notice that the text says "was," but one of its translations is "be," because "was" is just a form of "be." This can also get you in trouble: I ran into a problem where I was finding beryllium in everybody's medical records, because the tool also knows that "Be" is an abbreviation for beryllium. So you have to be a little careful about how you use this stuff.
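A rough, hypothetical approximation of the norm behavior just described -- lowercase, undo a few inflections, drop stop words, alphabetize -- with the beryllium-style hazard noted in a comment. The real lvg/norm tools are far more complete:

```python
# A rough approximation of UMLS "norm": lowercase, strip punctuation,
# undo a few inflections, drop stop words, alphabetize the rest.
STOP = {"the", "a", "an", "of", "mr", "mrs"}
LEMMA = {"was": "be", "admitted": "admit", "bleeds": "bleed", "bleeding": "bleed"}

def norm(text: str) -> str:
    toks = [t.strip(".,;:") for t in text.lower().split()]
    toks = [LEMMA.get(t, t) for t in toks]
    return " ".join(sorted(t for t in toks if t and t not in STOP))

print(norm("Mr. Huntington was admitted"))  # -> 'admit be huntington'

# The hazard: a careless abbreviation table that also maps "be" to
# "beryllium" would now fire on every normalized "was".
```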
So for example, notice that this one says "was," but one of its translations is "be," because "was" is just a form of "be." This can also get you in trouble. I ran into a problem where I was finding beryllium in everybody's medical records, because the tool also knows that "Be" is the abbreviation for beryllium. And so you have to be a little careful about how you use this stuff.

There is an online tool where you can type in something like "weakness of the upper extremities," and it says, oh, you mean the concept proximal weakness, upper extremities. And then it has relationships to various contexts, and it has siblings, and it has all kinds of other things that one can look up.

I built a tool a few years ago where, if you populate it with one of the short summaries, it tries to color code the types of things that it found in that summary. This is using a tool called MetaMap, which again comes from the National Library of Medicine, and a locally built UMLS lookup tool that, in this particular case, finds exactly the same mappings from the text. And so you can look through the text and say, ah, OK, so "no" indicates negation, and "urine output" is one of these concepts. If you moused over it, it would show you.

OK, I think what I'm going to do is stop there today so that I can invite Kat to join us and talk about, A, what's happened since 2010, and, B, how is this stuff actually used by clinicians and clinician researchers. Kat? OK, well, welcome, Kat.

KATHERINE LIAO: Thank you.

PETER SZOLOVITS: Nice to see you again. So are the techniques that were represented in that paper from nine years ago still being used today in research settings?

KATHERINE LIAO: Yeah, I'd say yes, the bare bones of the platform -- that pipeline -- is being used. But now I'd say we're in version five. Actually, you were on that revision list. But we've done a lot of improvements to automate things a little more.

So the rate-limiting factor in phenotyping is always the clinician: always getting that label, doing the chart review, coming up with that term list. I don't know if you want me to go into some of the details on what we've been doing.

PETER SZOLOVITS: Yeah, if you would.

KATHERINE LIAO: Kind of plug this in. So if you recall that diagram, there were several steps, where you started with the EMR. There was that filter with the ICD codes. Then you get this data mart, and then you start training. You had to select a random 500, which is a lot. It's a lot of chart review to do. It is a lot. So our goal was to reduce that amount of chart review. And part of the way to reduce that is reducing the feature space. One of the things that we didn't know when we first started out was how many gold standard labels we needed, how many features we needed, and which of those features would be important. By features, I mean ICD codes -- a diagnosis code -- medications, and all that list of NLP terms that might be related to the condition. And so now we have ways to try to whittle down that list before we even use those gold standard labels.

So let me think about -- this is NLP; the focus here is on NLP. There are a couple of ways we're doing this. One rate-limiting step was getting the clinicians to come up with a list of terms that are important for a certain condition. You can imagine, if you get five doctors in a room to try to agree on a list, it takes forever. And so we tried to get that out of the way. One thing we started doing was to take common sources that are freely available on the web -- Wikipedia, Medline, the Merck Manual -- that have medical information. And we actually now process those articles, look for medical terms, pull those out, map them to concepts, and that becomes that term list.
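A hedged sketch of that idea, reusing the string_to_cuis map from the MRCONSO sketch earlier; the real pipeline uses a proper NLP engine rather than naive window matching, and the majority-vote filter in the second function is the three-out-of-five rule Kat describes next:

```python
from collections import Counter

def candidate_concepts(article_text, string_to_cuis, max_words=4):
    """Crude term spotting: slide a window of 1..max_words words over
    the article and look each phrase up in the string -> CUI map
    built from MRCONSO.RRF above."""
    words = article_text.lower().split()
    found = set()
    for i in range(len(words)):
        for j in range(i + 1, min(i + 1 + max_words, len(words) + 1)):
            found |= string_to_cuis.get(" ".join(words[i:j]), set())
    return found

def term_list(articles, string_to_cuis, min_votes=3):
    """Majority vote across reference articles (e.g. Wikipedia,
    Medline, the Merck Manual): keep a concept only if at least
    min_votes of the articles mention it."""
    votes = Counter()
    for text in articles:
        votes.update(candidate_concepts(text, string_to_cuis))
    return {cui for cui, n in votes.items() if n >= min_votes}
```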
KATHERINE LIAO: Now, that goes into -- so now, instead of -- if you think about the old days, we came up with the list; we had ICD lists and term lists, which got mapped to a concept. Now we go straight to the articles. We kind of do majority voting with the articles: we take five articles, and if three out of five mention a term more than x amount of times, we say it could potentially be important. So that's the term list. Get the clinicians out of that step.

Well, actually, we don't train yet. So now, instead of training right away on the gold standard labels, we train on a silver standard label. Most of the time we use the main ICD code, but sometimes we use the main NLP [INAUDIBLE] because sometimes there is no code for the phenotype we're interested in. So those are some of the steps that we've taken to automate things a little bit more and formalize that pipeline.

In fact, the pipeline is now part of the Partners Biobank, which is a Partners HealthCare effort -- as Pete mentioned, that's Mass General and Brigham and Women's Hospital. They are recruiting patients to come in and give a blood sample and link it with their notes, so people can do research on linked EHR data and blood samples. So this is the pipeline they use for phenotyping. Now I'm over at the Boston VA along with Tianxi, and this is the pipeline we're also laying down for the Million Veteran Program, which is even bigger. It's a million vets, and they have EHR data going back decades. So it's pretty exciting.

PETER SZOLOVITS: So what are the kinds of -- I mean, this study that we were talking about today was for rheumatoid arthritis. What other diseases are being targeted by this phenotyping approach?

KATHERINE LIAO: All kinds of diseases. There's a lot of things we've learned, though. The base pipeline is best suited for conditions that have a prevalence of 1% or higher, so rheumatoid arthritis is kind of at that lower bound. Rheumatoid arthritis is a chronic inflammatory joint disease. It affects 1% of the population, but it is the most common autoimmune joint disease. Once you go to rare diseases that are episodic -- not only is the prevalence below 1%, but the disease only happens once in a while -- this type of approach is not as robust. But most diseases are above 1%. So at the VA, we've laid down this pipeline for a phenomic score, and they're running through acute stroke, myocardial infarction, diabetes -- really a lot of all the common diseases that we want to study.

PETER SZOLOVITS: Now, you were mentioning that when you identify such a patient, you then try to get a blood sample so that you can do genotyping on them. Is that also common across all these diseases, or are there different approaches?

KATHERINE LIAO: Yeah, so it's interesting. Ten years ago, it was very different. It was very expensive to genotype a patient -- anywhere between $500 to $700 per patient.

PETER SZOLOVITS: And that was just for single nucleotide polymorphisms.

KATHERINE LIAO: Yes, just for SNPs. So we had to be very careful about who we selected. Ten years ago, what we did is we said, OK, we have 4 million patients at Partners. Who already has the disease with good certainty? Then we select those patients and we genotype them. Because it cost so much, you didn't want to genotype someone who didn't have RA. Not only would it reduce the power of our association study, it would just be wasted dollars.

The interesting thing is that a change has happened, and we can think of a completely different way of approaching things. Now you have these biobanks -- something like the VA MVP or the UK Biobank. Patients are being systematically recruited, blood samples are taken, and they're genotyped with no particular study in mind, linked with the EHR. So now I walk into the VA, and it's a completely different story. Ten years later, I'm at the VA and I'm interested in identifying rheumatoid arthritis. Interestingly enough, this algorithm ports well over there, too, though now we've tested our new method there. But now, instead of saying, I need to identify these patients and get them genotyped, all the genotypes are already there. So it's a completely different approach to research now.

PETER SZOLOVITS: Interesting. So the other question that I wanted to ask you before we turn it over to questions from the audience is, this is all focused on research uses of the data. Are there clinical uses that people have adopted that use this kind of approach to trying to read the notes? We had fantasized decades ago that, you know, when you get a report from a pathologist, somehow or other a machine learning algorithm using natural language processing would grovel over it, identify the important things that came out, and then either incorporate that in decision support or in some kind of warning system that drew people's attention to the important results as opposed to the unimportant ones. Has any of that happened?

KATHERINE LIAO: I think we're not there yet, but I feel like we're so much closer than we were before. That's probably how you felt a few decades ago. One of the challenges is, as you know, EHRs weren't really widely adopted until the HITECH Act in 2009. So a lot of systems are actually just now getting their EHRs. And the reason that we've had the luxury of playing around with the data is because Partners was ahead of the curve and had developed an EHR, and the VA happened to have an EHR.

But research and clinical medicine are very different. In research, if you mess up and you misclassify someone with a disease, it's OK, right? You just lose power in your study. But in the clinical setting, if you mess up, it's a really big deal. So I think the bar is much higher.

And so one of our goals with all this phenotyping is to get it to the point where we feel pretty confident. We're not going to say someone has or doesn't have a disease, but Tianxi and I have been planning a grant where what's output from this algorithm is a probability of disease. And some of our phenotype algorithms are pretty good. So what we want to test is at what probability threshold you would want to tell a clinician, hey, if you're not thinking about rheumatoid arthritis in this patient, you should be thinking about it, and maybe considering referring them or speaking to a rheumatologist through telehealth. This is particularly helpful in remote locations where there aren't rheumatologists available. There are a lot of things changing that are making something like this fit much more into the workflow.

PETER SZOLOVITS: Yeah. So you're as optimistic as I was in the 1990s.

KATHERINE LIAO: Yes. I think we're getting -- we'll see.

PETER SZOLOVITS: Well, you know, it will surely happen at some point. Did any of you go to the festivities around the opening of the Schwarzman College of Computing? They had a lot of discussions, and health care does keep coming up over and over again as one of the great opportunities. I profoundly believe that. But on the other hand, I've learned over many decades not to be quite as optimistic as my natural proclivities are. And I think some of the speakers here have not yet learned that same lesson, so things may take a little bit longer. So let me open up the floor to questions.

KATHERINE LIAO: Yes?

AUDIENCE: So the mapping that you did to concepts, is that within the Partners system, or is that something that's publicly available? Can you just transfer that to the VA?
Or, when you do work like this, how much is proprietary and how much gets opened up?

KATHERINE LIAO: Yeah. So you're speaking about when we were trying to create that term list and we mapped the terms to the concepts?

AUDIENCE: And you were using Wikipedia and three other sources.

KATHERINE LIAO: Yeah, that's all out there. As an academic group, we try to publish everything we do. We put our code up on GitHub or CRAN for other people to play with and test and break. And the terms are all there in UMLS -- I don't know if you had a chance to look through it; they have a lot of keywords. So there is a general way to map keywords to terms to concepts, and that's the basis of what we do. There may be a little bit more there, but there's nothing fancy behind it. And as you can imagine, because we're trying to go across many phenotypes, when we think about mapping, it always has to be automated. Our first round was very manual -- incredibly manual. But now we try to use systems that are available, such as UMLS, and other mapping methods.

PETER SZOLOVITS: So what do you map with -- presumably, you don't use HITEx today.

KATHERINE LIAO: No.

PETER SZOLOVITS: So which tools do you use?

KATHERINE LIAO: I'm just thinking, I had a two-hour conversation with Oak Ridge about this. We're using a system that Sheng developed called NILE. And it had to do with the fact that cTAKES, which is a really robust system, was just too computationally intensive. For the purposes of phenotyping, we didn't need that level of detail. What we really needed was: was it mentioned, what's the concept, and the negation. So NILE is something that we've been using and have kind of validated over time with the different methods we've been testing.

PETER SZOLOVITS: So Tuesday, I'll talk a little bit about that system and some of its successors, so you'll get a sense of how that works. I should mention also that one of the papers on your reading list is a paper out of David Sontag's group which uses this notion of anchors. And that's very much along the same lines. It's a way of trying to automate, just as Kat was saying: if the doctors mention some term, and you discover that that term is very often used with certain other terms -- by looking at Wikipedia or at the Mayo Clinic data or wherever your sources are -- then that's a good clue that that other term might also be useful. So this is a formalization of that idea as a machine learning problem. Basically, that paper talks about how to take some very certain terms that are highly indicative of a disease and then use those as anchors in order to train a machine learning model that identifies more terms that are also likely to be useful.

So this is the notion -- and David talked about a similar idea in a previous lecture -- of a silver standard instead of a gold standard. The silver standard can be derived from a smaller gold standard using some machine learning algorithm, and then you can use that in your further computations.

AUDIENCE: So what was the process like for partnering with academics in machine learning? Did you seek them out? Did they seek you out? Did you run into each other at the bus stop? How does that work?

KATHERINE LIAO: Well, I was really lucky. There was a big study called the Informatics for Integrating Biology and the Bedside project, called i2b2, led by Zak Kohane. And so that was already in place, and Pete had already been pulled in, and Tianxi. So what they basically did was lock all of us in a room for three hours every Friday. And it was like: what's the problem, what's the question, and how do we get there?

And so I think that infrastructure was so helpful in bringing everyone to the table, because it's not easy -- you're not rotating in the same spaces, and the way you think is very different. So that's how we did it. Now it's more mainstream. I think when we first started, my colleagues joked with me. They're like, what are you doing? R2-D2? What's going on? Are you going off the deep end over there? Because, you know, the type of research we did was more along the lines of clinical trials and clin-epi projects. But now, I run a core at Brigham -- it's run out of the rheumatology division -- and we kind of try to connect people together. I did post the consulting session here to our core. But you know, if there is interest, there are probably more groups doing this, where we can more formally have joint talks or connect people together.

Yeah. But it's not easy. I have to say, it takes a lot of time. Because when Pete put up that slide in what looked like a different language -- I mean, it didn't even occur to me that it was hard to read, right? You're in these two different worlds, and so you have to work to meet in the middle, and it takes time.

PETER SZOLOVITS: It also takes the right people. So I have to say that Zak was probably very clever in bringing the right people to the table and locking them into that room for three hours at a time. Because, for example, our biostatistician, Tianxi Cai -- you know, she speaks AI, or she has learned to speak AI. And there are still plenty of statisticians who just have allergic reactions to the kinds of things that we do, and it would be very difficult to work with them. So having the right combination of people is also, I think, really critical.
KATHERINE LIAO: As one of my mentors said, you have to kiss a lot of frogs.

AUDIENCE: I'm wondering if you could say a bit more about how you approach alarm fatigue -- how you balance [INAUDIBLE] questions around how certain you are versus clinical questions of how important this is, versus even psychological questions of, if you say it too often to a certain number of people, they're going to start [INAUDIBLE]?

KATHERINE LIAO: Yeah, you've definitely hit the nail on the head of one of the major barriers -- or several things. Alarm fatigue is one of them. So EMRs became more prominent in 2010. But along with EMRs came a lot of regulations on physicians, and then came getting rid of our old systems for these new systems that are now government compliant. So Epic is this big monster system that's being rolled out across the country. It's so complicated that in places like Mayo, they hire scribes: the physician sits in the office, and there's another person who actually listens in and types and then clicks all the buttons that you need to get the information in there. So alarm fatigue is definitely one of the barriers. But the other barrier is the fact that the EMRs are so user-unfriendly now. They're not built for clinical care; they're built for billing. We have to be careful about how we roll this out. And that's one reason why I think things have been held up, actually -- not necessarily the science. The implementation part is going to be very hard.

PETER SZOLOVITS: So that isn't new, by the way. I remember a class I taught in biomedical computing about 15 years ago. David Bates, who's the chief of general internal medicine or something at the Brigham, came in and gave a guest lecture. And he was describing their experience with a drug-drug interaction system that they had implemented. They had purchased a data set from a vendor called First Databank that had scoured the literature and found all the instances where people had reported cases in which a patient taking both this medication and that medication had an apparent adverse event -- so there was some interaction between them. They bought this thing, they implemented it, and they discovered that, on the majority of drug orders that they were making through their pharmacy system, a big red alert would pop up saying, you know, are you aware of the fact that there is a potential interaction between this drug and some other drug that this patient is taking?

And the problem is that the incentives for the company that curated this database were to make sure they didn't miss anything, because they didn't want to be responsible for failing to alarm. But of course, there's no pushback saying that if you warn on every second order, then no one's going to pay any attention to any of them. And so David's solution was to get a bunch of the senior doctors together, and they did some study of what actual adverse events they had experienced at the hospital. And they cut this list of thousands of drug interactions down to 20. And they said, OK, those are the only ones we're going to alarm on.

KATHERINE LIAO: And then they threw that out when Epic came in. So now I put in an order, I get a list of like 10, and I just click through them all. So that's the problem. And the threshold is going to be -- I think there's going to be entire methods development that's going to have to happen between figuring out where that threshold is and the fatigue from the alarms.

AUDIENCE: I have two questions. One is about [INAUDIBLE]. How did you approach that? Because we've talked about this in other contexts in class. And the other one is, how can you inform other countries [INAUDIBLE] done here? Because, I mean, at the end of the day, it's a global health issue. And drug systems are different even between the US and the UK. So all the mapping we're doing here, how could that inform EHRs elsewhere?

KATHERINE LIAO: Yeah. So let me answer the first one; the second one is a work in progress. So ICD-10 came to the US on October 1, 2015. I remember -- it hurt us all. So we actually don't have that much information on ICD-10 yet, but it's definitely impacted our work. If you think about when Pete was pointing to the number of ICD counts for ICD-9 -- for those of you who don't know, ICD-9 was developed decades ago, ICD-10 maybe two decades ago. What ICD-10 did was add more granularity. So for rheumatoid arthritis -- I mentioned it's a systemic chronic inflammatory joint disease -- we used to have a code that said rheumatoid arthritis. In ICD-10, it now says rheumatoid arthritis, rheumatoid factor positive; rheumatoid arthritis, rheumatoid factor negative. And under each category is RA of the right wrist, RA of the left wrist, RA of the right knee, left knee. Can you imagine? So we're clicking off all of these.

And so, as it turns out -- we're about to publish a small study now on whether RA coding is any more accurate now that they have all this granularity. It turns out -- I think we got annoyed -- it's actually less accurate now than the ICD-9. So that's one thing. But that's, you know, only two or three years of data; I think it's going to become pretty equivalent. The other thing is, you'll see an explosion in the number of ICD codes. So you have to think about how you deal with data from before October 1, 2015, when you had one RA code, versus after 2015, when it depends on when the patient comes in. They may have RA of the right wrist on one day, then of the left knee the other day, and that looks like a different code. So right now, we have to think of systematic ways to roll these up.
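A minimal sketch of that roll-up, assuming only the standard RA code families (714.x in ICD-9-CM, M05.x and M06.x in ICD-10-CM); the harder cross-version harmonization Kat describes next is what the CMS mappings are for:

```python
# Roll granular billing codes up to one phenotype-level count.
# 714.x is the ICD-9-CM rheumatoid arthritis family; M05.x and
# M06.x are the ICD-10-CM RA families.
RA_PREFIXES = {"9": ("714",), "10": ("M05", "M06")}

def ra_code_count(patient_codes):
    """patient_codes: iterable of (icd_version, code) pairs, e.g.
    ("9", "714.0") or ("10", "M06.9"). RA of the right wrist on one
    day and of the left knee the next both count toward the same
    single RA phenotype rather than looking like two diseases."""
    return sum(
        1
        for version, code in patient_codes
        if code.startswith(RA_PREFIXES.get(version, ()))
    )

# ra_code_count([("10", "M05.79"), ("10", "M06.9"), ("9", "714.0"),
#                ("10", "E11.9")]) -> 3; the E11.9 diabetes code
# is ignored.
```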
KATHERINE LIAO: I think the biggest challenge right now is the mapping. ICD-9, you know, doesn't map directly to ICD-10 or back, because there were diseases that we didn't know about when they developed ICD-9 that exist in ICD-10, and ICD-10 talks about diseases in ways that weren't described in ICD-9. So when you're trying to harmonize the data -- and this is actively something we're dealing with right now at the VA -- how do you now count the ICD codes? How do you decide that someone has an ICD code for RA? Those are all things that are being developed now. CMS, the Centers for Medicare & Medicaid Services -- again, this is for billing purposes -- has come up with a mapping system that many of us are using now, given what we have.

PETER SZOLOVITS: And by the way, the committee that is designing ICD-11 has been very active for years. So there is another one coming down the pike. Although, from what I understand--

KATHERINE LIAO: Are you involved with that?

PETER SZOLOVITS: No. But Chris Chute is, or was.

KATHERINE LIAO: Yes, I saw. I said, don't do it.

PETER SZOLOVITS: Well, but actually, I'm a little bit optimistic, because unlike the traditional ICD system, this one is based on SNOMED, which has a much more logical structure. So you know, my favorite ICD-10 code is closed fracture of the left femur due to spacecraft accident.

KATHERINE LIAO: I didn't even know that existed.

PETER SZOLOVITS: As far as I know, that code has never been applied to anybody. But it's there just in case. Yeah.

AUDIENCE: So wait, you don't think ICD-11 will take that long to come into existence because it's a more logical system?

PETER SZOLOVITS: So ICD-11 -- well, I don't know what it's going to be, because they haven't defined it yet. But the idea behind SNOMED is that it's more a combinatorial system. It's more like a grammar of descriptions that you can assemble according to certain rules about which assemblies make sense. And so that means you don't have to explicitly mention something like the spacecraft accident one, but if that ever arises, then there is a way to construct something that would describe that situation.

KATHERINE LIAO: I ran into Chris at a meeting, and he said something along the lines that he thinks it's going to be even more NLP-based. I don't know -- is it going to be more like a language?

PETER SZOLOVITS: Well, you'd need to ask him.

KATHERINE LIAO: Yeah, I don't know. He hints at it [INAUDIBLE]. I was like, OK, this will be interesting.

PETER SZOLOVITS: I think it's definitely more like a language, but it'll be more like the old Fred Thompson or the Diamond Diagram kind of language. It's a designed language that you're going to have to learn in order to figure out how to describe things appropriately. Or at least your billing clerk will have to learn it. Yeah?

AUDIENCE: I know we're toward the end, but I had a question about when a clinician is trying to label data -- for example, training data. Are there ever any ambiguities, where sometimes this person definitely has RA, and this person, I'm not really sure? How do you take that into account when you're actually training a [INAUDIBLE]?

KATHERINE LIAO: Yeah. So we actually have three categories: definite, possible, and no. There is always ambiguity, and you always want to have more than one reviewer. So in clinical trials, when you have outcomes, you have what we call adjudication -- some kind of system where, first, you sit down and you have to define the phenotype. Because not everybody is going to agree, even for a really clear disease, on how you define the disease and which components have to be present. For that, there are usually professional-society or research classification criteria -- there actually is one for RA and, you know, for coronary artery disease. And then it's having those different categories in a very structured system for adjudicating: blindly having two reviewers review, let's say, 20 of the same notes, and looking at the inter-rater reliability. Yeah, that's a big issue.

PETER SZOLOVITS: All right. I think we have expired. So Kat, thank you very much.

KATHERINE LIAO: Yes, thank you, everybody.