Status
Back from the depths, where I prepared and delivered two presentations over the course of two weeks, with the kind of immersion I require when I do that....
SHARP:
Figured out, again, how to log on to the Mayo cloud. Uploaded our stratified corpus.
Got Java Web Start stuff to work by modifying the Java Control Panel.
Drug NER meeting. Had an a-ha moment where I realized I could search for a string from my mystery drugs and potentially link it to a known BN/IN using RxCUI. See 'erirthrowhatever'.
Found the above example has two RxCUIs for the exact same orthography. This was mostly a head-scratcher.
SUCCESS:
Got 100 more cspy and 100 more path notes.
Other:
Reviewed cognitive and functional status grant.
Clarity Personal Hx Query
Got an e-mail today with a good query for getting personal history out of Clarity. A colleague put this together.
-- Look at use in medical hx of ICD9 codes with 'colon' in any of the diagnosis names for that code.
-- This includes a record for each time a person's medical hx is updated.
-- This is a rough proxy until a detailed code list is created.
SELECT dx.CURRENT_ICD9_LIST
     , max(dx.DX_NAME) -- dedupe by choosing an arbitrary dx_name for the icd code
     , count(*) as NumUsesInMedicalHx
FROM MEDICAL_HX m
LEFT OUTER JOIN CLARITY_EDG dx ON m.DX_ID = dx.DX_ID
WHERE dx.DX_NAME like '%colon%'
  AND dx.DX_NAME not like '%colonization%'
GROUP BY dx.CURRENT_ICD9_LIST
ORDER BY dx.CURRENT_ICD9_LIST
Paper Summary
Read this paper in prep for our manuscript. Seems like we're on a similar path.
· Australia is like an entire SEER site (but it's not SEER) where you have to report certain cancers to the government.
· Even though pathology labs do electronic reporting, the current reporting process to the government is done on paper. This is dumb, so can we automate this?
· The first step (corpus selection) is to identify notifiable reports. These are all cytology and histology reports excluding urine, sputum, and pap smear. This is done with a query on, I think, the HL7 data.
· The second step is to identify the histology type to see if it is a cancer-notifiable result. This step itself has two steps.
· Step 2a (NER) is to go over all the SNOMED CT concepts and reason over them (so rule based) to see if they are descendants of one of the notifiable concepts, of which there are six. They pick the most advanced of the concepts from the report as the concept for the report.
· Step 2b (status annotation) is to mark each of the concepts that fit the notifiable criteria as absent, possible, or present.
· If any notifiable histologies are present and aren't BCC or SCC of skin, then the result is notifiable.
· Then there's some discussion about supporting reports which makes no sense.
· "The ground truth was created based on an adjudication process between the reference data set provided by a domain expert and the output of the system for all reports in the development and evaluation set."
· Somehow their corpora ended up having roughly equal numbers of notifiable and non-notifiable reports, which seems crazy to me.
· They report sensitivity, PPV, specificity, and F-score.
· They have 30 misclassifications over both training and test and report an "error rate" based on this.
· They have 7 false negatives and 23 false positives, and report that the false negatives are more costly.
· The false negatives are mostly due to sectioning and status annotation errors.
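Not their code, obviously, but the gist of step 2a as I read it is a walk over the SNOMED CT is-a hierarchy to see whether a concept descends from one of the six notifiable concepts. A toy sketch in Python, with a made-up parents table:

def is_descendant(concept, notifiable, parents):
    # concept: SNOMED CT concept ID found in the report
    # notifiable: set of the six notifiable concept IDs
    # parents: dict mapping a concept ID to its is-a parent IDs (toy data)
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current in notifiable:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return False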
Mystery Meds
Here are some notes on my findings looking into mystery medications. Remember, these are medications that do not start with a brand name (BN) or ingredient (IN), a class of things I refer to as bnin.
There was a suggestion at the last meeting to use what I thought I was told was a 'synonym' column for the bnin listing. It turns out that column is more of a normalized form; or maybe the bnin was the normalized form and the column I was to use was the raw one. Whatever, using it worked well.
Modifications to code
Any mystery med that started with a synonym was removed from the list of mystery meds because if we treat syns like bnins in a new class called bninsyns, then these meds all start with bninsyns.
All remaining mystery meds' n-grams were compared with bninsyns to find those mystery meds with a bninsyn somewhere in them, even if it wasn't initial. These are referred to as solved mystery meds.
Bug fix: When calculating the prefix of a solved mystery med, ensure that the bninsyn is full tokens, not partial tokens
Bug fix: Can't remember what this was.
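A minimal sketch of what the matching amounts to now (the names here are stand-ins for my actual data structures; bninsyns is assumed to be a set of lowercased strings):

def find_bninsyn(med, bninsyns):
    # Return (prefix, matched bninsyn) for the first (longest on a tie)
    # full-token match, or None if the med is still a mystery.
    tokens = med.lower().split()
    for start in range(len(tokens)):
        for length in range(len(tokens) - start, 0, -1):  # longest n-gram first
            ngram = ' '.join(tokens[start:start + length])
            if ngram in bninsyns:
                # building n-grams from whole tokens is what keeps the match
                # from landing on partial tokens (the first bug fix above)
                return ' '.join(tokens[:start]), ngram
    return None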
Results
The number of remaining mystery meds (those without a bninsyn in them) went from 6895 four weeks ago to 3642.
The number of unique prefixes (the part of a mystery med before the first (and longest in the case of a tie) bninsyn) went from 5796 to 869.
So that's good. On our status call today, it was pointed out to me that:
There are many meds in the remaining mystery meds that do have BNs in them. But I'm not picking them up because they're surrounded by []s and I don't tokenize, so the created n-grams don't match the BN when I compare.
The RxNav tool, which I've been having trouble running inside our firewall, is good, and can give RxNorm synonyms.
Next Steps:
Fix problems with brand names hidden due to brackets.
See how well RxNav can get me the rest of the way on remaining mystery meds and even prefixes. E.g., asa -> aspirin, k+ -> potassium.
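For what it's worth, here's roughly what I have in mind for the bracket fix and the abbreviation lookups; the abbreviation map is just illustrative, not a real list:

import re

ABBREVIATIONS = {'asa': 'aspirin', 'k+': 'potassium'}  # hypothetical entries

def tokenize(med):
    # strip brackets so '[aspirin]' becomes 'aspirin', then expand abbreviations
    cleaned = re.sub(r'[\[\]]', ' ', med.lower())
    return [ABBREVIATIONS.get(tok, tok) for tok in cleaned.split()]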
IQR Indexes
Got most of the way through figuring out the descriptive stats of our corpora. Only thing left to do is IQR and median on both test and training for progress and radiology notes combined.
The best thing I found for figuring out what indexes to use for IQR is on page 34 of my stats book.
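For future me, a quick sketch of the calculation (this uses numpy's default percentile interpolation, which won't match every textbook's index convention exactly):

import numpy as np

def median_and_iqr(values):
    # median and interquartile range of a list of numbers
    q1, med, q3 = np.percentile(np.asarray(values, dtype=float), [25, 50, 75])
    return med, q3 - q1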
Schwing
Big win for me today, figuring out on my own how to set up a new set of notes in the Clinical Text Explorer so we could start to get some abstraction done on them. I had to, among other things:
Get a SAS programmer to get the notes and metadata out of SAS for me.
Dump each note into its own file, and the metadata into an Excel file.
Import the metadata and notes to a SQL Server table.
Create a View in the proper form with the proper links to the colonoscopy data.
Hunt down the lookup table the ADE app uses to populate its Text Source menu. (This was my biggest victory, since the two folks who know where this is are both effectively as reachable as one can be in the BWCA, and I just pounded away at it.)
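The import step was roughly along these lines; a hedged sketch with made-up file, table, and connection names, not exactly what I ran:

import pandas as pd
from sqlalchemy import create_engine

# hypothetical connection string and table name
engine = create_engine('mssql+pyodbc://myserver/mydb?driver=SQL+Server')
notes = pd.read_excel('colonoscopy_note_metadata.xlsx')  # one row per note
notes.to_sql('CSPY_NOTES', engine, if_exists='fail', index=False)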
One thing I want to note is that I'm using the field Note_Dont_Use. I named it that because I was worried there might be truncated notes in there, but I verified there weren't. So it's cool.
Alcohol
Interesting meeting today on the Alcohol grant we'll be submitting. The takeaway for me, though, is that for the first part we'll be trying to identify, using machine learning (or otherwise, I suppose) those patients who have evidence of a drinking problem. E.g., ER admits while intoxicated or seen in ER for traumas we know are alcohol related. E.g., pancreatitis (I think).
Might actually fit into the delivery system since there's a push to try to identify those patients who are undercoded...charted but not coded...and whose upcoding would lead to significantly increased revenue.
Other stuff I did today
Got more descriptive stats for the epi journal paper. Going ridonkulously slow, but today I got number of path reports for training and test and all cohorts as well as the number of patients in each cohort with at least one report.
What made this go slow was realizing the directory of files wasn't filtered by the same filter we use on our cohort (primary >= 1/1/95).
Created text versions of the colonoscopy reports for the Panther project. Issue: Have no idea where the path reports are.
Selecting Columns in Notepad++
Selecting tall rectangles
If you wish to select a very long column block that extends over many pages (for example, in a very long file), this might be the best technique:
Click to position the cursor at the top left corner of the desired block.
Scroll down to the desired end of the block by any means that does not change the cursor position (drag the vertical scroll bar, use the wheel of your mouse).
Hold down alt+shift and click on the bottom right corner of the desired block.
Source.
Smartsets
Spent some time today looking for smart set data in Clarity. I had a patient ID and a snippet of text, but wasn't able to find it. Though I did find one note with similar information in it.
The closest I was able to find was PAT_ENC_SMARTSET, which links a patient encounter to a smart set used in that encounter. However, SMARTSET_ID doesn't seem to link to anything in Clarity. And anyway, it seems it would link to the smart set template, as it were, and not the encounter's version of that smart set, which is what we'd want and should be on PAT_ENC_SMARTSET.
Care Everywhere
Had a presentation from HQ today on our implementation of Epic's Care Everywhere and how it does and doesn't work for research.
Two big takeaways:
We get records when they're requested, so care providers can do that, but we don't really have a way to go get them ourselves. Once they're retrieved, though, they become part of the clinical record.
The outside records are stored on a different server, an InterConnect server. (I'm guessing on the capitalization there.) She also made it sound like these records wouldn't be part of a Clarity extract, but that seems weird to me; it would mean Epic is pulling from two Chronicles databases during a clinical visit.
Status
Quick and dirty post before the day ends....
Got some wins figuring out the FAMILY_HX table, at least as far as I got. I think peeps here are gonna like that. Had a good convo with the boss about the value of getting GHRI knowledge of the data model.
Got some more descriptive stats for the epidemiology paper.
Had major problems pushing with github. Was 'cuz my work pwd changed and I had to go change it in my .gitconfig file (in a protected dir natch).
Emphasized how important it is for us to get files in front of abstractors to start abstracting notes to give us a gold standard.
FX
Some valuable queries on the structured family history data available in Clarity.
select COUNT(*) N, zmh.Name, zmh.MEDICAL_HX_C code, fh.RELATION
from FAMILY_HX fh
inner join ZC_MEDICAL_HX zmh on fh.MEDICAL_HX_C = zmh.MEDICAL_HX_C
where zmh.MEDICAL_HX_C = 20 or zmh.MEDICAL_HX_C = 30
group by zmh.NAME, zmh.MEDICAL_HX_C, fh.RELATION
order by N desc

select zmh.Name, zmh.MEDICAL_HX_C code, fh.RELATION, fh.PAT_ID, fh.PAT_ENC_CSN_ID, fh.PAT_ENC_DATE_REAL, fh.HX_LNK_ENC_CSN, pe.pat_enc_date_real, fh.CONTACT_DATE, fh.LINE
from FAMILY_HX fh
inner join ZC_MEDICAL_HX zmh on fh.MEDICAL_HX_C = zmh.MEDICAL_HX_C
left outer join pat_enc pe on fh.hx_lnk_enc_csn = pe.pat_enc_csn_id
where zmh.MEDICAL_HX_C = 20 or zmh.MEDICAL_HX_C = 30
order by PAT_ID, fh.PAT_ENC_DATE_REAL asc, fh.LINE

select fh.PAT_ENC_CSN_ID, pe1.pat_enc_date_real, fh.HX_LNK_ENC_CSN, pe2.PAT_ENC_DATE_REAL
from FAMILY_HX fh
inner join PAT_ENC pe1 on fh.PAT_ENC_CSN_ID = pe1.PAT_ENC_CSN_ID
inner join PAT_ENC pe2 on fh.HX_LNK_ENC_CSN = pe2.PAT_ENC_CSN_ID
where fh.PAT_ENC_CSN_ID <> fh.HX_LNK_ENC_CSN
and pe2.PAT_ENC_DATE_REAL > pe1.pat_enc_date_real
We have one remaining question, which is what the heck is the HX_LNK_ENC_CSN field? The description in the Clarity data dictionary isn't clear.
UPDATE 8/9/12: A colleague figured out what that HX_LNK_ENC_CSN field is for. The deal is that when a FX record is created, it's given its own encounter (kind of weird), and that encounter is in PAT_ENC_CSN_ID. If that FX is created in the context of a real-life encounter, then the ID of that encounter goes in HX_LNK_ENC_CSN. (Otherwise that column is NULL.)
Wildcat and Panther Status Meeting
Spent a lot of the day catching up on things after a week away at training. Here's a basic rundown of the things to remember from the day, most of which sprung from a status meeting for the Wildcat and Panther projects.
Panther Notes:
The work on family history and results is on track per the schedule.
The work on symptoms is behind schedule. We had hoped to have results and error analysis done on that by now, but we don't have a gold standard yet, so nothing there. More on that later.
The documentation of the pipeline is mostly done. It's probably something that will live on, though, so we decided I'd review it now so we could make sure it's aimed in the right direction.
Dev has added some (impressive) rules for family history, but we are suffering from a data sparsity problem.
We assume that any mention of family history is for colorectal cancer. (Hoping the investigators don't mind that it doesn't distinguish polyps; we have some ideas on how we might capture that, but, again, we need more data.)
(I sent off some information on the family history that's available from structured data.)
We had a win on case-sensitivity using the Java Patterns library, reducing the size of the jape file nicely.
Some rules from the Pitt pipeline have been combined to make the same output (but still exist as separate rules in case we want to separate later). E.g., A bunch of stuff -> Lower abdominal symptoms.
Some rules from the Pitt pipeline have been separated. E.g., GI Bleeding -> Lower GI Bleeding and Bleeding
Where we don't have an example, the code has been built and is ready for examples to be put into the jape grammar. E.g., Acute bowel obstruction.
The biggest issue is with the lack of a gold standard. I kind of thought we'd have it by sometime last week while I was gone, but there hasn't been good communication/handoff, and so the reports got created on Thursday night and nothing happened with them Friday. Now we need to get them into a user-friendly database and then have abstractors review the notes to create a gold standard.
Wildcat status:
Results are still good.
Some work has been done on feature selection.
I suggested doing the post-processor before going down into the categories where we only have a few examples.
The post-processor takes the results of each binary classifier and puts it through a filter, where it keeps at most the five "most important" positive classifications, where the "most important" list is compiled by humans.
I pointed out this could be problematic if we let a more important dx with a confidence score of, e.g., 51% go through and not a less important dx with a confidence score of, e.g., 90%. So that's food for thought.
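My read of the post-processor, as a sketch (names made up), which also shows why the confidence question matters: the cut is made by the human importance list, not by classifier confidence.

def post_process(positive_dxs, importance_rank, keep=5):
    # positive_dxs: dx codes the binary classifiers called positive
    # importance_rank: dict of dx code -> rank, lower = more important (human-compiled)
    ranked = sorted(positive_dxs, key=lambda dx: importance_rank.get(dx, float('inf')))
    return ranked[:keep]  # at most the five "most important" positives survive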
Clone command
Ugh, because I always forget it:
> git clone http://github.com/shalgrim/util.git
Do not use https...it hangs here at work.
Other Tasks From Today
Cleaned up the figures I'm thinking of getting into the next round of the BrCaRec epidemiology paper that is coming along too slowly.
Decided to hold off until the week of 8/6 to work on an article for the internal data-centric publication because maybe I'll learn something in training next week that would be good to turn into an article.
Windows and Paths
For the most part I think Windows has been and is a good OS. But gawd the way it deals with paths, while it's getting better, has caused me so much frustration over the years.
Most of my time today was spent trying to get a script set up to hotcopy our svn repository to a network shared drive. The network drives are backed up regularly, while our C drives are not, hence the desire for the regular hotcopy.
Anyway, kept getting a 'path does not exist' error from my Python script, but only when I ran it as a Scheduled Task. In troubleshooting, I found that if I ran my script from the command line, it confirmed that the drive letter was a path, but when run from a scheduled task it said it wasn't.
Oddly, os.path.ismount recognized the letter as a drive, and this awesome command listed it as a drive letter both when run from the command line and as a scheduled task.
Anyway, I finally gave up and just passed the script the UNC path via the config file. I'd gotten out of the habit of using UNC paths because cmd doesn't allow you to use them as a current directory. (There's kind of a workaround with pushd.)
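The fix boils down to something like this (paths and config keys are made up, not the real ones):

import configparser
import os
import subprocess

config = configparser.ConfigParser()
config.read('hotcopy.ini')
repo = config['svn']['repo']          # e.g. a local path like C:\repos\ours
dest = config['svn']['backup_dest']   # a UNC path like \\fileserver\share\svn_backup, not a mapped drive letter

if not os.path.isdir(dest):
    raise SystemExit('Backup destination not reachable: %s' % dest)

# copy the repository to the backed-up network share
subprocess.check_call(['svnadmin', 'hotcopy', repo, os.path.join(dest, 'repo_hotcopy')])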
So that seems to have worked and, as always with these kinds of things, I'm mostly upset I had to spend so much time on it.
To follow up, I asked our resident tech stud why I couldn't access the network drive from a scheduled task, and he said it was because that drive is mapped by a server-side logon script which isn't run for non-interactive logins. Huynh.