ctakes-dev mailing list archives

From Jeffrey Miller <jeff...@gmail.com>
Subject Re: Differences in dictionary built with dictionaryBuilder and sno_rx16ab from sourceforge [EXTERNAL]
Date Tue, 25 Jun 2019 14:11:53 GMT
Hi Sean,

Thanks for the clarification. I think that helps explain some of the
unexpected synonyms that appear in the sno_rx_16ab dictionary. For example,
"DM" for diabetes mellitus comes in from another ontology (possibly MEDCIN)
that was installed as part of UMLS; it was not manually added to
sno_rx_16ab. I suspect the confusion stems from people who installed only
the subset of UMLS they were interested in, for example only SNOMED and
RxNorm, using MetamorphoSys. If you do that and compare the resulting cTAKES
dictionary to sno_rx_16ab, it will be missing many synonyms.

I did work out where "diabete mellitus" comes from: the Consumer Health
Vocabulary (CHV, also part of UMLS), which intentionally contains common
misspellings and other term usages (see
https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/CHV/).

One thing I noticed: there appears to be a reconciliation process when the
dictionary creator processes synonyms from other ontologies. It seems to
reduce the number of synonyms for a CUI when the text span of one term is
already covered by another term in the same CUI, but the result can be a
little odd. For example, when you choose SNOMED and RxNorm but have other
ontologies available for synonyms, I think "diabetes" (from another
ontology, MEDCIN for one, but mapped to the same CUI) ends up consuming
"diabetes mellitus", so that term does not actually appear in CUI_TERMS
(you can see this in sno_rx_16ab, where it only shows up as the preferred
term), but "diabete mellitus" does persist (likely because "diabetes" is
not a subset of that string).

grep -i "'diabetes mellitus'" sno_rx_16ab.script
INSERT INTO PREFTERM VALUES(11849,'Diabetes Mellitus')

There are other examples of similar issues. For CUI 729346, "juvenile
osteochondrosis" is present in a dictionary created with only SNOMED
installed, but if you also install CHV it does not make it into the final
dictionary; only these do:

729346|2|3|osteochondropathy - juven|juven
729346|1|2|osteochondritis juvenilis|juvenilis
729346|1|2|juvenile osteochondritis|osteochondritis
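
To check which source vocabularies contribute strings for a CUI like this, one option is to scan MRCONSO.RRF directly. A rough Python sketch, assuming the standard pipe-delimited MRCONSO layout (CUI in field 0, language in field 1, SAB in field 11, string in field 14) and that the numeric cTAKES CUI maps to the UMLS form by zero-padding to seven digits; the function names and the file path are placeholders:

```python
def cui_to_umls(cui_number):
    """Convert a numeric cTAKES CUI (e.g. 729346) to UMLS form (C0729346)."""
    return f"C{cui_number:07d}"


def strings_by_source(lines, cui_number, language="ENG"):
    """Group the lowercased strings for one CUI by source vocabulary (SAB).

    `lines` is any iterable of MRCONSO.RRF rows.
    """
    target = cui_to_umls(cui_number)
    found = {}
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 15:
            continue  # not a complete MRCONSO row
        if fields[0] != target or fields[1] != language:
            continue
        found.setdefault(fields[11], set()).add(fields[14].lower())
    return found


# Example use (the path is a placeholder for a local UMLS install):
# with open("META/MRCONSO.RRF", encoding="utf-8") as rrf:
#     for sab, strings in sorted(strings_by_source(rrf, 729346).items()):
#         print(sab, sorted(strings))
```

Running this against a full install and against a SNOMED-only install should show which extra sources are supplying the shorter synonyms.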

A specific example that I have run into involves a dictionary built from
HPO alone versus one built when SNOMED was also available for synonyms. In
that case a few oddities arise. For example, "severe short stature", which
is in HPO, does not make it into the dictionary when SNOMED is installed
alongside it using MetamorphoSys, but it is there if HPO alone is
installed.
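
A quick way to list differences like these is to compare the CUI_TERMS rows of two .script files per CUI. A minimal Python sketch, assuming every row looks like the INSERT statements quoted in this thread (CUI first, term text fourth) and that embedded single quotes are doubled; the function names and file names are hypothetical:

```python
import re
from collections import defaultdict

# Matches rows such as:
# INSERT INTO CUI_TERMS VALUES(11849,0,2,'diabete mellitus','diabete')
CUI_TERM_ROW = re.compile(
    r"^INSERT INTO CUI_TERMS VALUES\((\d+),\d+,\d+,"
    r"'((?:[^']|'')*)','(?:[^']|'')*'\)$"
)


def terms_by_cui(lines):
    """Collect the term text of every CUI_TERMS row, keyed by numeric CUI."""
    terms = defaultdict(set)
    for line in lines:
        match = CUI_TERM_ROW.match(line.strip())
        if match:
            terms[int(match.group(1))].add(match.group(2).replace("''", "'"))
    return terms


def diff_dictionaries(old_lines, new_lines):
    """Return {cui: (terms only in old, terms only in new)} for CUIs that differ."""
    old, new = terms_by_cui(old_lines), terms_by_cui(new_lines)
    report = {}
    for cui in sorted(old.keys() | new.keys()):
        only_old = old.get(cui, set()) - new.get(cui, set())
        only_new = new.get(cui, set()) - old.get(cui, set())
        if only_old or only_new:
            report[cui] = (sorted(only_old), sorted(only_new))
    return report


# Example use (file names are placeholders):
# with open("sno_rx_16ab.script") as a, open("my_dictionary.script") as b:
#     for cui, (missing, added) in diff_dictionaries(a, b).items():
#         print(cui, "only in first:", missing, "only in second:", added)
```

Sorting the report by the size of the per-CUI difference would be an easy next step for finding the CUIs that lose the most synonyms.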

Out of curiosity, is there a practical difference in the resulting cTAKES
dictionary if you select the Source and Target columns for one ontology
(and nothing else), versus selecting the Source and Target columns for one
ontology and just the Source column for all other installed ontologies? I
know that with the Source of all the ontologies checked, the ontology terms
all end up in the CUI_TERMS table, but since they aren't in any target
table, would the effect be the same as leaving them unchecked (the synonyms
of the unchecked ontologies would be matched when running cTAKES if they
were of the same CUI as the selected ontology)?

Thanks,
Jeff

On Mon, Jun 24, 2019 at 10:58 AM Finan, Sean <
Sean.Finan@childrens.harvard.edu> wrote:

> Hi Jeff,
>
> The dictionary creator uses the CUI set from selected sources, but
> synonyms from all available sources for CUIs in that set.
>
> I am not sure what is going on with the 's' in "diabetes".  A grep for
> "diabetes mellitus" and "diabete mellitus" in the umls mrconso file might
> have a hint.  Perhaps some code thinks that it is fixing a plural term?
>
> Sean
> ________________________________________
> From: Jeffrey Miller <jeffmax@gmail.com>
> Sent: Tuesday, June 18, 2019 10:23 PM
> To: dev@ctakes.apache.org
> Subject: Re: Differences in dictionary built with dictionaryBuilder and
> sno_rx16ab from sourceforge [EXTERNAL]
>
> Thanks Sean. I actually think I figured out what is causing the difference.
> When I create the UMLS install on my machine, I only install RxNorm and
> SNOMEDCT_US, so when I use the dictionaryCreator GUI, there are only those
> two sources on the left. I noticed in the screenshots on the wiki page for
> the dictionary creator GUI that many sources were installed, but only
> SNOMEDCT_US and RxNorm were selected. So, I tried installing all of the
> active UMLS set (but still only selecting RxNorm and SNOMEDCT_US in the
> dictionaryCreator GUI) and it made a difference as to which terms appeared
> in the final cTAKES dictionary. As an example, I now get the "DM" entry for
> diabetes. I don't know why this should make a difference, but it appears
> that it does.
>
> Another odd observation related to this. In the sno_rx_2016ab file, I
> noticed there seems to be an error:
> INSERT INTO CUI_TERMS VALUES(11849,0,2,'diabete mellitus','diabete')
>
> The 's' is missing from diabetes. When I created my dictionary (from the
> restricted UMLS install, but still 2016ab) the cTAKES dictionary entry for
> that term is correct:
> INSERT INTO CUI_TERMS VALUES(11849,1,2,'diabetes mellitus','mellitus')
>
> When I created the dictionary from the full cTAKES install tonight, that
> error appeared again.
>
> Jeff
>
>
>
> On Mon, Jun 17, 2019 at 8:08 PM Finan, Sean <
> Sean.Finan@childrens.harvard.edu> wrote:
>
> > Hi Jeff,
> >
> > Thanks for doing the research.  Since the sno_rx_16ab was made 3+ years
> > ago I can't swear to any of those filter sets being exactly what was
> > used.
> >
> > I think that the key to working with any project is to check the
> > dictionary against a project's needs.  Fill in the gaps by either editing
> > the sql (.script) file or by adding a second dictionary.  In smaller
> > "focus" projects I usually end up augmenting the default dictionary with
> > a small custom bsv dictionary to catch any known synonyms or terms that
> > aren't represented in the default.  In projects requiring larger nets I
> > have built dictionaries that are horribly inclusive - 2 to 3 times the
> > sno_rx_16ab.
> >
> > Sean
> > ________________________________________
> > From: Jeffrey Miller <jeffmax@gmail.com>
> > Sent: Monday, June 17, 2019 4:39 PM
> > To: dev@ctakes.apache.org
> > Subject: Re: Differences in dictionary built with dictionaryBuilder and
> > sno_rx16ab from sourceforge [EXTERNAL]
> >
> > Thanks for following up Sean. I've looked into the links you sent along.
> > There are different groups of filters and it appears that the
> > dictionaryBuilder GUI is hardcoded to use the files in the "tiny"
> > directory. I don't think this is the set of filters used to make
> > sno_rx_16ab because the 'tiny' filter group contains "today" (today brand
> > veterinary product.  310367) in "UnwantedTexts.txt", but the
> > sno_rx_16ab.script file has "today" still in there. If you create a
> > dictionary with the dictionary builder, it does not include that term.
> >
> > I thought maybe the set of files under the "default" filter directory
> > might be the one used for the sno_rx_16ab package so I recompiled the
> > dictionaryCreator GUI to use the "default" filter files and created a new
> > snomed rxnorm dictionary from the 2016ab umls release, but the output is
> > still quite different than the packaged sno_rx_16ab dictionary. From
> > looking at diffs, it looks like there are a substantial number of
> > additions to the sno_rx_16ab, so much so that I really must be missing
> > something. For example, for CUI 12169 which describes a low sodium diet,
> > there are about 27 CUI terms in sno_rx_16ab.script, but in the script
> > generated by the dictionaryGUI there are only 7 (with the "tiny" or
> > "default" filter groups).
> >
> > On Sun, Jun 16, 2019 at 3:27 PM Remy Sanouillet <remys@foreseemed.com>
> > wrote:
> >
> > > Thanks for the clarifications, Sean. That was very enlightening. I look
> > > forward to the documentation (even if it entails some suffering on your
> > > part.)
> > >
> > > If/when you stumble on some idle time allowing you to implement the
> > > manual edit panel, it would be nice to have it allow for re-partitioning
> > > the ontology. As you are very aware, UMLS CUIs and SNOMED do not always
> > > have a one-to-one correspondence, resulting in a CUI matching multiple
> > > SNOMEDs or a SNOMED being mapped to several CUIs.
> > >
> > > In some cases, clinicians don't agree with that partitioning in
> > > specialized contexts and the inheritance that ensues and would like to
> > > re-assign them.
> > >
> > > Not holding my breath, but just something to keep in mind.
> > >
> > >       Remy
> > >
> > > On Sun, Jun 16, 2019 at 7:16 AM Finan, Sean <
> > > Sean.Finan@childrens.harvard.edu> wrote:
> > >
> > > > Hi Jeff,
> > > >
> > > > >1) ...
> > > > There are several collections of filter sets here:
> > > > ctakes-gui-res\src\main\resources\org\apache\ctakes\gui\dictionary\data\
> > > >
> > > > 2) ...
> > > > There is additional logic within the dictionary creator code:
> > > > ctakes-gui\src\main\java\org\apache\ctakes\gui\dictionary\
> > > >
> > > > I haven't gone through it in a really long time, and without doing so
> > > > now I can't enumerate the filters.  I have family visiting, otherwise my
> > > > curiosity would force me to do so and get back to you.   Honestly, it
> > > > should be documented somewhere, but writing (especially technical) is
> > > > pretty much my least favorite activity.
> > > >
> > > > Sean
> > > >
> > > >
> > > > p.s.
> > > > Please don't wait for it, but I am currently working on new dictionary
> > > > code and plan to introduce that in ctakes.  Again, please don't wait for
> > > > it as it is mixed in with other work and will not be available for
> > > > several months (if at all).
> > > >
> > > >
> > > > ________________________________________
> > > > From: Jeffrey Miller <jeffmax@gmail.com>
> > > > Sent: Sunday, June 16, 2019 9:49 AM
> > > > To: dev@ctakes.apache.org
> > > > Subject: Re: Differences in dictionary built with dictionaryBuilder
> > > > and sno_rx16ab from sourceforge [EXTERNAL]
> > > >
> > > > Hi Sean,
> > > >
> > > > Thanks for your response. I had two follow-up questions that would be
> > > > very helpful to understand if you have a few moments:
> > > >
> > > > 1) Are the specific filters used in the official sno_rx_16ab codified
> > > > anywhere so that I could reproduce them?
> > > >
> > > > 2) Do these filters explain all the changes? For example, when I use
> > > > the dictionary creator to export sno_med and rx_norm, I only get
> > > > "diabetes mellitus" whereas sno_rx_16ab contains both "diabetes" and
> > > > "dm". Especially with the addition of "dm" it feels like I must be
> > > > missing a step or a setting somewhere.
> > > >
> > > > Thanks!
> > > > Jeff
> > > >
> > > > On Sun, Jun 16, 2019 at 8:55 AM Finan, Sean <
> > > > Sean.Finan@childrens.harvard.edu> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > > The contents of the sno_rx_16ab are a dump of the umls 2016AB
> > > > > > snomed and rxnorm terms with certain semantic types.  Nothing was
> > > > > > added, but synonyms are filtered based upon various rules.  For
> > > > > > instance, unnecessary suffixes are removed ("Wart (Finding)" ->
> > > > > > "Wart"), really long terms are excluded ("can walk straight line
> > > > > > with only minimal assistance"), terms with dose or form are
> > > > > > ignored and so forth.
> > > > > >
> > > > > > Some filters can be changed by adding/removing from
> > > > > > prefix/suffix/contains lists in plaintext files or by modifying
> > > > > > the dictionary creator code.
> > > > > >
> > > > > > There was no manual curation (or nothing major).  As Remy
> > > > > > mentioned that requires a lot of attention and time.  The
> > > > > > dictionary database was not intended to be perfect, just as good
> > > > > > as possible without major investment - and reproducible with
> > > > > > updates to the umls.
> > > > > >
> > > > > > As the dictionary is released as a sql database, you should be
> > > > > > able to add and remove fairly easily if sql savvy.  I have long
> > > > > > wanted to add a "manual edit" panel to the dictionary gui, but
> > > > > > haven't had the time.  If anybody else would like to work on such
> > > > > > a tool that would be tonic.
> > > > >
> > > > > Sean
> > > > >
> > > > >
> > > > > ________________________________________
> > > > > From: Harish Kulkarni <harish.m.kulkarni@gmail.com>
> > > > > Sent: Saturday, June 15, 2019 5:16 PM
> > > > > To: dev@ctakes.apache.org
> > > > > > Subject: Re: Differences in dictionary built with dictionaryBuilder
> > > > > > and sno_rx16ab from sourceforge [EXTERNAL]
> > > > >
> > > > > unsubscribe
> > > > >
> > > > > > On Sat, Jun 15, 2019 at 1:40 PM Remy Sanouillet <remys@foreseemed.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Yes, I agree it would be nice because the tokenization that
> > > > > > > occurs when creating the dictionaries from the releases makes
> > > > > > > comparisons a bit tricky and is not 100% reversible. I would
> > > > > > > love to hear an answer to your quandary.
> > > > > >
> > > > > >      Remy
> > > > > >
> > > > > > > On Sat, Jun 15, 2019 at 1:23 PM Jeffrey Miller <jeffmax@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks, I was curious if the cTAKES devs that created the
> > > > > > > > sno_rx_16ab dictionary had put the differences applied to the
> > > > > > > > default UMLS output into version control in some form. I
> > > > > > > > imagine the additions/synonyms/abbreviations that were added
> > > > > > > > manually must have been collected over time somewhere prior to
> > > > > > > > merging them with the 2016ab UMLS release? I basically want to
> > > > > > > > recreate the default cTAKES 4.0.0 release with an additional
> > > > > > > > ontology and the latest terms. I can likely come up with a
> > > > > > > > diff myself but was wondering if this was already maintained
> > > > > > > > as part of cTAKES.
> > > > > > >
> > > > > > > > On Sat, Jun 15, 2019 at 12:24 PM Remy Sanouillet
> > > > > > > > <remys@foreseemed.com> wrote:
> > > > > > > >
> > > > > > > > > Yes, that's pretty much what we do too. Not only to enhance
> > > > > > > > > the dictionary, but to put in corrections because, lo and
> > > > > > > > > behold, there are some errors in there! As you know, an
> > > > > > > > > ontology is a constant curation job and that script, under
> > > > > > > > > SCM, allows you to isolate those changes and, if necessary,
> > > > > > > > > re-apply them to new versions.
> > > > > > > >
> > > > > > > >       Remy
> > > > > > > >
> > > > > > > > > On Sat, Jun 15, 2019 at 8:36 AM gandhi rajan
> > > > > > > > > <gandhirajan.n@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Jeff,
> > > > > > > > > >
> > > > > > > > > > As far as I know, maintaining a separate SQL script to add
> > > > > > > > > > additional entries should work seamlessly.
> > > > > > > > >
> > > > > > > > > > On Saturday, June 15, 2019, Jeffrey Miller
> > > > > > > > > > <jeffmax@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks Remy. Does anyone know if these manually curated
> > > > > > > > > > > modifications/synonyms are tracked anywhere (aside from
> > > > > > > > > > > the dictionary itself) so they can be carried forward in
> > > > > > > > > > > future dictionary updates?
> > > > > > > > > >
> > > > > > > > > > > On Fri, Jun 14, 2019 at 4:28 PM Remy Sanouillet
> > > > > > > > > > > <remys@foreseemed.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > From my experience, it seems pretty obvious that
> > > > > > > > > > > > sno_rx_16ab is a curated dictionary based on the
> > > > > > > > > > > > SNOMED 2016AB release. It does not contain the full
> > > > > > > > > > > > set but it has additional edits and synonyms that are
> > > > > > > > > > > > pretty useful (including 'dm').
> > > > > > > > > > > >
> > > > > > > > > > > > We have had to manage those mods as an adjunct.
> > > > > > > > > > >
> > > > > > > > > > >       Remy
> > > > > > > > > > >
> > > > > > > > > > > > On Fri, Jun 14, 2019 at 1:03 PM Jeffrey Miller
> > > > > > > > > > > > <jeffmax@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > I have created a custom dictionary from the latest
> > > > > > > > > > > > > UMLS release with SNOMEDCT_US and RxNorm and I've
> > > > > > > > > > > > > noticed it seems to be generating a .script file
> > > > > > > > > > > > > with unexpected differences as compared to the
> > > > > > > > > > > > > sno_rx_16ab file available as part of the cTAKES
> > > > > > > > > > > > > release. Specifically, for diabetes, it is missing
> > > > > > > > > > > > > these two rows:
> > > > > > > > > > > > > INSERT INTO CUI_TERMS VALUES(11849,0,1,'dm','dm')
> > > > > > > > > > > > > INSERT INTO CUI_TERMS VALUES(11849,0,1,'diabetes','diabetes')
> > > > > > > > > > > > >
> > > > > > > > > > > > > and only has this one:
> > > > > > > > > > > > > INSERT INTO CUI_TERMS VALUES(11849,1,2,'diabetes mellitus','mellitus')
> > > > > > > > > > > > >
> > > > > > > > > > > > > The end result is that "diabetes" is not being
> > > > > > > > > > > > > picked up in the test text I am running through- it
> > > > > > > > > > > > > requires the full 'diabetes mellitus'.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Is there any setting on the UMLS install side or the
> > > > > > > > > > > > > cTAKES dictionary creator that could account for
> > > > > > > > > > > > > missing alternative forms like this? I've tried
> > > > > > > > > > > > > downloading the 2016AB release (which I think is the
> > > > > > > > > > > > > one used to create the bundled sno_rx_16ab package?)
> > > > > > > > > > > > > and I am not getting the alternate forms in that
> > > > > > > > > > > > > dictionary either.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Jeff
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Regards,
> > > > > > > > > Gandhi
> > > > > > > > >
> > > > > > > > > > "The best way to find urself is to lose urself in the
> > > > > > > > > > service of others !!!"
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
