# ctakes-dev mailing list archives

From "Finan, Sean" <Sean.Fi...@childrens.harvard.edu>
Subject Re: Differences in dictionary built with dictionaryBuilder and sno_rx16ab from sourceforge [EXTERNAL]
Date Tue, 18 Jun 2019 00:08:26 GMT
Hi Jeff,

Thanks for doing the research.  Since the sno_rx_16ab was made 3+ years ago I can't swear
to any of those filter sets being exactly what was used.

I think that the key to working with any project is to check the dictionary against a project's
needs.  Fill in the gaps by either editing the sql (.script) file or by adding a second dictionary.
In smaller "focus" projects I usually end up augmenting the default dictionary with a small
custom bsv dictionary to catch any known synonyms or terms that aren't represented in the
default.  In projects requiring larger nets I have built dictionaries that are horribly inclusive
- 2 to 3 times the sno_rx_16ab.

Sean
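
For illustration only, here is a minimal sketch of the "edit the sql (.script) file" route: it formats extra rows in the same shape as the CUI_TERMS rows quoted further down this thread. The column meaning (CUI, index of the rare word, token count, text, rare word) is inferred from those example rows, whitespace splitting is a simplification of the real tokenization, and `cui_terms_row` is a hypothetical helper, not part of cTAKES.

```python
def cui_terms_row(cui: int, text: str, rare_word: str) -> str:
    """Format one INSERT statement shaped like the CUI_TERMS rows in a .script file.

    Assumed columns (inferred from example rows, not from the cTAKES source):
    CUI, index of the rare word, token count, lower-cased text, rare word.
    The caller chooses the rare word; this sketch does not compute rarity.
    """
    tokens = text.lower().split()  # simplification: whitespace tokens only
    rare = rare_word.lower()
    if rare not in tokens:
        raise ValueError(f"{rare_word!r} is not a token of {text!r}")

    def quote(value: str) -> str:
        # SQL string literal: double any embedded single quote
        return "'" + value.replace("'", "''") + "'"

    return "INSERT INTO CUI_TERMS VALUES({},{},{},{},{})".format(
        cui, tokens.index(rare), len(tokens), quote(" ".join(tokens)), quote(rare)
    )


if __name__ == "__main__":
    # Rows a supplement script could carry forward between dictionary builds.
    for term in ("dm", "diabetes"):
        print(cui_terms_row(11849, term, term))
```

Keeping such rows in a separate file under version control makes it possible to re-apply them after each rebuild instead of hand-editing the generated script.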
________________________________________
From: Jeffrey Miller <jeffmax@gmail.com>
Sent: Monday, June 17, 2019 4:39 PM
To: dev@ctakes.apache.org
Subject: Re: Differences in dictionary built with dictionaryBuilder and sno_rx16ab from sourceforge
[EXTERNAL]

Thanks for following up Sean. I've looked into the links you sent along.
There are different groups of filters and it appears that the
dictionaryBuilder GUI is hardcoded to use the files in the "tiny"
directory. I don't think this is the set of filters used to make
sno_rx_16ab because the 'tiny' filter group contains "today" (today brand
veterinary product.  310367) in "UnwantedTexts.txt", but the
sno_rx_16ab.script file has "today" still in there. If you create a
dictionary with the dictionary builder, it does not include that term.
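
The check described above can be sketched roughly as follows, assuming the .script file stores terms as `INSERT INTO CUI_TERMS VALUES(...)` rows like the ones quoted later in this thread. The function name and the idea of passing the unwanted texts as an in-memory set are illustrative choices; the actual layout of "UnwantedTexts.txt" is not assumed here.

```python
import re

# Captures the text column of rows such as:
#   INSERT INTO CUI_TERMS VALUES(11849,0,1,'dm','dm')
ROW = re.compile(r"INSERT INTO CUI_TERMS VALUES\(\d+,\d+,\d+,'((?:[^']|'')*)','")


def surviving_terms(script_text, unwanted):
    """Return the unwanted texts that still occur as complete terms in the script."""
    present = {m.group(1).replace("''", "'") for m in ROW.finditer(script_text)}
    return {term for term in unwanted if term.lower() in present}


if __name__ == "__main__":
    sample = "INSERT INTO CUI_TERMS VALUES(11849,0,1,'dm','dm')\n"
    print(surviving_terms(sample, {"dm", "today"}))
```

A non-empty result for a given filter group suggests that group was not the one applied when the script was built.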

I thought maybe the set of files under the "default" filter directory might
be the one used for the sno_rx_16ab package so I recompiled the
dictionaryCreator GUI to use the "default" filter files and created a new
snomed rxnorm dictionary from the 2016ab umls release, but the output is
still quite different than the packaged sno_rx_16ab dictionary. From
looking at diffs, it looks like there are a substantial number of additions
to the sno_rx_16ab, so much so that I really must be missing something. For
example, for CUI 12169 which describes a low sodium diet, there are about
27 CUI terms in sno_rx_16ab.script, but in the script generated by the
dictionaryGUI there are only 7 (with the "tiny" or "default" filter groups).
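
A per-CUI comparison like the one above can be scripted rather than read off a raw diff. This is a minimal sketch, assuming both files contain `INSERT INTO CUI_TERMS VALUES(...)` rows in the shape quoted later in this thread; `terms_by_cui` and `missing_terms` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Captures the CUI and the text column of each CUI_TERMS row.
ROW = re.compile(r"INSERT INTO CUI_TERMS VALUES\((\d+),\d+,\d+,'((?:[^']|'')*)','")


def terms_by_cui(script_text):
    """Group the term texts found in a .script dump by their CUI."""
    terms = defaultdict(set)
    for cui, text in ROW.findall(script_text):
        terms[int(cui)].add(text.replace("''", "'"))
    return terms


def missing_terms(packaged, rebuilt):
    """Map each CUI to the terms present in `packaged` but absent from `rebuilt`."""
    old, new = terms_by_cui(packaged), terms_by_cui(rebuilt)
    result = {}
    for cui, texts in old.items():
        gone = texts - new.get(cui, set())
        if gone:
            result[cui] = sorted(gone)
    return result


if __name__ == "__main__":
    packaged = (
        "INSERT INTO CUI_TERMS VALUES(11849,0,1,'dm','dm')\n"
        "INSERT INTO CUI_TERMS VALUES(11849,1,2,'diabetes mellitus','mellitus')\n"
    )
    rebuilt = "INSERT INTO CUI_TERMS VALUES(11849,1,2,'diabetes mellitus','mellitus')\n"
    print(missing_terms(packaged, rebuilt))
```

Swapping the two arguments lists the terms that appear only in the rebuilt dictionary.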

On Sun, Jun 16, 2019 at 3:27 PM Remy Sanouillet <remys@foreseemed.com>
wrote:

> Thanks for the clarifications, Sean. That was very enlightening. I look
> forward to the documentation (even if it entails some suffering on your
> part.)
>
> If/when you stumble on some idle time allowing you to implement the manual
> edit panel, it would be nice to have it allow for re-partitioning the
> ontology. As you are very aware, UMLS CUIs and SNOMED do not always have a
> one-to-one correspondence resulting in a CUI matching multiple SNOMEDs or
> a SNOMED being mapped to several CUIs.
>
> In some cases, clinicians don't agree with that partitioning in specialized
> contexts and the inheritance that ensues and would like to re-assign them.
>
> Not holding my breath, but just something to keep in mind.
>
>       Remy
>
> On Sun, Jun 16, 2019 at 7:16 AM Finan, Sean <
> Sean.Finan@childrens.harvard.edu> wrote:
>
> > Hi Jeff,
> >
> > >1) ...
> > There are several collections of filter sets here:
> > ctakes-gui-res\src\main\resources\org\apache\ctakes\gui\dictionary\data\
> >
> > 2) ...
> > There is additional logic within the dictionary creator code:
> > ctakes-gui\src\main\java\org\apache\ctakes\gui\dictionary\
> >
> > I haven't gone through it in a really long time, and without doing so now
> > I can't enumerate the filters.  I have family visiting, otherwise my
> > curiosity would force me to do so and get back to you.   Honestly, it
> > should be documented somewhere, but writing (especially technical) is
> > pretty much my least favorite activity.
> >
> > Sean
> >
> >
> > p.s.
> > Please don't wait for it, but I am currently working on new dictionary
> > code and plan to introduce that in ctakes.  Again, please don't wait for it
> > as it is mixed in with other work and will not be available for several
> > months (if at all).
> >
> >
> > ________________________________________
> > From: Jeffrey Miller <jeffmax@gmail.com>
> > Sent: Sunday, June 16, 2019 9:49 AM
> > To: dev@ctakes.apache.org
> > Subject: Re: Differences in dictionary built with dictionaryBuilder and
> > sno_rx16ab from sourceforge [EXTERNAL]
> >
> > Hi Sean,
> >
> > Thanks for your response. I had two follow-up questions that would be very
> > helpful to understand if you have a few moments:
> >
> > 1) Are the specific filters used in the official sno_rx_16ab codified
> > anywhere so that I could reproduce them?
> >
> > 2) Do these filters explain all the changes? For example, when I use the
> > dictionary creator to export sno_med and rx_norm, I only get "diabetes
> > mellitus" whereas sno_rx_16ab contains both "diabetes" and "dm".
> > Especially with the addition of "dm" it feels like I must be missing a step
> > or a setting somewhere.
> >
> > Thanks!
> > Jeff
> >
> > On Sun, Jun 16, 2019 at 8:55 AM Finan, Sean <
> > Sean.Finan@childrens.harvard.edu> wrote:
> >
> > > Hi all,
> > >
> > > The contents of the sno_rx_16ab are a dump of the umls 2016AB snomed and
> > > rxnorm terms with certain semantic types.  Nothing was added, but synonyms
> > > are filtered based upon various rules.  For instance, unnecessary suffixes
> > > are removed ("Wart (Finding)" -> "Wart"), really long terms are excluded
> > > ("can walk straight line with only minimal assistance"), terms with dose or
> > > form are ignored and so forth.
> > >
> > > Some filters can be changed by adding/removing from prefix/suffix/contains
> > > lists in plaintext files or by modifying the dictionary creator code.
> > >
> > > There was no manual curation (or nothing major).  As Remy mentioned that
> > > requires a lot of attention and time.  The dictionary database was not
> > > intended to be perfect, just as good as possible without major investment -
> > > and reproducible with updates to the umls.
> > >
> > > As the dictionary is released as a sql database, you should be able to add
> > > and remove fairly easily if sql savvy.  I have long wanted to add a "manual
> > > edit" panel to the dictionary gui, but haven't had the time.  If anybody
> > > else would like to work on such a tool that would be tonic.
> > >
> > > Sean
> > >
> > >
> > > ________________________________________
> > > From: Harish Kulkarni <harish.m.kulkarni@gmail.com>
> > > Sent: Saturday, June 15, 2019 5:16 PM
> > > To: dev@ctakes.apache.org
> > > Subject: Re: Differences in dictionary built with dictionaryBuilder and
> > > sno_rx16ab from sourceforge [EXTERNAL]
> > >
> > > unsubscribe
> > >
> > > On Sat, Jun 15, 2019 at 1:40 PM Remy Sanouillet <remys@foreseemed.com>
> > > wrote:
> > >
> > > > Yes, I agree it would be nice because the tokenization that occurs when
> > > > creating the dictionaries from the releases makes comparisons a bit tricky
> > > > and is not 100% reversible. I would love to hear an answer to your
> > > > quandary.
> > > >
> > > >      Remy
> > > >
> > > > On Sat, Jun 15, 2019 at 1:23 PM Jeffrey Miller <jeffmax@gmail.com>
> > > wrote:
> > > >
> > > > > Thanks, I was curious if the cTAKES devs that created the sno_rx_16ab
> > > > > dictionary had put the differences applied to the default UMLS output into
> > > > > version control in some form. I imagine the
> > > > > additions/synonyms/abbreviations that were added manually must have been
> > > > > collected over time somewhere prior to merging them with 2016ab UMLS
> > > > > release? I basically want to recreate the default cTAKES 4.0.0 release with
> > > > > an additional ontology and the latest terms. I can likely come up with a
> > > > > diff myself but was wondering if this was already maintained as part of
> > > > > cTAKES.
> > > > >
> > > > > On Sat, Jun 15, 2019 at 12:24 PM Remy Sanouillet <remys@foreseemed.com>
> > > > > wrote:
> > > > >
> > > > > > Yes, that's pretty much what we do too. Not only to enhance the dictionary,
> > > > > > but to put in corrections because, lo and behold, there are some errors in
> > > > > > there! As you know, an ontology is a constant curation job and that
> > > > > > script, under SCM, allows you to isolate those changes and, if necessary,
> > > > > > re-apply them to new versions.
> > > > > >
> > > > > >       Remy
> > > > > >
> > > > > > On Sat, Jun 15, 2019 at 8:36 AM gandhi rajan <gandhirajan.n@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Jeff,
> > > > > > >
> > > > > > > As far as I know, maintaining a separate SQL script to
> > > > > > > entries should work seamlessly.
> > > > > > >
> > > > > > > On Saturday, June 15, 2019, Jeffrey Miller <jeffmax@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks Remy. Does anyone know if these manually curated
> > > > > > > > modifications/synonyms are tracked anywhere (aside from the dictionary
> > > > > > > > itself) so they can be carried forward in future dictionary
> > > > > > > >
> > > > > > > > On Fri, Jun 14, 2019 at 4:28 PM Remy Sanouillet <remys@foreseemed.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > From my experience, it seems pretty obvious that sno_rx_16ab is a curated
> > > > > > > > > dictionary based on the SNOMED 2016AB release. It does not contain the full
> > > > > > > > > set but it has additional edits and synonyms that are pretty useful
> > > > > > > > > (including 'dm').
> > > > > > > > >
> > > > > > > > > We have had to manage those mods as an adjunct.
> > > > > > > > >
> > > > > > > > >       Remy
> > > > > > > > >
> > > > > > > > > On Fri, Jun 14, 2019 at 1:03 PM Jeffrey Miller <jeffmax@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > > I have created a custom dictionary from the latest UMLS release with
> > > > > > > > > > SNOMEDCT_US and RxNorm and I've noticed it seems to be generating .script
> > > > > > > > > > file with unexpected differences as compared to the sno_rx_16ab file
> > > > > > > > > > available as part of the cTAKES release. Specifically, for diabetes, it is
> > > > > > > > > > missing these two rows:
> > > > > > > > > > INSERT INTO CUI_TERMS VALUES(11849,0,1,'dm','dm')
> > > > > > > > > > INSERT INTO CUI_TERMS VALUES(11849,0,1,'diabetes','diabetes')
> > > > > > > > > >
> > > > > > > > > > and only has this one:
> > > > > > > > > > INSERT INTO CUI_TERMS VALUES(11849,1,2,'diabetes mellitus','mellitus')
> > > > > > > > > >
> > > > > > > > > > The end result is that "diabetes" is not being picked up in the test text I
> > > > > > > > > > am running through- it requires the full 'diabetes mellitus'.
> > > > > > > > > >
> > > > > > > > > > Is there any setting on the UMLS install side or the cTAKES dictionary
> > > > > > > > > > creator that could account for missing alternative forms like this? I've
> > > > > > > > > > I think is the one used to
> > > > > > > > > > create the bundled sno_rx_16ab package?) and I am not getting the alternate
> > > > > > > > > > forms in that dictionary either.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Jeff
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Regards,
> > > > > > > Gandhi
> > > > > > >
> > > > > > > "The best way to find urself is to lose urself in the service
> of
> > > > others
> > > > > > > !!!"
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
