ctakes-dev mailing list archives

From "Miller, Timothy" <Timothy.Mil...@childrens.harvard.edu>
Subject Re: Accessing the External Resource from the UimaContext without Using XML descriptor [EXTERNAL] [SUSPICIOUS]
Date Sun, 30 Jun 2019 11:25:56 GMT
Just wanted to make a general comment about this. I've worked on the spelling correction problem
a tiny bit and it has all of the difficulties you all describe, and I think it is also slow
in a kind of unavoidable way because it's doing quite a bit of extra work on each word.

I still would like a better solution, but I find myself wondering if there's good evidence
for spelling correction having a real impact on a problem. I would like to see a paper saying,
"we corrected all the spelling in this subset of Mimic, and it had the following effect on
performance:"

phenotyping: X -> Y
NER: X -> Y
adverse event detection: X -> Y

Carrying out these experiments is a serious amount of work, potentially for a result
that could be negative and difficult to publish. Even as a thought experiment,
I have a hard time convincing myself that I'd see large gains in these categories.

Tim

________________________________________
From: Finan, Sean <Sean.Finan@childrens.harvard.edu>
Sent: Saturday, June 29, 2019 7:00 PM
To: dev@ctakes.apache.org
Subject: Re: Accessing the External Resource from the UimaContext without Using XML descriptor
[EXTERNAL] [SUSPICIOUS]

I implemented a quick and dirty Soundex a few years ago.  Terrible precision.  I tried using
it as a "catch" for terms that were not netted by the regular lookup.   Then I found myself
running down that rabbit hole trying to identify topics like you (Pete) mention ... which
just means that I had turned an attempt at solving one NLP problem into an attempt at solving
two.   I crawled out and haven't looked back.

Sean
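
To illustrate why a Soundex "catch" tends to have poor precision, here is a
minimal sketch of classic American Soundex in Java. It is not Sean's
implementation and not part of cTAKES; the class name SoundexSketch and the
example word pairs are assumptions chosen only for illustration.

```java
import java.util.Locale;

// Hypothetical helper, not a cTAKES class: classic American Soundex.
final class SoundexSketch {

    // Digit for each letter A..Z; '0' means the letter is not coded
    // (vowels, Y, H, W).
    private static final String CODES = "01230120022455012623010202";

    static String encode(String word) {
        String upper = word.toUpperCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (int i = 0; i < upper.length(); i++) {
            char c = upper.charAt(i);
            if (c < 'A' || c > 'Z') {
                continue; // ignore anything that is not a basic Latin letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as-is
                last = code;
                continue;
            }
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate equal codes
            }
            if (code != '0' && code != last) {
                out.append(code);
            }
            last = code; // vowels reset the "previous code"
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A misspelling and its target share a code (the desired behaviour) ...
        System.out.println(encode("aneurism") + " " + encode("aneurysm"));
        // ... but so do unrelated terms (the precision problem).
        System.out.println(encode("aneurysm") + " " + encode("anuresis"));
        System.out.println(encode("ileum") + " " + encode("ilium"));
    }
}
```

Because only the first letter and three consonant classes survive, distinct
terms such as "ileum" and "ilium" collapse to the same key, so every phonetic
hit still needs a second disambiguation step.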
________________________________________
From: Peter Abramowitsch <pabramowitsch@gmail.com>
Sent: Saturday, June 29, 2019 12:02 PM
To: dev@ctakes.apache.org
Subject: Re: Accessing the External Resource from the UimaContext without Using XML descriptor
[EXTERNAL]

I've been wondering whether Levenshtein Distance or Soundex have any
potential in the cTakes pipeline. For example, if, after failing the
dictionary lookup, one used something like CSpell to find a potential
concept, but then used one of these linguistic similarity methods to
quantify the difference between it and the source over the text range and
turn that into a confidence value, would it help mitigate overfitting?  I
guess the answer depends on how often radically different concepts differ
by a single character.  Another factor, as was hinted at above, is that
spelling issues in consumer-provided text are completely different in
character from those of the rushed clinician, and these may require
completely different solutions.
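
A minimal sketch of the idea, assuming the confidence is simply one minus the
Levenshtein distance divided by the longer string's length. The class
EditDistanceSketch, the method names, and the normalization are assumptions
for illustration; nothing like this exists in cTAKES or CSpell as far as this
thread shows.

```java
// Hypothetical helper, not a cTAKES or CSpell class.
final class EditDistanceSketch {

    // Classic dynamic-programming Levenshtein distance using two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1.0 for identical strings, 0.0 when every
    // character differs.
    static double confidence(String sourceText, String candidateTerm) {
        int longest = Math.max(sourceText.length(), candidateTerm.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(sourceText, candidateTerm) / longest;
    }

    public static void main(String[] args) {
        // The transposition from the thread counts as two edits.
        System.out.println(confidence("paqeutes", "paquetes"));
        // Two distinct anatomical terms that differ by one character.
        System.out.println(confidence("ileum", "ilium"));
    }
}
```

Note that a transposition such as "paqeutes" costs two edits under plain
Levenshtein, while distinct terms like "ileum" and "ilium" are only one edit
apart, so a raw distance-based confidence can rank a wrong concept above a
simple typo.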

On Fri, Jun 28, 2019 at 6:34 AM Remy Sanouillet <remys@foreseemed.com>
wrote:

> Hi Siamak,
>
> I agree with Sean. Spelling correction in NLP is a bit of a tar baby. We
> attempted to integrate CSpell (
> https://lsg3.nlm.nih.gov/Specialist/Summary/cSpell.html
> ) to improve recall.
> Unfortunately we had to take it out because the overfitting affected
> precision and increased ambiguity too much.
>
>            Remy
>
> On Fri, Jun 28, 2019 at 5:20 AM Finan, Sean <
> Sean.Finan@childrens.harvard.edu> wrote:
>
> > Hi Siamak,
> >
> > The problem of misspelled terms is a big one.  I have read about
> > approaches taken by others for research, but nothing has been implemented
> > for ctakes.
> >
> > The only thing that has been done on my projects is addition to the
> > dictionary of common misspellings for a directed project.  For instance, in
> > a project specifically addressing brain aneurysms I added to the (project)
> > dictionary misspellings like "aneurism", "anurism" and "anurysm".  I didn't
> > worry about misspellings for terms that didn't apply to the project; I
> > didn't bother adding things like "skelatal" for "skeletal" because I didn't
> > really care if that term was missed.
> >
> > Sean
> > ________________________________________
> > From: Siamak Barzegar <barzegar.siamak@gmail.com>
> > Sent: Friday, June 28, 2019 6:12 AM
> > To: dev@ctakes.apache.org
> > Subject: Re: Accessing the External Resource from the UimaContext without
> > Using XML descriptor [EXTERNAL]
> >
> > Dear Sean,
> >
> > Thank you very much for your help.
> > As you suggested, I use "BsvRareWordDictionary" and create a BSV file for
> > my small lexicon.
> > I am using it on Spanish medical documents. As you know, medical
> > documents have a lot of typos.  I was wondering whether there is any
> > dictionary lookup in cTAKES, or a component from another project, that
> > can detect these small typos.
> > For example, if we have this entry in the dictionary file:
> > C0000001|T01|Fumador 2 paq*ue*tes
> >
> > And in the document, we have "fumador 2 paq*eu*tes". Is there any way to
> > annotate this misspelled term as well?
> >
> > With Best Wishes,
> > Siamak
> >
> >
> >
> > On Tue, 25 Jun 2019 at 18:38, Finan, Sean <
> > Sean.Finan@childrens.harvard.edu>
> > wrote:
> >
> > > Ah.
> > >
> > > You are trying to use an old annotator.  It was never updated to be a
> > > uimafit component and I think that it may not work with the PipelineBuilder.
> > > Newer annotators have (for the most part) simpler interfaces and do not
> > > require explicit specification of resources, resource types, etc.
> > >
> > > You have several options (worst to best):
> > > 1.  Don't use PipelineBuilder
> > > 2.  Wrap the older annotator in a uimafit-compatible component
> > > 3.  Make a method that generates a description:
> > > -- UmlsDictionaryLookupAnnotator does this in a method named
> > > createAnnotatorDescription()
> > > https://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup/src/main/java/org/apache/ctakes/dictionary/lookup/ae/UmlsDictionaryLookupAnnotator.java
> > > -- Create the description and use the PipelineBuilder addDescription(..)
> > > method.
> > > 4.  Use the newer fast dictionary instead of the old one.
> > > -- The basic equivalent of the old *CSV annotator is
> > > BsvRareWordDictionary.  It takes a single parameter "bsvPath".  Instead of
> > > comma-separated values it wants Bar-separated values in the format
> > > Cui|Synonym or Cui|Tui|Synonym
> > > -- One misconception that people seem to have is that the "fast"
> > > dictionary is faster but less accurate.  Actually, it is faster and more
> > > accurate.  Speed was the greater difference and that name stuck.
> > >
> > > There may be other solutions, but those are what come to mind right
> now.
> > >
> > > Sean
> > > ________________________________________
> > > From: Siamak Barzegar <barzegar.siamak@gmail.com>
> > > Sent: Tuesday, June 25, 2019 11:46 AM
> > > To: dev@ctakes.apache.org
> > > Subject: Re: Accessing the External Resource from the UimaContext without
> > > Using XML descriptor [EXTERNAL]
> > >
> > > Thanks Sean,
> > >
> > > But it seems that only works for parameters, not external resources.
> > > Please see this file:
> > >
> > > https://github.com/apache/ctakes/blob/ctakes-4.0.0/ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorCSV.xml
> > >
> > > It has several externalResourceDependency entries that need to be bound
> > > to an externalResource. How can I do that with the PipelineBuilder? Do you
> > > have any suggestions?
> > >
> > > From Tutorial.ex6 in the UIMA examples:
> > >
> > > "When the Analysis Engine is initialized, it creates a single instance of
> > > StringMapResource_impl and loads it with the contents of the data file.
> > > This means that the framework calls the instance's load method, passing it
> > > an instance of DataResource, from which you can obtain a stream or URI/URL
> > > of the external resource that was declared in the external resource..."
> > >
> > > How can I do the same for the resource dependencies in
> > > DictionaryLookupAnnotatorCSV.xml?
> > >
> > > With Best Wishes,
> > > Siamak
> > >
> > >
> > > On Tue, 25 Jun 2019 at 16:38, Finan, Sean <
> > > Sean.Finan@childrens.harvard.edu>
> > > wrote:
> > >
> > > > Hi Siamak,
> > > >
> > > > Good question.  Yet another shortfall in the documentation ...
> > > >
> > > > There are several ways to set parameters in the PipelineBuilder.
> > > >
> > > > The javadocs for the 4.0.0 release version are here:
> > > >
> > > > http://ctakes.apache.org/apidocs/4.0.0/
> > > >
> > > > You can use the set(..) method to set "global" values, or place
> > > > component-specific values using the add(..) method.
> > > >
> > > > The PipelineBuilder in trunk has the additional method:
> > > > setIfEmpty(..)        Just like set(..) except any given attributes are
> > > > ignored if they already have values
> > > >
> > > > In addition, the add( component, parameters... ) in trunk has been changed
> > > > to:
> > > > add( component, views, parameters ).
> > > > Views are usually used for training ml models.  To use add(..) like the
> > > > original (without special views) specify add( component,
> > > > Collections.emptyList(), parameters ).   The method usage add( component )
> > > > still exists.  Apparently I was too lazy to properly refactor the method
> > > > with the original signature ...
> > > >
> > > > I hope that helps,
> > > > Sean
> > > >
> > > > ________________________________________
> > > > From: Siamak Barzegar <barzegar.siamak@gmail.com>
> > > > Sent: Tuesday, June 25, 2019 9:23 AM
> > > > To: dev@ctakes.apache.org
> > > > Subject: Accessing the External Resource from the UimaContext without
> > > > Using XML descriptor [EXTERNAL]
> > > >
> > > > I would like to use different cTAKES components through the
> > > > PipelineBuilder (exactly as in HelloWorldBuilderRunner.java).
> > > > But the problem is that (as I understand it) PipelineBuilder does not read
> > > > the XML descriptor of a component. I want to use the Dictionary Lookup
> > > > component (DictionaryLookupAnnotatorCSV.xml) in the following pipeline:
> > > >
> > > >          PipelineBuilder builder = new PipelineBuilder();
> > > >          builder
> > > >               .add( SimpleSegmentAnnotator.class )
> > > >               .add( SentenceDetector.class )
> > > >               .add( TokenizerAnnotator.class )
> > > >               // The Java class behind DictionaryLookupAnnotatorCSV.xml is:
> > > >               .add( DictionaryLookupAnnotator.class );
> > > >
> > > > But in the DictionaryLookupAnnotatorCSV.xml file, there are several
> > > > external resources that DictionaryLookupAnnotator needs to read:
> > > >
> > > > public void initialize(UimaContext aContext) {
> > > >   iv_context = aContext;
> > > >   ....
> > > >   FileResource fResrc = (FileResource) iv_context.getResourceObject("LookupDescriptor");
> > > >   ...
> > > >   iv_lookupSpecSet = LookupParseUtilities.parseDescriptor(descFile, iv_context);
> > > > }
> > > >
> > > > So, what is the best way to access these resources
> > > > (LookupDescriptorFile, DictionaryFileResource, RxnormIndex and
> > > > OrangeBookIndex) in DictionaryLookupAnnotatorCSV.xml from the code?
> > > >
> > > > Thanks a lot.
> > > > Siamak
> > > >
> > >
> > >
> > > --
> > > Siamak Barzegar, PhD.
> > > Senior Research Engineer.
> > > Biomedical Text Mining Unit.
> > > Barcelona Supercomputing Centre
> > >
