ctakes-dev mailing list archives

From Remy Sanouillet <re...@foreseemed.com>
Subject Re: Accessing the External Resource from the UimaContext without Using XML descriptor [EXTERNAL]
Date Fri, 28 Jun 2019 13:33:41 GMT
Hi Siamak,

I agree with Sean. Spelling correction in NLP is a bit of a tar baby. We
attempted to integrate CSpell (
https://lsg3.nlm.nih.gov/Specialist/Summary/cSpell.html) to improve recall.
Unfortunately we had to take it out because the overfitting affected
precision and increased ambiguity too much.

           Remy
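
A minimal sketch, not part of cTAKES or CSpell, of the kind of fuzzy comparison discussed in this thread. It computes the optimal string alignment distance (a restricted Damerau-Levenshtein distance), under which swapping two adjacent letters counts as a single edit. The names TypoDistance, osaDistance and isLikelyTypo, and the edit threshold, are hypothetical and chosen only for illustration.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not a cTAKES or CSpell class. */
final class TypoDistance {

    /**
     * Optimal string alignment distance: the minimum number of insertions,
     * deletions, substitutions, or swaps of two adjacent characters.
     */
    static int osaDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete every character of a
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert every character of b
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
                // Adjacent transposition, e.g. "ue" written as "eu".
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    /** True when token differs from term (ignoring case) by at most maxEdits edits. */
    static boolean isLikelyTypo(String token, String term, int maxEdits) {
        String t = token.toLowerCase(Locale.ROOT);
        String s = term.toLowerCase(Locale.ROOT);
        return !t.equals(s) && osaDistance(t, s) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(osaDistance("paquetes", "paqeutes"));
        System.out.println(isLikelyTypo("aneurism", "aneurysm", 1));
    }
}
```

With a threshold of 1 this would accept "paqeutes" for "paquetes", but as noted above, loose matching of this kind tends to cost precision and raise ambiguity, so any threshold would need tuning per project.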

On Fri, Jun 28, 2019 at 5:20 AM Finan, Sean <
Sean.Finan@childrens.harvard.edu> wrote:

> Hi Siamak,
>
> The problem of misspelled terms is a big one.  I have read about
> approaches taken by others for research, but nothing has been implemented
> for ctakes.
>
> The only thing that has been done on my projects is addition to the
> dictionary of common misspellings for a directed project.  For instance, in
> a project specifically addressing brain aneurysms I added to the (project)
> dictionary misspellings like "aneurism", "anurism" and "anurysm".  I didn't
> worry about misspellings for terms that didn't apply to the project; I
> didn't bother adding things like "skelatal" for "skeletal" because I didn't
> really care if that term was missed.
>
> Sean
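
One way to picture the approach Sean describes: each hand-picked misspelling becomes one more synonym row under the same concept, written in the Cui|Tui|Synonym layout mentioned further down this thread. The sketch below is an illustration only. The class name MisspellingRows and the CUI/TUI values are placeholders, not real UMLS identifiers and not cTAKES code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not a cTAKES class. */
final class MisspellingRows {

    /** Builds one bar-separated row for the correct term and one per misspelling. */
    static List<String> toBsvRows(String cui, String tui, String term, List<String> misspellings) {
        List<String> rows = new ArrayList<>();
        rows.add(String.join("|", cui, tui, term));
        for (String misspelling : misspellings) {
            // Same CUI and TUI, so a lookup hit on the misspelling maps to the same concept.
            rows.add(String.join("|", cui, tui, misspelling));
        }
        return rows;
    }

    public static void main(String[] args) {
        // "C0000002" and "T02" are placeholder identifiers, assumed for this example.
        List<String> rows = toBsvRows("C0000002", "T02", "aneurysm",
                Arrays.asList("aneurism", "anurism", "anurysm"));
        for (String row : rows) {
            System.out.println(row);
        }
    }
}
```

The resulting lines could be appended to a project-specific BSV file. Only misspellings that matter to the project need to be listed.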
> ________________________________________
> From: Siamak Barzegar <barzegar.siamak@gmail.com>
> Sent: Friday, June 28, 2019 6:12 AM
> To: dev@ctakes.apache.org
> Subject: Re: Accessing the External Resource from the UimaContext without
> Using XML descriptor [EXTERNAL]
>
> Dear Sean,
>
> Thank you very much for your help.
> As you suggested, I use "BsvRareWordDictionary" and create a BSV file for
> my small lexicon.
> I am using it on Spanish medical documents. As you know, medical
> documents have a lot of typos.  I was wondering whether there is any
> dictionary lookup in cTAKES, or a component from another project, that
> can detect these small typos.
> For example, if we have this entry in the dictionary file:
> C0000001|T01|Fumador 2 paq*ue*tes
>
> And in the document, we have "fumador 2 paq*eu*tes". Is there any way to
> annotate the misspelled term as well?
>
> With Best Wishes,
> Siamak
>
>
>
> On Tue, 25 Jun 2019 at 18:38, Finan, Sean <
> Sean.Finan@childrens.harvard.edu>
> wrote:
>
> > Ah.
> >
> > You are trying to use an old annotator.  It was never updated to be a
> > uimafit component and I think that it may not work with the
> > PipelineBuilder.
> > Newer annotators have (for the most part) simpler interfaces and do not
> > require explicit specification of resources, resource types, etc.
> >
> > You have several options (worst to best):
> > 1.  Don't use PipelineBuilder
> > 2.  Wrap the older annotator in a uimafit-compatible component
> > 3.  Make a method that generates a description:
> >  UmlsDictionaryLookupAnnotator does this in a method named
> > createAnnotatorDescription()
> >
> >
> > https://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup/src/main/java/org/apache/ctakes/dictionary/lookup/ae/UmlsDictionaryLookupAnnotator.java
> > -- Create the description and use the PipelineBuilder addDescription(..)
> > method.
> > 4.  Use the newer fast dictionary instead of the old one.
> > -- The basic equivalent of the old *CSV annotator is
> > BsvRareWordDictionary.  It takes a single parameter "bsvPath".  Instead of
> > comma-separated values it wants Bar-separated values in the format
> > Cui|Synonym or Cui|Tui|Synonym
> > -- One misconception that people seem to have is that the "fast"
> > dictionary is faster but less accurate.  Actually, it is faster and more
> > accurate.  Speed was the greater difference and that name stuck.
> >
> > There may be other solutions, but those are what come to mind right now.
> >
> > Sean
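
To make the two row layouts named above concrete (Cui|Synonym and Cui|Tui|Synonym), here is a minimal sketch of splitting one such line. It is not the BsvRareWordDictionary implementation. The class BsvLineSketch, its Entry type, and the choice to skip malformed rows are assumptions made for illustration, and the real reader may accept more than this sketch does.

```java
import java.util.Optional;

/** Hypothetical parser for illustration; not the cTAKES BsvRareWordDictionary code. */
final class BsvLineSketch {

    /** One dictionary row; tui is null when the line has only Cui|Synonym. */
    static final class Entry {
        final String cui;
        final String tui;
        final String synonym;

        Entry(String cui, String tui, String synonym) {
            this.cui = cui;
            this.tui = tui;
            this.synonym = synonym;
        }
    }

    /** Parses Cui|Synonym or Cui|Tui|Synonym; returns empty for anything else. */
    static Optional<Entry> parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return Optional.empty();
        }
        // The bar is a regex metacharacter, so it must be escaped; -1 keeps empty trailing columns.
        String[] cols = trimmed.split("\\|", -1);
        String cui = cols[0].trim();
        String synonym = cols[cols.length - 1].trim();
        if (cui.isEmpty() || synonym.isEmpty()) {
            return Optional.empty();
        }
        if (cols.length == 2) {
            return Optional.of(new Entry(cui, null, synonym));
        }
        if (cols.length == 3) {
            return Optional.of(new Entry(cui, cols[1].trim(), synonym));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        parse("C0000001|T01|Fumador 2 paquetes")
                .ifPresent(e -> System.out.println(e.cui + " / " + e.tui + " / " + e.synonym));
    }
}
```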
> > ________________________________________
> > From: Siamak Barzegar <barzegar.siamak@gmail.com>
> > Sent: Tuesday, June 25, 2019 11:46 AM
> > To: dev@ctakes.apache.org
> > Subject: Re: Accessing the External Resource from the UimaContext without
> > Using XML descriptor [EXTERNAL]
> >
> > Thanks Sean,
> >
> > But that seems to work only for setting parameters, not external
> > resources.
> > Please see this file:
> >
> >
> > https://github.com/apache/ctakes/blob/ctakes-4.0.0/ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorCSV.xml
> >
> > It has several externalResourceDependency entries that must be bound to
> > an externalResource. How can I do that with the PipelineBuilder? Do you
> > have any suggestions?
> >
> > From Tutorial.ex6 in the UIMA examples:
> >
> > "When the Analysis Engine is initialized, it creates a single instance of
> > StringMapResource_impl and loads it with the contents of the data file.
> > This means that the framework calls the instance's load method, passing it
> > an instance of DataResource, from which you can obtain a stream or URI/URL
> > of the external resource that was declared in the external resource..."
> >
> > How can I do the same for the resource dependencies in
> > DictionaryLookupAnnotatorCSV.xml?
> >
> > With Best Wishes,
> > Siamak
> >
> >
> > On Tue, 25 Jun 2019 at 16:38, Finan, Sean <
> > Sean.Finan@childrens.harvard.edu>
> > wrote:
> >
> > > Hi Siamak,
> > >
> > > Good question.  Yet another shortfall in the documentation ...
> > >
> > > There are several ways to set parameters in the  PipelineBuilder.
> > >
> > > The javadocs for the 4.0.0 release version are here:
> > >
> > >
> > > http://ctakes.apache.org/apidocs/4.0.0/
> > >
> > > You can use the set(..) method to set "global" values, or place
> > > component-specific values using the add(..) method.
> > >
> > > The PipelineBuilder in trunk has the additional method:
> > > setIfEmpty(..)        Just like set(..) except any given attributes are
> > > ignored if they already have values
> > >
> > > In addition, the add( component, parameters... ) in trunk has been changed
> > > to:
> > > add( component, views, parameters ).
> > > Views are usually used for training ml models.  To use add(..) like the
> > > original (without special views) specify add( component,
> > > Collections.emptyList(), parameters ).   The method usage add( component )
> > > still exists.  Apparently I was too lazy to properly refactor the method
> > > with the original signature ...
> > >
> > > I hope that helps,
> > > Sean
> > >
> > > ________________________________________
> > > From: Siamak Barzegar <barzegar.siamak@gmail.com>
> > > Sent: Tuesday, June 25, 2019 9:23 AM
> > > To: dev@ctakes.apache.org
> > > Subject: Accessing the External Resource from the UimaContext without
> > > Using XML descriptor [EXTERNAL]
> > >
> > > I would like to use different cTAKES components with the PipelineBuilder
> > > (exactly as in HelloWorldBuilderRunner.java).
> > > The problem is that (as I understand it) PipelineBuilder does not read the
> > > XML descriptor of the component. I want to use the Dictionary Lookup
> > > component (DictionaryLookupAnnotatorCSV.xml) after the following components:
> > >
> > >          PipelineBuilder builder = new PipelineBuilder();
> > >          builder
> > >               .add( SimpleSegmentAnnotator.class )
> > >               .add( SentenceDetector.class )
> > >               .add( TokenizerAnnotator.class )
> > >               // Java class file of DictionaryLookupAnnotatorCSV.xml is:
> > >               .add( DictionaryLookupAnnotator.class );
> > >
> > > But in the DictionaryLookupAnnotatorCSV.xml file, there are several
> > > external resources that DictionaryLookupAnnotator needs to read:
> > >
> > > public void initialize(UimaContext aContext) {
> > >   iv_context = aContext;
> > >   ....
> > >   FileResource fResrc = (FileResource) iv_context.getResourceObject("LookupDescriptor");
> > >   ...
> > >   iv_lookupSpecSet = LookupParseUtilities.parseDescriptor(descFile, iv_context);
> > > }
> > >
> > > So, what is the best way to access these resources
> > > (LookupDescriptorFile, DictionaryFileResource, RxnormIndex and
> > > OrangeBookIndex) declared in DictionaryLookupAnnotatorCSV.xml from code?
> > >
> > > Thanks a lot.
> > > Siamak
> > >
> >
> >
> > --
> > Siamak Barzegar, PhD.
> > Senior Research Engineer.
> > Biomedical Text Mining Unit.
> > Barcelona Supercomputing Centre
> >
>
>
> --
> Siamak Barzegar, PhD.
> Senior Research Engineer.
> Biomedical Text Mining Unit.
> Barcelona Supercomputing Centre
>
