ctakes-dev mailing list archives

From "Finan, Sean" <Sean.Fi...@childrens.harvard.edu>
Subject FW: Facing issues with cTakes confidence score [EXTERNAL]
Date Tue, 02 Jan 2018 17:35:00 GMT
Just in case somebody else has already done this or has any ideas, I am forwarding the question
and one answer to:
> How do we find out which entity has more relevance to the document? I need this, as we
need to limit our outputs to max 10 terms for one clinical document.


From: Finan, Sean
Sent: Tuesday, January 02, 2018 12:31 PM
To: 'Ratan Sharma'
Subject: RE: Facing issues with cTakes confidence score [EXTERNAL]

Hi Ratan,

There are a couple of things that you can do, but getting down to 10 terms per note will
be difficult.

The first thing to do is go into your resources and edit your dictionary’s setup xml file.
It is either in ctakes-dictionary-lookup-fast-res/ or resources/, depending upon how you are
running.  Go all the way to the end of org/ctakes/dictionary/lookup/fast/.  At the bottom
of the xml file you will see a couple of commented lines, one with “PrecisionTermConsumer”.
Uncomment that line and comment out the line with “DefaultTermConsumer”.  This will limit
mentions so that you get “lung cancer” instead of both “lung cancer” and “cancer” –
“lung cancer” being the more specific disease.  You will still get “lung” in each case
as an anatomical site.

The second thing that you can do is build up a map of counts per CUI.  You can get a map of
cuis and the number of times they appear in the document (Map<String,Long>) with the
following command:
OntologyConceptUtil.getCuiCounts( jCas )

You can sort by the number of appearances and grab the top 10.  Another thing that might help
is filtering out the negated concepts.  Something like:
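// Counts CUIs over non-negated annotations only; sort the result by count to get the top 10.
// Needs java.util.Collection, java.util.function.Function and java.util.stream.Collectors.
Map<String,Long> cuiCountsYes =
JCasUtil.select( jCas, IdentifiedAnnotation.class ).stream()
.filter( ia -> ia.getPolarity() != CONST.NE_POLARITY_NEGATION_PRESENT )
.map( OntologyConceptUtil::getCuis )
.flatMap( Collection::stream )
.collect( Collectors.groupingBy( Function.identity(), Collectors.counting() ) );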
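
The "sort by the number of appearances and grab the top 10" step can be sketched with plain
Java collections.  This is only an illustration: the class name TopCuis, the method topN and
the sample CUI strings are made up for the example and are not part of cTAKES.  It assumes
you already have a Map<String,Long> of CUI counts.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not a cTAKES class.
final class TopCuis {
   private TopCuis() {}

   // Returns up to n CUIs, most frequent first.  Ties are broken alphabetically
   // so that the result is deterministic.
   static List<String> topN( final Map<String, Long> cuiCounts, final int n ) {
      return cuiCounts.entrySet().stream()
            .sorted( Map.Entry.<String, Long>comparingByValue( Comparator.reverseOrder() )
                  .thenComparing( Map.Entry.comparingByKey() ) )
            .limit( n )
            .map( Map.Entry::getKey )
            .collect( Collectors.toList() );
   }

   public static void main( final String[] args ) {
      // Placeholder counts; in a pipeline the map would be built from the jCas.
      final Map<String, Long> counts = new HashMap<>();
      counts.put( "CUI_A", 4L );
      counts.put( "CUI_B", 2L );
      counts.put( "CUI_C", 4L );
      System.out.println( topN( counts, 10 ) );
   }
}
```

Frequency is only a rough proxy for relevance, so the cutoff is probably best applied after
the other filters rather than before.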

Another thing to do would be to filter by subject.  For each identified annotation use
.getSubject().equals( CONST.ATTR_SUBJECT_PATIENT ).
Related to subject, you can filter out identified annotations in sections like family history.
Use JCasUtil.indexCovered( jCas, Segment.class, IdentifiedAnnotation.class ) to get a map from
each segment to the identified annotations that it covers, then filter by checking each
segment’s .getPreferredText().  If the preferred text is “Family Medical History” then you
can probably discount everything in that section.
Likewise, if the mentions are in things like “Patient History” then they may not have
to do with the current encounter.  You can find section names in the ctakes-core-res DefaultSectionRegex.bsv
file.  You will need to have the BsvRegexSectionizer in your pipeline.  I would use the SectionedFastPipeline
piper in ctakes-clinical-pipeline-res and add your custom filtering annotator to the end of it.
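
The subject and section filters can be sketched without any cTAKES types.  Everything below
is a stand-in for illustration: Mention is a hypothetical class holding a CUI, a subject and
the preferred text of the covering section, and the PATIENT string is assumed to match the
value of CONST.ATTR_SUBJECT_PATIENT.  In a real annotator these values would be read from the
IdentifiedAnnotation and its Segment.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not a cTAKES class.
final class MentionFilter {
   private MentionFilter() {}

   // Assumed to equal CONST.ATTR_SUBJECT_PATIENT.
   static final String PATIENT = "patient";

   // Section names to discount, taken from the two examples in this message.
   static final Set<String> SKIPPED_SECTIONS
         = new HashSet<>( Arrays.asList( "Family Medical History", "Patient History" ) );

   // Stand-in for an identified annotation plus the section that covers it.
   static final class Mention {
      final String cui;
      final String subject;
      final String section;

      Mention( final String cui, final String subject, final String section ) {
         this.cui = cui;
         this.subject = subject;
         this.section = section;
      }
   }

   // Keeps the CUIs of mentions that are about the patient and outside the skipped sections.
   static List<String> keepRelevant( final List<Mention> mentions ) {
      return mentions.stream()
            .filter( m -> PATIENT.equals( m.subject ) )   // null-safe subject check
            .filter( m -> !SKIPPED_SECTIONS.contains( m.section ) )
            .map( m -> m.cui )
            .collect( Collectors.toList() );
   }
}
```

Putting the constant first in the equals call avoids a NullPointerException when an annotation
has no subject set.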

Lastly, if you use the temporal modules you can filter by the time relative to the document
time (doc time rel) being overlap or before/overlap.  Use the SectionedTemporalPipeline piper
in ctakes-temporal-res.  Then some code like the following:
if ( annotation instanceof EventMention ) {
   final Event event = ((EventMention)annotation).getEvent();
   if ( event != null ) {
      final EventProperties properties = event.getProperties();
      if ( properties != null ) {
         final String doctimerel = properties.getDocTimeRel();
         // Compare case-insensitively so that both overlap and before/overlap match.
         final boolean keepThisAnnotation = doctimerel != null
               && doctimerel.toUpperCase().contains( "OVERLAP" );
      }
   }
}

That should give you a start.  I am not sure how much each will help, but they are suggestions
of things that you can try.


From: Ratan Sharma [mailto:ratancomp@gmail.com]
Sent: Tuesday, January 02, 2018 11:20 AM
To: Finan, Sean
Subject: Re: Facing issues with cTakes confidence score [EXTERNAL]

Thanks Sean for the reply.

So is there no way we can assign relevance/confidence of entities? How do we find out which
entity has more relevance to the document? I need this, as we need to limit our outputs to
max 10 terms for one clinical document.

Thanks for your time on this. Really appreciate it.

On Tue, Jan 2, 2018 at 9:00 PM, Finan, Sean <Sean.Finan@childrens.harvard.edu> wrote:
Hi Ratan,

What Tim said is absolutely correct.  Those mentions are all discovered by dictionary lookup
procedures.  The default procedure is strict lookup against a term in the dictionary database
and no lookup has any more validity than any other, so “confidence” is pretty meaningless.

Confidence can be introduced by other modules and for various reasons, but for creation of
mentions using standard ctakes that value is never set.


From: Ratan Sharma [mailto:ratancomp@gmail.com]
Sent: Saturday, December 30, 2017 5:23 AM
To: Finan, Sean
Subject: Facing issues with cTakes confidence score [EXTERNAL]

Hi Sean,

Can you please add your thoughts to this query :


I am looking for a way to distinguish which entity has higher weightage than others, like
a relevance score for each entity.

Is it possible for us to have a meeting to discuss this? Any time that suits you is fine with me.

Thank you.
