ctakes-dev mailing list archives

From Tim Miller <timothy.mil...@childrens.harvard.edu>
Subject Re: sentence detector newline behavior
Date Mon, 27 Jan 2014 23:41:27 GMT

On 01/27/2014 06:36 PM, vijay garla wrote:
> The opennlp model doesn't split on newlines - that is being done in the
> analysis engine. An alternative implementation of the analysis engine that
> does not split on newlines is available in the ytex branch.
> It works, as in really, really works. The hard constraint you refer to is
> not in the opennlp model.
Oh yeah, for sure. I'm talking about fixing the ctakes issue, though, so
that the default out-of-the-box behavior is what a user expects. If that
means using the ytex sentence splitter as the default, that would be fine
with me.

> Vj
> On Monday, January 27, 2014, Tim Miller <
> timothy.miller@childrens.harvard.edu> wrote:
>> On 01/27/2014 06:03 PM, vijay garla wrote:
>>> For clarity, I'd like to stress that the opennlp sentence model distributed
>>> with ctakes today does 'work' with sentences that span newlines - as I
>>> understand it, this model ignores newline tokens (or newlines are not
>>> provided as features to that model).
>> Well, it depends on your definition of "works" :). It doesn't throw an
>> exception but it automatically splits sentences at newlines. It is
>> relatively normal to have text that "wraps" at ~80 characters with newlines
>> added. It will look like this (this is made up text):
>>     The patient was having difficulty
>>     getting out of bed and was taking
>>     aspirin in the morning. He has
>>     returned today for a prescription
>>     for something stronger.
>> This style will cause multiple sentence fragments to be encoded which, as
>> we've seen, will wreak havoc with negation detection.
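
The fragmenting effect described above can be reproduced with a small sketch. This is not the cTAKES analysis engine code; the class and method names are hypothetical, and the period-based split is a deliberately crude stand-in for the sentence model. The only point is that forcing a break at every newline turns two wrapped sentences into six fragments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the cTAKES SentenceDetector implementation.
class HardNewlineSplitDemo {

    // Forces a break at every newline, then splits each line after a period
    // that is followed by whitespace (a crude stand-in for the sentence model).
    static List<String> splitWithHardNewlines(String text) {
        List<String> fragments = new ArrayList<>();
        for (String line : text.split("\n")) {
            for (String piece : line.split("(?<=\\.)\\s+")) {
                String trimmed = piece.trim();
                if (!trimmed.isEmpty()) {
                    fragments.add(trimmed);
                }
            }
        }
        return fragments;
    }

    public static void main(String[] args) {
        String text = "The patient was having difficulty\n"
                + "getting out of bed and was taking\n"
                + "aspirin in the morning. He has\n"
                + "returned today for a prescription\n"
                + "for something stronger.";
        // Two real sentences come out as six fragments.
        for (String fragment : splitWithHardNewlines(text)) {
            System.out.println("[" + fragment + "]");
        }
    }
}
```
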
>>> I believe the improvements Tim and others are suggesting are for a new
>>> sentence model + feature representation that takes advantage of newlines as
>>> features.
>> To be precise, I'm proposing adding newlines to the set of characters that
>> are candidates for end of sentences (i.e. decision points for the
>> classifier), instead of having the hard constraint of splitting at all
>> newlines.
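
A minimal sketch of that proposal, under stated assumptions: the newline is added to the candidate characters, and a decision function is consulted at each candidate instead of breaking unconditionally. The `toyDecision` rule below is a hand-written placeholder for the trained classifier, and every name here is hypothetical rather than taken from cTAKES or OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical sketch; the real decision would come from a trained model.
class CandidateNewlineDemo {

    // Candidate end-of-sentence characters. '\n' is a candidate, not a forced break.
    static final String CANDIDATES = ".?!\n";

    // Toy placeholder for the classifier (assumption, not the real model):
    // a newline ends a sentence only before a blank line; punctuation ends a
    // sentence when followed by whitespace or the end of the text.
    static boolean toyDecision(String text, int i) {
        if (text.charAt(i) == '\n') {
            return i + 1 < text.length() && text.charAt(i + 1) == '\n';
        }
        return i + 1 == text.length() || Character.isWhitespace(text.charAt(i + 1));
    }

    // Walks the text and asks the decision function at every candidate character.
    static List<String> detect(String text, BiPredicate<String, Integer> isBoundary) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (CANDIDATES.indexOf(text.charAt(i)) >= 0 && isBoundary.test(text, i)) {
                addSentence(sentences, text.substring(start, i + 1));
                start = i + 1;
            }
        }
        addSentence(sentences, text.substring(start));
        return sentences;
    }

    // Collapses internal whitespace (including wrapped newlines) for display.
    private static void addSentence(List<String> out, String raw) {
        String s = raw.replaceAll("\\s+", " ").trim();
        if (!s.isEmpty()) {
            out.add(s);
        }
    }

    public static void main(String[] args) {
        String text = "The patient was having difficulty\n"
                + "getting out of bed and was taking\n"
                + "aspirin in the morning. He has\n"
                + "returned today for a prescription\n"
                + "for something stronger.";
        detect(text, CandidateNewlineDemo::toyDecision).forEach(System.out::println);
    }
}
```

With the wrapped example text from this thread, the toy rule keeps each sentence whole across the line wraps. A hard newline constraint could still be offered as an option by passing a decision function that always returns true for '\n'.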
>>> Whatever we do, I believe we need backwards compatibility - those who are
>>> using the current sentence model may need to continue using it.  To that
>>> end:
>>> * If we upgrade to the newest version of opennlp, will the old model work
>>> (and produce the same results)?
>> I definitely think we shouldn't release a new model that doesn't perform
>> well in some absolute sense. But I think this change generalizes the old
>> model, so that given that it meets that absolute standard a user should
>> only see improvements. Specifically they should see fewer incorrect
>> sentence fragments if they give us text with newlines in mid-sentence.
>> IMHO, that kind of change doesn't require 'backwards compatibility' per se.
>> Maybe we can make it an option to have a hard constraint that breaks on
>> newlines but I think it should default to not do so.
>>> * If a contributor trains a new model that uses a different feature
>>> representation, I believe that should go into a new Sentence Detector
>>> AnalysisEngine (or the same AE but with different configuration
>>> parameters), so users have a choice between the old and the new.
>> Yeah, I think having configuration parameters is fine as long as we have
>> smart defaults.
>> Thanks for your input VJ.
>> Tim
>>> -vj
>> On Mon, Jan 27, 2014 at 1:09 PM, digital paula <cybersation@hotmail.com>
>> wrote:
>> Tim,
>> I just had to chime in on a comment you made. My deadline has been
>> extended a bit on my pressing issue but I do intend to get back to testing
>> per VJ's fix or maybe another fix is in the works based on latest
>> emails...I need to read them again since a lot has been stated on the
>> issue.
>> Okay, as a new user (working w/cTAKES since October) I have never thought
>> what you had stated:
>>    "And I think this is the kind of thing that can leave new users
>> scratching their heads and doubting our overall competence."
>> Yeah, the sentence-spanning-newline issue was a problem so I just brought
>> attention to it by my post of inquiry earlier this month on VJ's fix from
>> last month and worked around it with treating narrative as one string.
>> Anyone who's looked at the code would appreciate and acknowledge that
>> cTAKES is a powerful and complex application.  I'm overall impressed with
>> it and I intend to continue to use it, improve it, and grow with it.  I've
>> been delving deeper into cTAKES on the machine learning aspect...I'm
>> struggling a bit with it and if anything I scratch my head and doubt my
>> competence. ;-)
>> Regards,
>> Paula
>> Date: Mon, 27 Jan 2014 09:52:00 -0500
>> From: timothy.miller@childrens.harvard.edu
>> To: dev@ctakes.apache.org
>> Subject: Re: sentence detector newline behavior
>> OK, with the most recent version I am able to replicate the performance
>> I was getting before. Thanks a lot Jörn!
>> Assuming this is in the next incremental release of opennlp, how quickly
>> can we get a re-trained model into cTAKES? I heard from a researcher at
>> AMIA who tried cTAKES and because of this bug in the way we handle
>> sentences was trying to find an outside sentence detector as a
>> preprocess to cTAKES, and frankly that is insane. We should be able to
>> get something this simple right. And I think this is the kind of thing
>> that can leave new users scratching their heads and doubting our overall
>> competence.
>> James, I believe you are usually the one who rebuilds the models? What
>> would be the best way to incorporate the data I have that has some
>> instances of non-sentence terminating newlines?
>> Tim
>> On 01/27/2014 06:10 AM, Jörn Kottmann wrote:
>> On 01/26/2014 11:29 PM, Miller, Timothy wrote:
>> Yes, this fixes the whitespace sentence issue but the evaluation issue
>> remains. I believe the problem is in SentenceSampleStream, where in the
>> following block the whitespace trim happens before the <LF> character is
>> replaced with the \n character. So test sentences that ended with <LF>
>> will be one character longer than they should be.
>>         sentence = sentence.trim();
>>         sentence = replaceNewLineEscapeTags(sentence);
>>         sentencesString.append(sentence);
>>         int end = sentencesString.length();
>>         sentenceSpans.add(new Span(begin, end));
>>         sentencesString.append(' ');
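
The ordering effect can be seen in isolation with the sketch below. `replaceTags` is a simplified stand-in for `replaceNewLineEscapeTags` (assumption: it only handles `<LF>`), and the class and method names are hypothetical. Reordering the two calls is shown only for comparison; it is not necessarily the fix that was adopted.

```java
// Hypothetical demo of the call order discussed above; not OpenNLP code.
class TrimOrderDemo {

    // Simplified stand-in for replaceNewLineEscapeTags (assumption: "<LF>" only).
    static String replaceTags(String s) {
        return s.replace("<LF>", "\n");
    }

    // Order in the quoted block: trim() runs while the newline is still the
    // literal text "<LF>", so the newline survives at the end of the sentence.
    static String trimThenReplace(String s) {
        return replaceTags(s.trim());
    }

    // Alternative order, for comparison: the newline exists before trim() runs,
    // so trim() strips it.
    static String replaceThenTrim(String s) {
        return replaceTags(s).trim();
    }

    public static void main(String[] args) {
        String sample = "He has returned today.<LF>";
        System.out.println(trimThenReplace(sample).length());
        System.out.println(replaceThenTrim(sample).length());
    }
}
```

The first length is one greater than the second, which matches the one-character difference in the test spans.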
>> Yes, that must be the issue. During training the newline is included
>> in the span, and during detection the whitespace remover creates a span
>> without the newline char.
>> I suggest that the evaluator just ignores white space differences
>> between sentences. My test case then
>> has the expected performance numbers.
>> What do you think?
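
One way to read the suggestion above is to shrink both the reference and the predicted span past any leading or trailing whitespace before comparing them. The sketch below does that with plain `int[] {begin, end}` pairs; it is an assumption about how such a comparison could look, not the evaluator code in OpenNLP, and all names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of a whitespace-insensitive span comparison.
class WhitespaceInsensitiveSpans {

    // Returns {begin, end} moved inward past leading and trailing whitespace.
    static int[] trimSpan(String text, int begin, int end) {
        while (begin < end && Character.isWhitespace(text.charAt(begin))) {
            begin++;
        }
        while (end > begin && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return new int[] {begin, end};
    }

    // Two spans match if they cover the same non-whitespace extent of the text.
    static boolean sameIgnoringWhitespace(String text, int[] a, int[] b) {
        return Arrays.equals(trimSpan(text, a[0], a[1]), trimSpan(text, b[0], b[1]));
    }

    public static void main(String[] args) {
        String text = "He has returned.\n Next one.";
        int[] reference = {0, 17}; // includes the trailing newline
        int[] predicted = {0, 16}; // newline already stripped
        System.out.println(sameIgnoringWhitespace(text, reference, predicted));
    }
}
```
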