ctakes-dev mailing list archives

From vijay garla <vnga...@gmail.com>
Subject Re: sentence detector newline behavior
Date Mon, 27 Jan 2014 23:36:43 GMT
The opennlp model doesn't split on newlines - that is being done in the
analysis engine. An alternative implementation of the analysis engine that
does not split on newlines is available in the ytex branch.

It works, as in it really, really works.  The hard constraint you refer to is
not in the opennlp model.
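
To make the distinction concrete, here is a minimal sketch of the two behaviors, using the made-up wrapped text quoted below. The detector is a toy stand-in that only looks at '.', '?' and '!'; it is not the opennlp model, and the class and method names are hypothetical. It only illustrates how a newline split applied outside the model produces fragments that the model alone would not.

```java
import java.util.ArrayList;
import java.util.List;

class NewlineSplitSketch {

    // Toy stand-in for a sentence model: ends a sentence after '.', '?' or '!'
    // and treats a newline as ordinary whitespace. NOT the opennlp model.
    static List<String> toyDetect(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            current.append(c == '\n' ? ' ' : c);
            if (c == '.' || c == '?' || c == '!') {
                addIfNotBlank(sentences, current.toString());
                current.setLength(0);
            }
        }
        addIfNotBlank(sentences, current.toString());
        return sentences;
    }

    // Hard constraint applied outside the model: split on newlines first,
    // then run the detector on each line separately.
    static List<String> hardSplitThenDetect(String text) {
        List<String> sentences = new ArrayList<>();
        for (String line : text.split("\n")) {
            sentences.addAll(toyDetect(line));
        }
        return sentences;
    }

    private static void addIfNotBlank(List<String> out, String s) {
        String trimmed = s.trim();
        if (!trimmed.isEmpty()) {
            out.add(trimmed);
        }
    }

    public static void main(String[] args) {
        String note = "The patient was having difficulty\n"
                    + "getting out of bed and was taking\n"
                    + "aspirin in the morning. He has\n"
                    + "returned today for a prescription\n"
                    + "for something stronger.";
        System.out.println(hardSplitThenDetect(note).size()); // 6 fragments
        System.out.println(toyDetect(note).size());           // 2 sentences
    }
}
```

With the hard split, every wrapped line becomes at least one "sentence"; without it, the same toy detector returns the two sentences a reader would expect.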


On Monday, January 27, 2014, Tim Miller <
timothy.miller@childrens.harvard.edu> wrote:

> On 01/27/2014 06:03 PM, vijay garla wrote:
>> For clarity, I'd like to stress that the opennlp sentence model distributed
>> with ctakes today does 'work' with sentences that span newlines - as I
>> understand it, this model ignores newline tokens (or newlines are not
>> provided as features to that model).
> Well, it depends on your definition of "works" :). It doesn't throw an
> exception but it automatically splits sentences at newlines. It is
> relatively normal to have text that "wraps" at ~80 characters with newlines
> added. It will look like this (this is made up text):
>    The patient was having difficulty
>    getting out of bed and was taking
>    aspirin in the morning. He has
>    returned today for a prescription
>    for something stronger.
> This style will cause multiple sentence fragments to be encoded which, as
> we've seen, will wreak havoc with negation detection.
>> I believe the improvements Tim and others are suggesting are for a new
>> sentence model + feature representation that takes advantage of newlines
>> as features.
> To be precise, I'm proposing adding newlines to the set of characters that
> are candidates for end of sentences (i.e. decision points for the
> classifier), instead of having the hard constraint of splitting at all
> newlines.
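
The idea of treating a newline as a candidate rather than a forced break can be sketched as follows. This is only an illustration of candidate scanning: the class and method names are hypothetical, and the classifier that would accept or reject each candidate is not shown.

```java
import java.util.ArrayList;
import java.util.List;

class CandidateScanSketch {

    // Returns the offset of every character that is a candidate sentence end.
    // A classifier (not shown) would then accept or reject each candidate,
    // so a candidate is a decision point, not a guaranteed split.
    static List<Integer> candidatePositions(String text, String eosChars) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (eosChars.indexOf(text.charAt(i)) >= 0) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "He has\nreturned today.\nPlan: rest";
        // Punctuation only: the newlines are never considered.
        System.out.println(candidatePositions(text, ".?!"));   // [21]
        // Newline added to the candidate set: the classifier gets to decide.
        System.out.println(candidatePositions(text, ".?!\n")); // [6, 21, 22]
    }
}
```

In this example the newline at offset 6 falls mid-sentence and the one at offset 22 follows a period, which is the kind of distinction a trained classifier would have to learn from the surrounding context.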
>> Whatever we do, I believe we need backwards compatibility - those who are
>> using the current sentence model may need to continue using it.  To that
>> end:
>> * If we upgrade to the newest version of opennlp, will the old model work
>> (and produce the same results)?
> I definitely think we shouldn't release a new model that doesn't perform
> well in some absolute sense. But I think this change generalizes the old
> model, so that given that it meets that absolute standard a user should
> only see improvements. Specifically they should see fewer incorrect
> sentence fragments if they give us text with newlines in mid-sentence.
> IMHO, that kind of change doesn't require 'backwards compatibility' per se.
> Maybe we can make it an option to have a hard constraint that breaks on
> newlines but I think it should default to not do so.
>> * If a contributor trains a new model that uses a different feature
>> representation, I believe that should go into a new Sentence Detector
>> AnalysisEngine (or the same AE but with different configuration
>> parameters), so users have a choice between the old and the new.
> Yeah, I think having configuration parameters is fine as long as we have
> smart defaults.
> Thanks for your input VJ.
> Tim
>> -vj
> On Mon, Jan 27, 2014 at 1:09 PM, digital paula <cybersation@hotmail.com>
> wrote:
> Tim,
> I just had to chime in on a comment you made.    My deadline has been
> extended a bit on my pressing issue but I do intend to get back to testing
> per VJ's fix or maybe another fix is in the works based on latest
> emails...I need to read them again since a lot has been stated on the
> issue.
> Okay, as a new user (working w/cTAKES since October) I have never thought
> what you had stated:
>   "And I think this is the kind of thing that can leave new users
> scratching their heads and doubting our overall competence."
> Yeah, the sentence-spanning-newline issue was a problem, so I just brought
> attention to it with my post earlier this month asking about VJ's fix from
> last month, and worked around it by treating the narrative as one string.
> Anyone who's looked at the code would appreciate and acknowledge that
> cTAKES is a powerful and complex application.  I'm overall impressed with
> it and I intend to continue to use it, improve it, and grow with it.  I've
> been delving deeper into cTAKES on the machine learning aspect...I'm
> struggling a bit with it and if anything I scratch my head and doubt my
> competence. ;-)
> Regards,
> Paula
>  Date: Mon, 27 Jan 2014 09:52:00 -0500
> From: timothy.miller@childrens.harvard.edu
> To: dev@ctakes.apache.org
> Subject: Re: sentence detector newline behavior
> OK, with the most recent version I am able to replicate the performance
> I was getting before. Thanks a lot Jörn!
> Assuming this is in the next incremental release of opennlp, how quickly
> can we get a re-trained model into cTAKES? I heard from a researcher at
> AMIA who tried cTAKES and because of this bug in the way we handle
> sentences was trying to find an outside sentence detector as a
> preprocess to cTAKES, and frankly that is insane. We should be able to
> get something this simple right. And I think this is the kind of thing
> that can leave new users scratching their heads and doubting our overall
> competence.
> James, I believe you are usually the one who rebuilds the models? What
> would be the best way to incorporate the data I have that has some
> instances of non-sentence terminating newlines?
> Tim
> On 01/27/2014 06:10 AM, Jörn Kottmann wrote:
> On 01/26/2014 11:29 PM, Miller, Timothy wrote:
> Yes, this fixes the whitespace sentence issue but the evaluation issue
> remains. I believe the problem is in SentenceSampleStream, where in the
> following block the whitespace trim happens before the <LF> character is
> replaced with the \n character. So test sentences that ended with <LF>
> will be one character longer than they should be.
>     sentence = sentence.trim();
>     sentence = replaceNewLineEscapeTags(sentence);
>     sentencesString.append(sentence);
>     int end = sentencesString.length();
>     sentenceSpans.add(new Span(begin, end));
>     sentencesString.append(' ');
> Yes, that must be the issue. During training the new line is included
> in the span, and during detection the white space remover creates a span
> without the new line char.
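
The one-character discrepancy described above can be reproduced in isolation. This is a minimal sketch: `replaceNewLineEscapeTags` here is a hypothetical stand-in that is assumed to turn the `<LF>` tag into a real newline, and the second ordering is shown only to make the difference visible, not as the fix that was applied in opennlp.

```java
class TrimOrderSketch {

    // Hypothetical stand-in for the helper named in the quoted snippet;
    // assumed here to replace the "<LF>" escape tag with a real newline.
    static String replaceNewLineEscapeTags(String s) {
        return s.replace("<LF>", "\n");
    }

    // Order as quoted: trim first, then unescape. A trailing "<LF>" is not
    // whitespace yet, so it survives the trim and becomes a trailing '\n'.
    static String trimThenUnescape(String s) {
        return replaceNewLineEscapeTags(s.trim());
    }

    // Reverse order, for comparison only: the newline exists before the
    // trim and is therefore removed by it.
    static String unescapeThenTrim(String s) {
        return replaceNewLineEscapeTags(s).trim();
    }

    public static void main(String[] args) {
        String sample = "He has returned today.<LF>";
        System.out.println(trimThenUnescape(sample).length()); // 23
        System.out.println(unescapeThenTrim(sample).length()); // 22
    }
}
```

The reference span built from the first ordering ends one character later than a span that excludes the trailing newline, which is enough for an exact span comparison to count the sentence as wrong.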
> I suggest that the evaluator just ignores white space differences
> between sentences. My test case then
> has the expected performance numbers.
> What do you think?
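
One way to read the suggestion above is to shrink both the reference span and the predicted span to exclude surrounding whitespace before comparing them. The sketch below assumes that reading; the class, the method names, and the use of plain offsets instead of the opennlp Span type are all hypothetical.

```java
class SpanCompareSketch {

    // Shrinks the half-open range [begin, end) so that it excludes
    // leading and trailing whitespace in the given text.
    static int[] trimSpan(String text, int begin, int end) {
        while (begin < end && Character.isWhitespace(text.charAt(begin))) {
            begin++;
        }
        while (end > begin && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return new int[] {begin, end};
    }

    // Two spans match if they cover the same text once whitespace at
    // either edge is ignored.
    static boolean sameIgnoringWhitespace(String text,
                                          int refBegin, int refEnd,
                                          int predBegin, int predEnd) {
        int[] ref = trimSpan(text, refBegin, refEnd);
        int[] pred = trimSpan(text, predBegin, predEnd);
        return ref[0] == pred[0] && ref[1] == pred[1];
    }

    public static void main(String[] args) {
        String text = "He has returned today.\nPlan: rest.";
        // Reference span includes the newline (0..23); prediction does not (0..22).
        System.out.println(sameIgnoringWhitespace(text, 0, 23, 0, 22)); // true
        // A genuinely different boundary is still reported as a mismatch.
        System.out.println(sameIgnoringWhitespace(text, 0, 23, 0, 21)); // false
    }
}
```

A comparison like this hides the trailing-newline difference while still penalizing a prediction that cuts the sentence at the wrong character.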
