ctakes-dev mailing list archives

From Tim Miller <timothy.mil...@childrens.harvard.edu>
Subject Re: sentence detector newline behavior
Date Tue, 21 May 2013 15:02:17 GMT
I think the whole reason to use a machine learning approach for sentence 
detection is to help weigh the evidence in cases where hard rules cause 
problems: mainly 1) when a period does not end a sentence, but also 
2) when a newline may or may not mark the end of a sentence. It is of 
course bad that, in your example, without a sentence break 
"extravascular findings" would be treated as negated. But it is also 
bad to put a sentence break immediately after the word "and" at the end 
of a line, and then find that your language model thinks "and <eos>" is 
a good bigram.

I will create a JIRA issue for the boolean parameter, try to implement 
it, and see whether it gives OK results with the existing model.
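
A minimal Java sketch of the kind of switch being discussed, for anyone following the thread. This is not the actual cTAKES SentenceDetector code: the class name, the postProcess method, and the breakOnNewlines flag are all hypothetical, and the input sentences are assumed to come from an upstream detector such as OpenNLP. The wrapped sentence in main is made up for illustration; the list example is the one quoted below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the real cTAKES SentenceDetector.
class NewlineBreakSketch {

    /**
     * Post-processes sentences that an upstream detector already found.
     *
     * @param detected        sentence texts from the upstream detector (assumed input)
     * @param breakOnNewlines hypothetical flag: true splits every sentence at each
     *                        newline; false keeps the upstream boundaries and
     *                        joins wrapped lines with a single space
     */
    static List<String> postProcess(List<String> detected, boolean breakOnNewlines) {
        List<String> out = new ArrayList<>();
        for (String sentence : detected) {
            if (breakOnNewlines) {
                // Every line becomes its own sentence; blank lines are dropped.
                for (String piece : sentence.split("\\r?\\n")) {
                    String trimmed = piece.trim();
                    if (!trimmed.isEmpty()) {
                        out.add(trimmed);
                    }
                }
            } else {
                // Trust the upstream boundaries; treat newlines as plain whitespace.
                String joined = sentence.replaceAll("\\s*\\r?\\n\\s*", " ").trim();
                if (!joined.isEmpty()) {
                    out.add(joined);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // List-style text, where breaking on newlines is the desired behavior.
        String listText = "Heart Rate: normal\nENT: negative\n"
                + "EXTRAVASCULAR FINDINGS: Severe prostatic enlargement.";
        // Made-up sentence that wraps mid-sentence, as in fixed-width notes.
        String wrappedText = "No focal consolidation and\nno pleural effusion.";

        System.out.println(postProcess(Arrays.asList(listText), true));
        System.out.println(postProcess(Arrays.asList(wrappedText), false));
    }
}
```

With the flag on, "ENT: negative" stays in its own sentence, so it cannot negate the following line. With the flag off, the wrapped sentence is not cut after "and". Whether the existing model gives good boundaries with the flag off is the open question in this thread; the sketch does not answer it.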

On 05/21/2013 10:11 AM, Masanz, James J. wrote:
> +1 for adding a boolean parameter, or perhaps instead a list of section IDs.
> The sentence detector model was trained on data that always breaks at carriage returns.
> This is important for text that is a list, something like this:
> Heart Rate: normal
> ENT: negative
> EXTRAVASCULAR FINDINGS: Severe prostatic enlargement.
> Without breaking on the line ending, the word "negative" would negate "extravascular findings".
> -----Original Message-----
> From: dev-return-1605-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-1605-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Miller, Timothy
> Sent: Tuesday, May 21, 2013 7:07 AM
> To: dev@ctakes.apache.org
> Subject: sentence detector newline behavior
> The sentence detector always ends a sentence where there are newlines.
> This is a problem for some notes (e.g. MIMIC radiology notes) where a
> line can wrap in the middle of a sentence at specified character
> offsets. According to the comments in SentenceDetector, the logic is
> split cleanly into two steps: it first runs the OpenNLP sentence detector,
> then breaks any detected sentence wherever there is a newline. Questions:
> 1) Would it be good to add a boolean parameter for breaking on newlines?
> 2) If that newline-breaking step were removed or skipped, would the OpenNLP
> sentence detector give good results with our model? Or is the model trained
> on text that always breaks at carriage returns?
> Tim
