ctakes-dev mailing list archives

From "Savova, Guergana" <Guergana.Sav...@childrens.harvard.edu>
Subject RE: sentence detector newline behavior
Date Tue, 21 May 2013 15:53:11 GMT
The OpenNLP sentence segmenter is trained on clinical data (cannot remember exactly how many
sentences were in the training corpus). This is the model distributed with cTAKES. The only
hard rule is the new line.

-----Original Message-----
From: Steven Bethard [mailto:steven.bethard@Colorado.EDU] 
Sent: Tuesday, May 21, 2013 11:38 AM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior

On May 21, 2013, at 9:02 AM, Tim Miller <timothy.miller@childrens.harvard.edu> wrote:
> I think the whole reason to use a machine learning approach for 
> sentence detection should be to help weigh evidence with these cases 
> where hard rules cause problems, mainly 1) when a period does not end 
> a sentence, but also 2) where a newline does and does not mean end of sentence.

Perhaps we should consider re-training the OpenNLP sentence segmenter on some clinical data?
Presumably we can get sentences from the TreeBank annotations.

I don't know much about the OpenNLP sentence segmenter though. Does it only classify on periods?
We'd want to classify all periods and newlines. And we'd want to add features that capture
patterns like "XXX: YYY".
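
A rough sketch of what "classify all periods and newlines" might look like on the feature side. This is not the OpenNLP API; the function names, the feature names, and the header regex are all made up for illustration, and the features would still need to be fed to whatever classifier is actually trained.

```python
import re
from typing import Dict, List

# Hypothetical pattern for lines shaped like "XXX: YYY" (e.g. "Heart Rate: normal").
HEADER_RE = re.compile(r"^\s*[A-Za-z][A-Za-z /]*:\s*\S")


def candidate_boundaries(text: str) -> List[int]:
    """Offsets of every period and newline; each one is a candidate sentence end."""
    return [i for i, ch in enumerate(text) if ch in ".\n"]


def boundary_features(text: str, i: int) -> Dict[str, object]:
    """Illustrative features for the candidate boundary at offset i."""
    line_start = text.rfind("\n", 0, i) + 1
    line_end = text.find("\n", i)
    if line_end == -1:
        line_end = len(text)
    following_end = text.find("\n", line_end + 1)
    if following_end == -1:
        following_end = len(text)

    current_line = text[line_start:line_end]
    next_line = text[line_end + 1:following_end]
    prev_tokens = text[line_start:i].split()

    return {
        "char": "newline" if text[i] == "\n" else "period",
        # Last token before the candidate, e.g. "and" at the end of a wrapped line.
        "prev_token": prev_tokens[-1].lower() if prev_tokens else "",
        "current_line_is_header": bool(HEADER_RE.match(current_line)),
        "next_line_is_header": bool(HEADER_RE.match(next_line)),
    }
```

With features like these, a newline after "Heart Rate: normal" and a newline after a dangling "and" would look different to the classifier, which is the distinction a hard rule cannot make.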


> It is of course bad that in your example if you don't put a sentence 
> break you will think that "extravascular findings" is negated. But it 
> is also bad if you put a sentence break immediately after the word 
> "and" at the end of a line and then you find that your language model 
> thinks that "and <eos>" is a good bigram.
> I will create a jira for the parameter thing, and try to implement it 
> and see if it gets ok results with the existing model.
> Tim
> On 05/21/2013 10:11 AM, Masanz, James J. wrote:
>> +1 for adding a boolean parameter, or perhaps instead a list of 
>> section IDs
>> The sentence detector model was trained on data that always breaks at carriage returns.
>> It is important for text that is a list something like this:
>> Heart Rate: normal
>> ENT: negative
>> EXTRAVASCULAR FINDINGS: Severe prostatic enlargement.
>> And without breaking on the line ending, the word negative would 
>> negate extravascular findings
>> -----Original Message-----
>> From: dev-return-1605-Masanz.James=mayo.edu@ctakes.apache.org 
>> [mailto:dev-return-1605-Masanz.James=mayo.edu@ctakes.apache.org] On 
>> Behalf Of Miller, Timothy
>> Sent: Tuesday, May 21, 2013 7:07 AM
>> To: dev@ctakes.apache.org
>> Subject: sentence detector newline behavior
>> The sentence detector always ends a sentence where there are newlines.
>> This is a problem for some notes (e.g. MIMIC radiology notes) where a 
>> line can wrap in the middle of a sentence at specified character 
>> offsets. In the comments for SentenceDetector, it seems to be split 
>> up very logically in that it first runs the opennlp sentence 
>> detector, then breaks any detected sentence wherever there is a newline. Questions:
>> 1) Would it be good to add a boolean parameter for breaking on newlines?
>> 2) If that section was removed/avoided, does the opennlp sentence 
>> detector give good results given our model? Or is the model trained 
>> on text that always breaks at carriage returns?
>> Tim
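
For illustration, the two-stage behavior described above (model output first, then a split at every newline) and the proposed boolean switch could be sketched as follows. This is not the cTAKES SentenceDetector code: the function name, the `break_on_newline` parameter name, and the span representation are assumptions, and the first-stage spans are simply passed in rather than produced by a real model.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def split_on_newlines(text: str, spans: List[Span],
                      break_on_newline: bool = True) -> List[Span]:
    """Second stage: optionally break each detected sentence at line endings.

    `spans` stands in for the output of the first-stage sentence model.
    """
    if not break_on_newline:
        return list(spans)

    result: List[Span] = []
    for begin, end in spans:
        start = begin
        for i in range(begin, end):
            if text[i] in "\r\n":
                # Emit the piece before the line ending, skipping blank pieces.
                if start < i and text[start:i].strip():
                    result.append((start, i))
                start = i + 1
        if start < end and text[start:end].strip():
            result.append((start, end))
    return result
```

With the flag on, the list example from the thread comes out as three sentences, so "negative" stays with "ENT". With the flag off, whatever the model produced is returned unchanged, which is what a wrapped MIMIC-style note would need, provided the model itself handles newlines sensibly.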