ctakes-dev mailing list archives

From Tim Miller <timothy.mil...@childrens.harvard.edu>
Subject Re: apostrophe and sentence detector
Date Mon, 26 Aug 2013 16:34:49 GMT
Ah, so we might suspect that some of those 7 lines in the file were 
indeed followed by newlines in the original training data. In the
absence of more or better training data that would help us learn this, I
think it would be reasonable to restore the list of sentence-breaking
characters so that it does not include the apostrophe. It seems rare for a
sentence to end on one, and my preference is to accidentally call two
sentences one sentence rather than split one sentence in the
middle. I think that is probably better for downstream processing.
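To make the trade-off concrete, here is a minimal Python sketch. It is not the cTAKES code: the real detector is a trained model, while this toy version breaks at every candidate character. The function name `split_sentences`, the `break_chars` parameter, and the example text are all made up for illustration. It only shows how adding the apostrophe to the candidate set can split one sentence in the middle.

```python
def split_sentences(text, break_chars=".?!"):
    """Toy splitter: always break on newlines, and also break after any
    character in break_chars that is followed by whitespace or end of line.
    (Illustrative only; a trained detector would score each candidate.)"""
    sentences = []
    for line in text.split("\n"):
        start = 0
        for i, ch in enumerate(line):
            at_boundary = i + 1 == len(line) or line[i + 1].isspace()
            if ch in break_chars and at_boundary:
                piece = line[start:i + 1].strip()
                if piece:
                    sentences.append(piece)
                start = i + 1
        # Whatever is left on the line ends at the newline.
        tail = line[start:].strip()
        if tail:
            sentences.append(tail)
    return sentences


# Hypothetical example text, not taken from any training data.
text = "Pt denies chest pain at his parents' home. Follow up in 2 weeks."

without_apostrophe = split_sentences(text)
with_apostrophe = split_sentences(text, break_chars=".?!'")

print(without_apostrophe)  # two sentences
print(with_apostrophe)     # three pieces: the first sentence is cut after "parents'"
```

With the apostrophe excluded, the worst case is that two sentences get merged; with it included, a possessive like "parents'" cuts a sentence in two.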
Just my .02,

On 08/26/2013 12:29 PM, Masanz, James J. wrote:
> The training data is one sentence per line.
> That's how you feed data to the sentence detector.
> -----Original Message-----
> From: dev-return-1884-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-1884-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Tim Miller
> Sent: Monday, August 26, 2013 11:12 AM
> To: dev@ctakes.apache.org
> Subject: Re: apostrophe and sentence detector
> On 08/26/2013 12:05 PM, Masanz, James J. wrote:
>> The recently rebuilt sentence detector (currently in trunk and the 3.1.0 branch)
>> is sometimes taking the apostrophe as a sentence break where the ctakes-3.0.0-incubating model did not.
>> The training data used for the recently rebuilt model contains only 7 lines
>> that end with an apostrophe (single quote).
> Do you mean 7 sentences that end in a single apostrophe or 7 lines? The
> sentence detector will currently break on newlines no matter what, so
> the important number is how many sentences end mid-line with an
> apostrophe, right?
> Tim
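For reference, the count discussed in the quoted messages could be reproduced with something like the sketch below, assuming the training data is available as one sentence per line. The helper name `count_lines_ending_with` and the sample lines are hypothetical; the real training file is not shown in this thread.

```python
def count_lines_ending_with(lines, ch="'"):
    """Count lines whose last non-whitespace character is ch."""
    return sum(1 for line in lines if line.rstrip().endswith(ch))


# Made-up stand-in for a one-sentence-per-line training file.
sample = [
    "He stayed at his parents'",
    "No acute distress.",
    "Patient described the pain as 'sharp'",
    "",
]

n = count_lines_ending_with(sample)
print(n)
```

Because each line is already one sentence, this counts sentences that end in an apostrophe in the training file. It cannot say whether the same text was followed by a newline in the original documents.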
