ctakes-dev mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: sentence detector newline behavior
Date Mon, 20 Jan 2014 13:25:03 GMT
Hi all,

I currently have quite a bit of time to work on OpenNLP and would like
to help you out with this issue.

Here is the follow-up issue for this change:

I am still trying to figure out the best way to implement this.
In the training data a user could just use a special tag to identify the 

Instead of <NEWLINE> it might be better to use <CR> and <LF> to encode 
these two chars
in the training data. Any thoughts?
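
A minimal sketch of how such an encoding might look, assuming the training file stays one sentence per line and the tags are plain text substitutions. The class and method names are hypothetical and not part of OpenNLP, and the sketch does no escaping, so text that already contains a literal <CR> or <LF> would not round-trip:

```java
public class NewlineTags {

    // Encode carriage return and line feed as tags so that a sentence
    // containing line breaks still fits on one line of the training file.
    static String encode(String sentence) {
        return sentence.replace("\r", "<CR>").replace("\n", "<LF>");
    }

    // Restore the original characters when the training line is read back.
    static String decode(String line) {
        return line.replace("<CR>", "\r").replace("<LF>", "\n");
    }

    public static void main(String[] args) {
        String sentence = "Patient denies chest\npain.";
        // prints: Patient denies chest<LF>pain.
        System.out.println(encode(sentence));
    }
}
```

Using two tags rather than a single <NEWLINE> keeps Windows-style and Unix-style line endings distinguishable in the training data.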

I am planning to release this as part of OpenNLP 1.6.0.


On 05/22/2013 02:03 PM, Jörn Kottmann wrote:
> On 05/22/2013 01:17 PM, Miller, Timothy wrote:
>> That's awesome! It might be worth trying at least. How does the training
>> process change? Previously the training data would be one sentence per
>> line, but with newlines as possible mid-sentence characters that could
>> be trouble, is there a new representation for training data? Or would we
>> have to use the training api?
> Good point, yes that will be a problem with the default training 
> format, but it shouldn't be hard
> to solve. In the format itself we could define a new line tag, e.g. 
> <NEWLINE>, to mark new lines.
> As a hack to make it work with 1.5.3, you could instead use a special 
> char as a replacement
> for the new line char.
> When you pass the text down to the sentence detector a simple string 
> replace could be used to
> convert all new line chars to the special new line marker char.
> If things work out for you performance-wise as well, we will just 
> integrate it properly into OpenNLP
> for the next release.
> Could you produce a sentence detector training file with a new line 
> marker char?
> You should try to pick a char you can also pass in on a terminal; 
> otherwise you have to use the
> API to train the model. The built-in cross validation could be used to 
> evaluate the performance.
> Jörn
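
The string-replace workaround described in the quoted message could be sketched roughly as follows. The marker char and all names here are assumptions chosen for illustration; the sketch uses only plain Java string operations and leaves the actual call into the sentence detector out:

```java
public class NewlineMarker {

    // Hypothetical marker (pilcrow sign); any char that never occurs in
    // the corpus and can be typed on a terminal would do.
    static final char MARKER = '\u00B6';

    // Replace every line feed with the marker before the text is handed
    // to the sentence detector. The replacement is one char for one char,
    // so character offsets of detected sentences still match the input.
    static String mark(String text) {
        if (text.indexOf(MARKER) >= 0) {
            throw new IllegalArgumentException("text already contains the marker char");
        }
        return text.replace('\n', MARKER);
    }

    // Undo the replacement on detected sentences if the original line
    // breaks are needed downstream.
    static String unmark(String text) {
        return text.replace(MARKER, '\n');
    }

    public static void main(String[] args) {
        String note = "Patient denies chest\npain. No fever.";
        String marked = mark(note);
        System.out.println(marked.length() == note.length());
        System.out.println(unmark(marked).equals(note));
    }
}
```

The same substitution would have to be applied when producing the training file, so that the model sees the marker char in the positions where line breaks occur.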
