ctakes-dev mailing list archives

From Tim Miller <timothy.mil...@childrens.harvard.edu>
Subject Re: sentence detector newline behavior
Date Thu, 23 May 2013 17:52:05 GMT
OK, I've started on this. I was able to get training working on a very
small example and will try a slightly bigger one next.

On 05/22/2013 08:03 AM, Jörn Kottmann wrote:
> On 05/22/2013 01:17 PM, Miller, Timothy wrote:
>> That's awesome! It might be worth trying at least. How does the training
>> process change? Previously the training data would be one sentence per
>> line, but with newlines as possible mid-sentence characters that could
>> be trouble, is there a new representation for training data? Or would we
>> have to use the training api?
> Good point. Yes, that will be a problem with the default training
> format, but it shouldn't be hard to solve. In the format itself we
> could define a newline tag, e.g. <NEWLINE>, to mark newlines.
> As a hack to make it work with 1.5.3, you could instead use a special
> char as a replacement for the newline char.
> When you pass the text down to the sentence detector, a simple string
> replace could be used to convert all newline chars to the special
> newline marker char.
> If things work out for you performance-wise as well, we will just
> integrate it properly into OpenNLP for the next release.
> Could you produce a sentence detector training file with a newline
> marker char?
> You should try to pick a char you can also pass in on a terminal;
> otherwise you have to use the API to train the model. The built-in
> cross validation could be used to evaluate the performance.
> Jörn
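
A minimal sketch of the marker-char workaround described above, written in
plain Python rather than against the OpenNLP API. The marker "|" and the
function names encode_newlines and to_training_lines are hypothetical
choices for illustration; any single char that never occurs in the corpus
and can be typed on a terminal should work. Replacing one char with one
char keeps character offsets unchanged, so sentence spans found on the
encoded text should map straight back to the original text.

```python
# Hypothetical marker; pick any single char that is absent from the corpus
# and easy to pass on a terminal.
NEWLINE_MARKER = "|"


def encode_newlines(text, marker=NEWLINE_MARKER):
    """Replace every newline with the marker char, one for one."""
    if len(marker) != 1:
        raise ValueError("marker must be a single character to preserve offsets")
    if marker in text:
        raise ValueError("marker already occurs in the text; choose another")
    return text.replace("\n", marker)


def to_training_lines(sentences, marker=NEWLINE_MARKER):
    """Build one-sentence-per-line training text.

    Newlines inside a sentence become the marker, so the only real
    newlines left in the output are the sentence separators.
    """
    return "".join(encode_newlines(s, marker) + "\n" for s in sentences)


if __name__ == "__main__":
    note = "Patient denies\nchest pain. Follow up\nin 2 weeks."
    print(encode_newlines(note))
    print(to_training_lines(["Patient denies\nchest pain.", "Follow up\nin 2 weeks."]), end="")
```

The same encode step would be applied to the document text just before it
is handed to the sentence detector, so that training and prediction see the
same representation.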
