lucene-java-user mailing list archives

From Ziqi Zhang <>
Subject Re: tokenize into sentences/sentence splitter
Date Wed, 23 Sep 2015 15:39:28 GMT
Thanks, that is understood.

My application is a bit unusual in that I need both an indexed field with 
standard tokenization and an unindexed but stored field of sentences. Both 
must be present for each document.

I could possibly make do with PatternTokenizer, but that is, of course, less 
accurate than e.g. wrapping the OpenNLP sentence splitter in a Lucene 
Tokenizer.

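As a rough stand-in for a trained sentence detector, the JDK's own java.text.BreakIterator can split text into sentences with no extra dependencies. This is only a sketch of the splitting step (the class name and helper are hypothetical, and BreakIterator is less accurate than OpenNLP's SentenceDetectorME, which needs a trained model file), but it shows the shape of the logic a Tokenizer subclass would wrap:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: splits text into sentences using the JDK's
// BreakIterator. A stand-in for an OpenNLP sentence detector, which
// is more accurate but requires a trained model.
public class SentenceSplitter {
    public static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                 start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }
}
```

A custom Tokenizer would call something like this from incrementToken(), emitting one term per sentence.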
On 23/09/2015 16:23, Doug Turnbull wrote:
> Sentence recognition is usually an NLP problem. Probably best handled
> outside of Solr. For example, you probably want to train and run a sentence
> recognition algorithm, inject a sentence delimiter, then use that delimiter
> as the basis for tokenization.
> More info on sentence recognition
> On Wed, Sep 23, 2015 at 11:18 AM, Ziqi Zhang <>
> wrote:
>> Hi
>> I need a special kind of 'token' which is a sentence, so I need a
>> tokenizer that splits texts into sentences.
>> I wonder whether such an implementation, or something similar, already exists?
>> If I have to implement it myself, I suppose I need to write a subclass
>> of Tokenizer. Having looked at a few existing implementations, it does not
>> look straightforward. A few pointers would be highly appreciated.
>> Many thanks
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
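The delimiter-injection approach Doug describes above can be sketched in plain Java: detect sentence boundaries up front, join the sentences with a character that will not occur in normal text, and split on that character later. The class below is illustrative only (the names are hypothetical, and the JDK's BreakIterator stands in for a trained sentence-recognition model); the final split mimics what a pattern-based tokenizer configured with the same delimiter would do:

```java
import java.text.BreakIterator;
import java.util.Locale;

// Sketch of delimiter injection: find sentence boundaries during
// preprocessing, mark them with an unlikely delimiter character,
// then tokenize on that delimiter.
public class DelimiterInjector {
    // U+2029 PARAGRAPH SEPARATOR: very unlikely to appear in input text.
    public static final String DELIM = "\u2029";

    // Preprocessing step: mark sentence boundaries with DELIM.
    public static String inject(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        StringBuilder out = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                 start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                if (out.length() > 0) out.append(DELIM);
                out.append(s);
            }
        }
        return out.toString();
    }

    // Tokenization step: what a pattern tokenizer splitting on DELIM
    // would produce, one token per sentence.
    public static String[] tokenize(String injected) {
        return injected.split(DELIM);
    }
}
```

At index time the same effect could likely be had by configuring Lucene's PatternTokenizer with the delimiter as its pattern, so only the injection step needs custom code.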

Ziqi Zhang
Research Associate
Department of Computer Science
University of Sheffield

