lucene-java-user mailing list archives

From Benson Margulies <ben...@basistech.com>
Subject Re: Text dependent analyzer
Date Fri, 17 Apr 2015 17:06:19 GMT
If you want tokenization to depend on sentences, and you insist on
being inside Lucene, you have to be a Tokenizer. Your tokenizer can
set an attribute on the token that ends a sentence. Then, downstream,
filters can read ahead to get the full sentence, buffering
tokens as needed.
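
To make the idea concrete, here is a minimal plain-Java sketch of the pattern. It is not Lucene code: java.text.BreakIterator stands in for a real sentence detector such as OpenNLP, and the Token class with its endsSentence flag is a hypothetical stand-in for a custom attribute. In an actual Lucene chain, I would expect the flag to live in a custom Attribute set by the Tokenizer, with a downstream TokenFilter buffering tokens via captureState()/restoreState().

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceAwareTokens {

    /** Hypothetical token type: a term plus an "ends sentence" flag. */
    static final class Token {
        final String term;
        final boolean endsSentence;

        Token(String term, boolean endsSentence) {
            this.term = term;
            this.endsSentence = endsSentence;
        }
    }

    /** "Tokenizer" stage: split into sentences, then words, flagging each sentence's last token. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            // Naive word splitting on anything that is not a letter or digit.
            List<String> words = new ArrayList<>();
            for (String w : text.substring(start, end).split("[^\\p{L}\\p{Nd}]+")) {
                if (!w.isEmpty()) {
                    words.add(w.toLowerCase(Locale.ROOT));
                }
            }
            for (int i = 0; i < words.size(); i++) {
                tokens.add(new Token(words.get(i), i == words.size() - 1));
            }
        }
        return tokens;
    }

    /** "Filter" stage: buffer tokens until the flag is seen, then release a whole sentence. */
    static List<List<String>> bufferBySentence(Iterable<Token> stream) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        for (Token t : stream) {
            buffer.add(t.term);
            if (t.endsSentence) {
                sentences.add(buffer);
                buffer = new ArrayList<>();
            }
        }
        if (!buffer.isEmpty()) {
            sentences.add(buffer); // trailing tokens with no sentence-end flag
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<Token> tokens = tokenize("The cat sat. It purred loudly.");
        System.out.println(bufferBySentence(tokens));
    }
}
```

The point of the flag is that the downstream stage never needs to see the raw text: it only needs to know where each sentence ends, so per-sentence processing can run on the buffered tokens.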



On Fri, Apr 17, 2015 at 1:00 PM, Ahmet Arslan <iorixxx@yahoo.com.invalid> wrote:
> Hi Hummel,
>
> There was an effort to bring OpenNLP capabilities to Lucene:
> https://issues.apache.org/jira/browse/LUCENE-2899
>
> Lance was working on it to keep it up-to-date. But it looks like it is not always best
> to accomplish everything inside Lucene.
> I personally would do the sentence detection outside of Lucene.
>
> By the way, I remember there was a way to consume the entire upstream token stream.
>
> I think it consumed all input and emitted one huge concatenated term/token.
>
> KeywordTokenizer has similar behaviour. It injects a single token.
> http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/analysis/KeywordAnalyzer.html
>
> Ahmet
>
>
> On Wednesday, April 15, 2015 3:12 PM, Shay Hummel <shay.hummel@gmail.com> wrote:
> Hi Ahmet,
> Thank you for the reply,
> That's exactly what I am doing. At the moment, to index a document, I break
> it into sentences, and each sentence is analyzed (lemmatization, stopword
> removal, etc.).
> Now, what I am looking for is a way to create an analyzer (a class which
> extends Lucene's Analyzer). This analyzer will be used for both index and query
> processing. Like the EnglishAnalyzer, it will receive the text and
> produce tokens.
> The Analyzer API requires implementing createComponents, which
> does not depend on the text being analyzed. This is problematic because, as you know,
> OpenNLP sentence breaking depends on the text it gets (OpenNLP uses the
> model files to provide the span of each sentence and then breaks the text at those spans).
> Is there a way around it?
>
> Shay
>
>
> On Wed, Apr 15, 2015 at 3:50 AM Ahmet Arslan <iorixxx@yahoo.com.invalid>
> wrote:
>
>> Hi Hummel,
>>
>> You can perform sentence detection outside of Solr, using OpenNLP for
>> instance, and then feed the sentences to Solr.
>>
>> https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect
>>
>> Ahmet
>>
>>
>>
>>
>> On Tuesday, April 14, 2015 8:12 PM, Shay Hummel <shay.hummel@gmail.com>
>> wrote:
>> Hi
>> I would like to create a text-dependent analyzer.
>> That is, *given a string*, the analyzer will:
>> 1. Read the entire text and break it into sentences.
>> 2. Tokenize each sentence, remove possessives, lowercase the tokens,
>> mark terms, and stem.
>>
>> The second part is essentially what happens in the EnglishAnalyzer
>> (createComponents). However, createComponents does not depend on the text it
>> receives, which is what the first part requires.
>>
>> So ... How can it be achieved?
>>
>> Thank you,
>>
>> Shay Hummel
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
