lucene-solr-user mailing list archives

From Emir Arnautović <emir.arnauto...@sematext.com>
Subject Re: Reusable tokenstream
Date Wed, 22 Nov 2017 17:07:55 GMT
Hi Roxana,
The idea is for the update request processor to take the following parameters:
* inputField - document field with the text to analyse
* sharedAnalysis - field type that holds the shared analysis definition
* targetFields - comma-separated list of fields where the results should be stored
* fieldSpecificAnalysis - comma-separated list of field types that define the specifics for each
target field (if you reuse schema field types here, each will carry an extra tokenizer that should be ignored)

Your update processor uses TeeSinkTokenFilter to create the tokens for each field, but you do
not write those tokens to the index directly. Instead, you add new fields to the document, where
each token becomes a separate value (or you can concatenate the tokens into a single value and use
a whitespace tokenizer in the indexing analysis chain of the target field).
You can then remove inputField from the document.
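
To make the data flow concrete, here is a minimal sketch in plain Java. It is only a model of the idea and deliberately uses no Lucene or Solr classes: the class name SharedAnalysisSketch, the "word/TYPE" input format and all method names are assumptions made up for illustration. In a real processor the analyseOnce step would run the sharedAnalysis field type, and the per-field predicates would be replaced by the filters from fieldSpecificAnalysis.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Plain-JDK model of the processor's data flow; not a Lucene or Solr API. */
class SharedAnalysisSketch {

    /** A token and its type, standing in for a term with a type attribute. */
    static final class Token {
        final String text;
        final String type;

        Token(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    /**
     * Hypothetical stand-in for the shared analysis. It expects whitespace-separated
     * "word/TYPE" pairs and runs exactly once per document.
     */
    static List<Token> analyseOnce(String input) {
        List<Token> tokens = new ArrayList<>();
        for (String part : input.trim().split("\\s+")) {
            int slash = part.lastIndexOf('/');
            if (slash > 0) { // skip empty parts and parts without a word before the slash
                tokens.add(new Token(part.substring(0, slash), part.substring(slash + 1)));
            }
        }
        return tokens;
    }

    /**
     * Fans the shared tokens out to the target fields. Each target field gets the
     * texts of the tokens accepted by its own filter, one value per token.
     */
    static Map<String, List<String>> fanOut(List<Token> shared,
                                            Map<String, Predicate<Token>> fieldFilters) {
        Map<String, List<String>> values = new LinkedHashMap<>();
        for (Map.Entry<String, Predicate<Token>> entry : fieldFilters.entrySet()) {
            List<String> kept = new ArrayList<>();
            for (Token token : shared) {
                if (entry.getValue().test(token)) {
                    kept.add(token.text);
                }
            }
            values.put(entry.getKey(), kept);
        }
        return values;
    }

    /** Alternative storage: one concatenated value for a whitespace-tokenized target field. */
    static String concat(List<String> values) {
        return String.join(" ", values);
    }
}
```

With the input "quick/ADJ fox/NOUN runs/VERB" and two filters, "verbs" accepting type VERB and "adjectives" accepting type ADJ, fanOut returns [runs] for verbs and [quick] for adjectives, while the input is analysed only once. The processor would then add these lists as values of the target fields and drop inputField.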

HTH,
Emir 
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 22 Nov 2017, at 17:46, Roxana Danger <roxana.danger@gmail.com> wrote:
> 
> Hi Emir,
> In this case, I need more control at the Lucene level, so I have to use the
> Lucene index writer directly, which means I cannot use Solr for importing.
> Or is there any way I can add a token stream to a SolrInputDocument (is
> there any other class exposed by Solr during indexing that I can use for
> this purpose)?
> Am I correct, or am I still missing something?
> Thank you.
> 
> 
> On Wed, Nov 22, 2017 at 11:33 AM, Emir Arnautović <
> emir.arnautovic@sematext.com> wrote:
> 
>> Hi Roxana,
>> I think you can use TeeSinkTokenFilter
>> <https://lucene.apache.org/core/5_4_0/analyzers-common/org/apache/lucene/analysis/sinks/TeeSinkTokenFilter.html>,
>> as suggested earlier.
>> 
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 22 Nov 2017, at 11:43, Roxana Danger <roxana.danger@gmail.com> wrote:
>>> 
>>> Hi Emir,
>>> Many thanks for your reply.
>>> The UpdateProcessor can do this work, but is analyzer.reusableTokenStream
>>> <https://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/Analyzer.html#reusableTokenStream(java.lang.String, java.io.Reader)>
>>> the way to obtain a previously generated token stream? Is it
>>> guaranteed to give access to the existing token stream rather than reconstruct it?
>>> Thanks,
>>> Roxana
>>> 
>>> 
>>> On Wed, Nov 22, 2017 at 10:26 AM, Emir Arnautović <
>>> emir.arnautovic@sematext.com> wrote:
>>> 
>>>> Hi Roxana,
>>>> I don’t think that is possible. In some cases (yours seems like a good
>>>> fit) you could create a custom update request processor that does the
>>>> shared analysis (you can define it in the schema) and then uses the
>>>> resulting tokens to create new values for those two fields and removes the
>>>> source value (or flag it as ignored in the schema).
>>>> 
>>>> HTH,
>>>> Emir
>>>> --
>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>> 
>>>> 
>>>> 
>>>>> On 22 Nov 2017, at 11:09, Roxana Danger <roxana.danger@gmail.com> wrote:
>>>>> 
>>>>> Hello all,
>>>>> 
>>>>> I would like to reuse the token stream generated for one field to create a
>>>>> new token stream (adding a few filters to the existing one) for
>>>>> another field, without having to run the whole analysis again.
>>>>> 
>>>>> The particular application is:
>>>>> - I have a field *tokens* whose analyzer generates the tokens (and
>>>>> maintains the token type attributes).
>>>>> - I would like to have two more fields: *verbs* and *adjectives*.
>>>>> These should reuse the token stream generated for the field *tokens* and
>>>>> keep only the verbs and adjectives for the respective fields.
>>>>> 
>>>>> Is this feasible? How should it be implemented?
>>>>> 
>>>>> Many thanks.
>>>> 
>>>> 
>> 
>> 

