lucene-solr-user mailing list archives

From Roxana Danger <roxana.dan...@gmail.com>
Subject Re: Reusable tokenstream
Date Thu, 23 Nov 2017 10:18:41 GMT
That's great!! Got it.
Thank you very much.


On Wed, Nov 22, 2017 at 5:07 PM, Emir Arnautović <emir.arnautovic@sematext.com> wrote:

> Hi Roxana,
> The idea with the update request processor is to have the following parameters:
> * inputField - document field with the text to analyse
> * sharedAnalysis - field type with the shared analysis definition
> * targetFields - comma-separated list of fields where results should be
> stored
> * fieldSpecificAnalysis - comma-separated list of field types that define
> the field-specific analysis (when reusing the schema, each will have an
> extra tokenizer that should be ignored)
>
> Your update processor uses TeeSinkTokenFilter to create tokens for each
> field, but you do not write those tokens to the index. Instead, you add new
> fields to the document where each token becomes a new value (or you can
> concatenate the tokens and use a whitespace tokenizer in the indexing
> analysis chain of the target field). You can then remove inputField from
> the document.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
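A rough sketch of the processor Emir describes, assuming the Lucene 5.x/6.x TeeSinkTokenFilter and TypeTokenFilter APIs. The class name, the field names, and the "VERB"/"ADJ" type labels are hypothetical illustrations, not part of the thread:

```java
import java.io.IOException;
import java.util.Collections;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.TypeTokenFilter;
import org.apache.lucene.analysis.sinks.TeeSinkTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class SharedAnalysisProcessor extends UpdateRequestProcessor {
  private final Analyzer sharedAnalyzer; // built from the sharedAnalysis field type
  private final String inputField;       // e.g. "text"

  public SharedAnalysisProcessor(Analyzer sharedAnalyzer, String inputField,
                                 UpdateRequestProcessor next) {
    super(next);
    this.sharedAnalyzer = sharedAnalyzer;
    this.inputField = inputField;
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Object value = doc.getFieldValue(inputField);
    if (value != null) {
      try (TokenStream source = sharedAnalyzer.tokenStream(inputField, value.toString())) {
        TeeSinkTokenFilter tee = new TeeSinkTokenFilter(source);
        TokenStream verbsSink = tee.newSinkTokenStream();
        TokenStream adjectivesSink = tee.newSinkTokenStream();

        // Run the shared analysis exactly once; both sinks cache the tokens.
        tee.reset();
        while (tee.incrementToken()) { }
        tee.end();

        // Replay the cached tokens through field-specific filters and add
        // the surviving terms as new multi-valued field values.
        addTokens(doc, "verbs",
            new TypeTokenFilter(verbsSink, Collections.singleton("VERB"), true));
        addTokens(doc, "adjectives",
            new TypeTokenFilter(adjectivesSink, Collections.singleton("ADJ"), true));
      }
      doc.removeField(inputField); // drop the analysed source field
    }
    super.processAdd(cmd);
  }

  private static void addTokens(SolrInputDocument doc, String field, TokenStream ts)
      throws IOException {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      doc.addField(field, term.toString());
    }
    ts.end();
    ts.close();
  }
}
```

The key point of the sketch is that the expensive shared analysis runs once when the tee is drained; each sink only replays cached token states through its cheap field-specific filter.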
>
>
>
> > On 22 Nov 2017, at 17:46, Roxana Danger <roxana.danger@gmail.com> wrote:
> >
> > Hi Emir,
> > In this case, I need more control at the Lucene level, so I have to use
> > the Lucene IndexWriter directly. So, I cannot use Solr for importing.
> > Or, is there any way I can add a tokenstream to a SolrInputDocument (or is
> > there any other class exposed by Solr during indexing that I can use for
> > this purpose)?
> > Am I correct, or am I still missing something?
> > Thank you.
> >
> >
> > On Wed, Nov 22, 2017 at 11:33 AM, Emir Arnautović <emir.arnautovic@sematext.com> wrote:
> >
> >> Hi Roxana,
> >> I think you can use
> >> https://lucene.apache.org/core/5_4_0/analyzers-common/org/apache/lucene/analysis/sinks/TeeSinkTokenFilter.html
> >> as suggested earlier.
> >>
> >> HTH,
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection
> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >>> On 22 Nov 2017, at 11:43, Roxana Danger <roxana.danger@gmail.com> wrote:
> >>>
> >>> Hi Emir,
> >>> Many thanks for your reply.
> >>> The UpdateProcessor can do this work, but is
> >>> analyzer.reusableTokenStream
> >>> <https://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/Analyzer.html#reusableTokenStream(java.lang.String, java.io.Reader)>
> >>> the way to obtain a previously generated tokenstream? Is it guaranteed
> >>> to give access to the token stream and not reconstruct it?
> >>> Thanks,
> >>> Roxana
> >>>
> >>>
> >>> On Wed, Nov 22, 2017 at 10:26 AM, Emir Arnautović <emir.arnautovic@sematext.com> wrote:
> >>>
> >>>> Hi Roxana,
> >>>> I don’t think that it is possible. In some cases (yours seems like a
> >>>> good fit) you could create a custom update request processor that would
> >>>> do the shared analysis (you can have it defined in the schema) and,
> >>>> after the analysis, use those tokens to create new values for those two
> >>>> fields and remove the source value (or flag it as ignored in the
> >>>> schema).
> >>>>
> >>>> HTH,
> >>>> Emir
> >>>> --
> >>>> Monitoring - Log Management - Alerting - Anomaly Detection
> >>>> Solr & Elasticsearch Consulting Support Training -
> http://sematext.com/
> >>>>
> >>>>
> >>>>
> >>>>> On 22 Nov 2017, at 11:09, Roxana Danger <roxana.danger@gmail.com> wrote:
> >>>>>
> >>>>> Hello all,
> >>>>>
> >>>>> I would like to reuse the tokenstream generated for one field to
> >>>>> create a new tokenstream (adding a few filters to the existing one)
> >>>>> for another field, without re-running the whole analysis.
> >>>>>
> >>>>> The particular application is:
> >>>>> - I have a field *tokens* that uses an analyzer that generates the
> >>>>> tokens (and maintains the token type attributes).
> >>>>> - I would like to have two new fields: *verbs* and *adjectives*.
> >>>>> These should reuse the tokenstream generated for the field *tokens*
> >>>>> and filter the verbs and adjectives into the respective fields.
> >>>>>
> >>>>> Is this feasible? How should it be implemented?
> >>>>>
> >>>>> Many thanks.
> >>>>
> >>>>
> >>
> >>
>
>
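For the verbs/adjectives use case that opens the thread, filtering an existing tokenstream by its type attribute could look roughly like this. The part-of-speech analyzer and the "VERB" type label are assumptions taken from the description, not a confirmed setup:

```java
import java.io.IOException;
import java.util.Collections;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.TypeTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class VerbFilterSketch {
  // Prints only the tokens whose type attribute is "VERB". Assumes
  // posAnalyzer sets part-of-speech names as token types.
  public static void printVerbs(Analyzer posAnalyzer, String text) throws IOException {
    TokenStream tokens = posAnalyzer.tokenStream("tokens", text);
    // Whitelist mode (last argument true): keep "VERB" tokens, drop the rest.
    try (TokenStream verbs =
             new TypeTokenFilter(tokens, Collections.singleton("VERB"), true)) {
      CharTermAttribute term = verbs.addAttribute(CharTermAttribute.class);
      verbs.reset();
      while (verbs.incrementToken()) {
        System.out.println(term.toString());
      }
      verbs.end();
    } // closing the filter also closes the wrapped stream
  }
}
```

The same pattern with a second TypeTokenFilter over another sink (as in TeeSinkTokenFilter's javadoc) would populate the *adjectives* field without re-tokenizing the input.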
