lucene-solr-user mailing list archives

From Bill Au <bill.w...@gmail.com>
Subject Re: Using Lucene's payload in Solr
Date Fri, 21 Aug 2009 16:04:42 GMT
I ended up not using an XML attribute for the payload, since I need to return
the payload in the query response.  So I went with:

<field name="title">2.0|Solr In Action</field>

My payload is numeric, so I can pick a non-numeric delimiter (i.e. '|').
Putting the payload in front means I don't have to worry about the delimiter
appearing in the value.  The payload is required in my case, so I can simply
split at the first occurrence of the delimiter; any later occurrence is just
part of the value.
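
As a rough illustration of that parsing rule, here is a minimal Java sketch.
It is not the actual tokenizer; the class name PayloadPrefix and its parse
method are made up for this example and are not part of Lucene or Solr.

```java
// Hypothetical helper for illustration only; not a Lucene/Solr class.
final class PayloadPrefix {
    final float payload; // numeric payload found before the delimiter
    final String text;   // field value found after the delimiter

    private PayloadPrefix(float payload, String text) {
        this.payload = payload;
        this.text = text;
    }

    // Split at the FIRST delimiter only, so later delimiters stay in the value.
    static PayloadPrefix parse(String raw, char delimiter) {
        int idx = raw.indexOf(delimiter);
        if (idx < 0) {
            // The payload is required, so a missing delimiter is an error.
            throw new IllegalArgumentException("missing payload delimiter: " + raw);
        }
        float payload = Float.parseFloat(raw.substring(0, idx));
        return new PayloadPrefix(payload, raw.substring(idx + 1));
    }
}
```

With this sketch, parsing "2.0|Solr In Action" with '|' as the delimiter
gives a payload of 2.0 and the text "Solr In Action".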

I ended up writing a custom Tokenizer, plus a copy field with a
PatternTokenizerFactory to filter out the delimiter and payload.  That is
straightforward in terms of implementation.  On top of that, I can still use
the CSV loader, which I really like because of its speed.
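
The kind of pattern the copy field needs can be sketched with plain
java.util.regex.  This is only an assumption about what such a regex might
look like; it is not the schema configuration I actually used, and the class
name PayloadStripper is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a Lucene/Solr class.
final class PayloadStripper {
    // Matches everything from the start of the input up to and
    // including the first '|' (the payload and its delimiter).
    private static final Pattern PREFIX = Pattern.compile("^[^|]*\\|");

    // Remove the leading "payload|" part; input without a '|' is unchanged.
    static String strip(String raw) {
        return PREFIX.matcher(raw).replaceFirst("");
    }
}
```

Because the pattern is anchored at the start and stops at the first '|', a
delimiter that appears later in the value is left alone.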

Bill.

On Thu, Aug 20, 2009 at 10:36 PM, Chris Hostetter
<hossman_lucene@fucit.org> wrote:

>
> : of the field are correct but the delimiter and payload are stored so they
> : appear in the response also.  Here is an example:
>         ...
> : I am thinking maybe I can do this instead when indexing:
> :
> : XML for indexing:
> : <field name="title" payload="2.0">Solr In Action</field>
> :
> : This will simplify indexing as I don't have to repeat the payload for each
>
> but now you're into a custom request handler for the updates to deal with
> the custom XML attribute so you can't use DIH, or CSV loading.
>
> It seems like it might be simpler to have two new (generic) UpdateProcessors:
> one that can clone fieldA into fieldB, and one that can do regex mutations
> on fieldB ... neither needs to know about payloads at all, but the first
> can make a copy of "2.0|Solr In Action" and the second can strip off the
> "2.0|" from the copy.
>
> then you can write a new NumericPayloadRegexTokenizer that takes in two
> regex expressions -- one that knows how to extract the payload from a
> piece of input, and one that specifies the tokenization.
>
> those three classes seem easier to implement, easier to maintain, and more
> generally reusable than a custom xml request handler for your updates.
>
>
> -Hoss
>
>
