lucene-solr-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Composition of multiple smaller fields into another larger field?
Date Thu, 08 May 2008 22:48:22 GMT

: 1) Is there an existing feature, approach, mechanism, ... to get this 
: done that I'm just not aware of?

Currently, the only way this can be done on the Solr side is with an 
UpdateProcessor.

: 2) Assuming that #1 is 'no', then would this be a generally useful 
: feature to add in? If so how would people like this to be done?

This is very similar to an idea i've had floating around that i 
mentioned recently...

http://www.nabble.com/Tokenize-integers--to17040305.html#a17075603

i've been considering the idea of allowing fieldTypes to declare an 
<analyzer type="stored" ...> which would be used to preprocess the stored 
value using normal Tokenizers and TokenFilters to do anything people 
wanted -- the resulting tokens would be treated as normal multi-valued 
field values are today (obviously a "ConcatFilter" and some 
"FormatFilters" would be needed for cases where you want to tear something 
down, mangle the pieces, and then build it back up).  Using things 
like the TeeFilter and the SinkTokenizer, optimizations could be made when 
you want some common processing for both the "stored" values and the 
"indexed" values.

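As a rough illustration of that "stored" analysis chain, here is a minimal sketch in plain Python. It does not use the Lucene or Solr APIs; the names comma_tokenizer, trim_filter, float_format_filter and stored_values are all hypothetical, and the sample input is made up. It only shows the shape of the idea: one raw value goes through a tokenizer and some filters, and each surviving token becomes one value of a multi-valued stored field.

```python
from typing import Callable, Iterable, List

# Hypothetical stand-ins for a Tokenizer and TokenFilters; not Lucene classes.
Tokenizer = Callable[[str], Iterable[str]]
TokenFilter = Callable[[Iterable[str]], Iterable[str]]


def comma_tokenizer(raw: str) -> Iterable[str]:
    """Split the raw stored value on commas, one token per piece."""
    return iter(raw.split(","))


def trim_filter(tokens: Iterable[str]) -> Iterable[str]:
    """Strip surrounding whitespace and drop tokens that end up empty."""
    for tok in tokens:
        tok = tok.strip()
        if tok:
            yield tok


def float_format_filter(tokens: Iterable[str]) -> Iterable[str]:
    """Normalize each token to a canonical float string."""
    for tok in tokens:
        yield repr(float(tok))


def stored_values(raw: str, tokenizer: Tokenizer,
                  filters: List[TokenFilter]) -> List[str]:
    """Run the chain; every resulting token is one stored field value."""
    stream = tokenizer(raw)
    for token_filter in filters:
        stream = token_filter(stream)
    return list(stream)


values = stored_values("1.5, 2,  3.25", comma_tokenizer,
                       [trim_filter, float_format_filter])
print(values)  # ['1.5', '2.0', '3.25']
```
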
The specific example you give...

:     <composeField source="{city}, {state} {zipcode}" dest="suggest_full"  />

...is an interesting one.  i had only been considering reformatting values 
(ie: comma separated floats become real SortableFloats, dates get parsed 
from alternate formats, numbers extracted from paragraphs of text, 
etc...) but i hadn't really considered how it would interact with 
copyField.
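
For reference, the effect the proposed composeField line is asking for can be sketched in a few lines of plain Python. This is only an assumption about the intended semantics (substitute each {name} with that field's value from the same document); compose_field is a hypothetical helper and the sample document is made up.

```python
def compose_field(template: str, doc: dict) -> str:
    """Fill each {field} placeholder from the document's own fields.

    Raises KeyError if the template names a field the document lacks.
    """
    return template.format_map(doc)


# Illustrative document only; the field names follow the quoted example.
doc = {"city": "Portland", "state": "OR", "zipcode": "97201"}
doc["suggest_full"] = compose_field("{city}, {state} {zipcode}", doc)
print(doc["suggest_full"])  # Portland, OR 97201
```
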

right now i don't think Solr guarantees that copyFields will be applied in 
any set order, but if we said that they would be evaluated in the order 
they are declared in your schema.xml, then what you describe could be done 
using the idea i had with something like...

  <fieldType name="fullAddr" class="StrField">
   <analyzer type="stored">
     <!--don't need any tokenizing of atomic values -->
     <tokenizer class="KeywordTokenizer"/>
     <!-- new buffering filter, waits until it's got enough tokens 
          to fill the format.  has options to decide what to do if 
          not enough tokens are received, or more come after 
      -->
     <tokenfilter class="FormatFilter" format="{0}, {1} {2}" />
   </analyzer>
   ...

  <copyField source="city"  dest="suggest_full" />
  <copyField source="state" dest="suggest_full" />
  <copyField source="zip"   dest="suggest_full" />
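
The buffering behaviour described in the comment above can be sketched in plain Python. This is not a Lucene TokenFilter, just a model of the logic: the FormatFilter class, its slots argument and its on_short option ("drop" or "pad") are assumptions made for illustration, and it relies on the copyField values arriving in declaration order, which Solr does not promise today.

```python
from typing import Iterable, Iterator, List


class FormatFilter:
    """Buffers tokens until there are enough to fill the format string."""

    def __init__(self, fmt: str, slots: int, on_short: str = "drop"):
        if on_short not in ("drop", "pad"):
            raise ValueError("on_short must be 'drop' or 'pad'")
        self.fmt = fmt
        self.slots = slots          # number of positional placeholders
        self.on_short = on_short    # what to do with a short final group

    def filter(self, tokens: Iterable[str]) -> Iterator[str]:
        buf: List[str] = []
        for tok in tokens:
            buf.append(tok)
            if len(buf) == self.slots:
                # Enough tokens: emit one formatted value, start a new group.
                yield self.fmt.format(*buf)
                buf = []
        if buf and self.on_short == "pad":
            # Too few tokens at the end: pad the group with empty strings.
            buf += [""] * (self.slots - len(buf))
            yield self.fmt.format(*buf)


# copyField declarations, assumed to be applied in the order declared.
copy_fields = [("city", "suggest_full"),
               ("state", "suggest_full"),
               ("zip", "suggest_full")]

# Illustrative document only.
doc = {"city": "Portland", "state": "OR", "zip": "97201"}

incoming = [doc[src] for src, dest in copy_fields if dest == "suggest_full"]
result = list(FormatFilter("{0}, {1} {2}", slots=3).filter(incoming))
print(result)  # ['Portland, OR 97201']
```

With on_short="drop" a document missing one of the three source fields produces no suggest_full value at all; "pad" emits the partial address instead.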


...but i freely admit, i haven't thought the idea all the way through, let 
alone for the use case you describe (and in general, i still haven't 
convinced myself (ab)using Analyzers to process "stored" text isn't dirty 
and morally wrong)


-Hoss

