lucene-dev mailing list archives

From "Hoss Man (Commented) (JIRA)" <>
Subject [jira] [Commented] (SOLR-2802) Toolkit of UpdateProcessors for modifying document values
Date Sat, 01 Oct 2011 00:23:45 GMT


Hoss Man commented on SOLR-2802:

bq. I already have a FieldCopy processor which can copy/move fields,

Jan: Yeah ... I designed the base class around the assumption that we would come up with
a good "clone fields" processor in SOLR-2599, so that these processors can simply modify the
values "in place" and people can clone/rename fields as needed before using them.

bq. With SOLR-2599, I imagine we could take copyField's out of schema.xml,

Erik: I actually consider them very orthogonal.  Supporting cloning/copying in an update processor
is a way of saying "when docs are added to the index using this Update Chain, take these actions
on the fields" but copyField in schema.xml is a way of saying "no matter where this doc comes
from, the value of field X should also be put in field Y".

bq. Before we get too carried away, what about making this even more general purpose with
scripting, ala SOLR-1725 ?

We definitely should get the Script Processor in for people who don't know java but have specific
goals, but we shouldn't let support for scripting prevent us from implementing some of the
more commonly requested actions in java - there's a fine line between "you _can_ write scripts
to do _anything_ you want" and "you _have_ to write scripts to do _everything_ you want"

bq. There's one other update processor that perhaps could fit within this framework and become
something generally useful in Solr - SOLR-1280

I looked at that one before I actually started.  Because of the "modify in place" nature of
this base class, it didn't really seem like a good fit to try and refactor that one to be
a subclass.

bq. I think in general that processors should match nothing by default. Could lead to unexpected
behaviour for users in the long run.

Martijn: I kept going back and forth on this while i was working on it.  Ultimately my thought
process was that it didn't really make sense for the "default" to be a No-Op because if that's
the case then what's the point of having a default at all?

And if we're going to require that they provide at least one of the field selectors, and we
want to offer them syntactic sugar for "match all fields", why not make it the shortest sugar?

I figured it would make sense for the base class to assume that "no args" meant let the subclass
see all of the fields/values -- and the subclasses could enforce their own default rules
as needed, ala...
* implicitly...
** in the TrimFieldUpdateProcessorFactory attached, it ignores anything that isn't an instance
of String -- regardless of how it's configured (so it doesn't call toString() on an Integer
and then try to trim that)
* explicitly
** i imagine that Date/Number parsing update processors should default to only trying to parse
fields where the FieldType extends DateField/TrieField (the Concat processor should probably
do the same for StrField fields configured to be multiValued=false, now that i think about
it).  But unlike how the Trim processor works, if they are explicitly configured to parse
fields named "foo.*" they should try to do so regardless of what the field type/settings might
be, because maybe a subsequent processor will rename/move those fields in the input docs
to something that is expecting a Date/Number (or does support multivalued fields)
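
The two defaults described above (no selector args means the subclass sees every field, and the Trim processor skips anything that isn't a String however it is configured) can be sketched in plain Java. This is only an illustration under loud assumptions: `TrimSketch`, `selector`, `trimInPlace` and `demo` are hypothetical names, a document is modeled as a plain `Map` of field name to value list, and none of this is the API from the attached patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical sketch; not the base class from the attached patch. */
class TrimSketch {

    /** A null fieldRegex stands for "no args" and selects every field. */
    static Predicate<String> selector(String fieldRegex) {
        if (fieldRegex == null) {
            return name -> true;
        }
        Pattern p = Pattern.compile(fieldRegex);
        return name -> p.matcher(name).matches();
    }

    /** Trims String values in place; values of any other type are left alone. */
    static void trimInPlace(Map<String, List<Object>> doc, Predicate<String> selected) {
        for (Map.Entry<String, List<Object>> field : doc.entrySet()) {
            if (!selected.test(field.getKey())) {
                continue;
            }
            List<Object> values = field.getValue();
            for (int i = 0; i < values.size(); i++) {
                Object v = values.get(i);
                if (v instanceof String) {
                    values.set(i, ((String) v).trim());
                }
            }
        }
    }

    /** Builds a two-field document and trims it with no selector configured. */
    static Map<String, List<Object>> demo() {
        Map<String, List<Object>> doc = new LinkedHashMap<>();
        doc.put("title", new ArrayList<Object>(Arrays.asList("  hello ")));
        doc.put("count", new ArrayList<Object>(Arrays.asList(42)));
        trimInPlace(doc, selector(null));
        return doc;
    }
}
```

In `demo()` the "title" value is trimmed while the Integer in "count" is never converted to a String, which is the implicit rule the Trim processor is described as enforcing.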

what do you think?

the scenario that still bothers me about all this is that if we put something like this in
the example solrconfig.xml...

<updateRequestProcessorChain name="simple" default="true">
 <processor class="solr.TrimFieldUpdateProcessorFactory" />
 <processor class="solr.LogUpdateProcessorFactory" />
 <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

...(so all strings get trimmed) someone might say "Hey, stop trimming my strings!" and it's
easy for them to remove that from the example.  But someone else might say: "This is exactly
what i want _most_ of the time, but I've got this one field where whitespace matters, stop
trimming that one." -- and now he's got to jump through a lot of hoops to keep the trim behavior
on all but one field (unless we add some sort of exclusion option(s)).  Even if we make some
field selection args mandatory for the processor and use this instead...

<updateRequestProcessorChain name="simple" default="true">
 <processor class="solr.TrimFieldUpdateProcessorFactory">
   <str name="fieldRegex">.*</str>
 </processor>
 <processor class="solr.LogUpdateProcessorFactory" />
 <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

..that user still has the same amount of pain to deal with.
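
One way the "exclusion option(s)" idea might look, purely as a sketch: the selector takes an optional include regex plus a list of exclude regexes, so "everything except this one field" is a single extra setting. The names `FieldSelectorSketch`, `matching` and `selector`, and the notion of an exclude list at all, are assumptions made for illustration; nothing here is an existing Solr option.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical sketch of include/exclude field selection. */
class FieldSelectorSketch {

    /** Matches field names that the regex covers in full. */
    static Predicate<String> matching(String regex) {
        Pattern p = Pattern.compile(regex);
        return name -> p.matcher(name).matches();
    }

    /**
     * A null includeRegex selects every field; each exclude regex then
     * removes the fields it matches from the selection.
     */
    static Predicate<String> selector(String includeRegex, List<String> excludeRegexes) {
        Predicate<String> selected;
        if (includeRegex == null) {
            selected = name -> true;
        } else {
            selected = matching(includeRegex);
        }
        for (String exclude : excludeRegexes) {
            selected = selected.and(matching(exclude).negate());
        }
        return selected;
    }

    /** Example: trim-everything default, minus one whitespace-sensitive field. */
    static Predicate<String> demo() {
        return selector(null, Arrays.asList("raw_body"));
    }
}
```

With a selector like `demo()`, the user who wants trimming on all but one field only names that field, instead of enumerating every field that should still be trimmed.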

> Toolkit of UpdateProcessors for modifying document values
> ---------------------------------------------------------
>                 Key: SOLR-2802
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Hoss Man
>         Attachments: SOLR-2802_update_processor_toolkit.patch
> Frequently users ask questions about things where the answer is "you could do it
> with an UpdateProcessor" but the number of out of the box UpdateProcessors is generally lacking
> and there aren't even very good base classes for the common case of manipulating field values
> when adding documents.
