lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Steve Rowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-4864) RegexReplaceProcessorFactory should support pattern capture group substitution in replacement string
Date Thu, 03 Apr 2014 18:03:17 GMT

    [ https://issues.apache.org/jira/browse/SOLR-4864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959042#comment-13959042
] 

Steve Rowe commented on SOLR-4864:
----------------------------------

Hi Sunil,

Patch looks good, a few comments:

# Unless you're making changes to the import section of {{.java}} files, you shouldn't rearrange
them or add/subtract whitespace.  You may need to tell your IDE to stop doing this for you.
# In {{RegexReplaceProcessorFactory.init()}}, instead of {{args.removeArg()}} to get the boolean
{{literalReplacement}} param value, you should use {{args.removeBooleanArg()}}, because users
might specify this arg as {{<str name="literalReplacement">false</str>}} instead
of as a {{<bool>}}, resulting in a {{String}}-valued object returned by {{args.removeArg()}},
and your cast to {{Boolean}} would throw an exception.  ({{NamedList.removeBooleanArg()}}
handles type checking and parsing for you.)
# To test the ability to accept a {{String}}-valued boolean param, I recommend adding an {{<updateRequestProcessorChain>}}
with a {{<str>}}-valued {{literalReplacement}}  to {{solrconfig-update-processor-chains.xml}},
and adding a test for it to {{FieldMutatingUpdateProcessorTest.testRegexReplace()}}.
# I recommend adding new {{<updateRequestProcessorChain>}}-s in {{solrconfig-update-processor-chains.xml}},
and corresponding tests in {{FieldMutatingUpdateProcessorTest.testRegexReplace()}}, that verify
that when {{literalReplacement}} is {{true}}, either explicitly or by default, a text/pattern/replacement
combination that would trigger group substitution if {{literalReplacement}} were {{false}}
would *not* trigger group substitution.


> RegexReplaceProcessorFactory should support pattern capture group substitution in replacement
string
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-4864
>                 URL: https://issues.apache.org/jira/browse/SOLR-4864
>             Project: Solr
>          Issue Type: Improvement
>          Components: update
>    Affects Versions: 4.3
>            Reporter: Jack Krupansky
>         Attachments: SOLR-4864.patch
>
>
> It is unfortunate the the replacement string for RegexReplaceProcessorFactory is a pure,
"quoted" (escaped) literal and does not support pattern capture group substitution. This processor
should be enhanced to support full, standard pattern capture group substitution.
> The test case I used:
> {code}
>   <updateRequestProcessorChain name="regex-mark-special-words">
>     <processor class="solr.RegexReplaceProcessorFactory">
>       <str name="fieldRegex">.*</str>
>       <str name="pattern">([^a-zA-Z]|^)(cat|dog|fox)([^a-zA-Z]|$)</str>
>       <str name="replacement">$1&lt;&lt;$2&gt;&gt;$3</str>
>     </processor>
>     <processor class="solr.LogUpdateProcessorFactory" />
>     <processor class="solr.RunUpdateProcessorFactory" />
>   </updateRequestProcessorChain>
> {code}
> Indexing with this command against the standard Solr example with the above addition
to solrconfig:
> {code}
>   curl "http://localhost:8983/solr/update?commit=true&update.chain=regex-mark-special-words"
\
>   -H 'Content-type:application/json' -d '
>   [{"id": "doc-1",
>     "title": "Hello World",
>     "content": "The cat and the dog jumped over the fox.",
>     "other_ss": ["cat","cat bird", "lazy dog", "red fox den"]}]'
> {code}
> Alas, the resulting document consists of:
> {code}
>   "id":"doc-1",
>   "title":["Hello World"],
>   "content":["The$1<<$2>>$3and the$1<<$2>>$3jumped over the$1<<$2>>$3"],
>   "other_ss":["$1<<$2>>$3",
>     "$1<<$2>>$3bird",
>     "lazy$1<<$2>>$3",
>     "red$1<<$2>>$3den"],
> {code}
> The Javadoc for RegexReplaceProcessorFactory uses the exact same terminology of  "replacement
string", as does Java's Matcher.replaceAll, but clearly the semantics are distinct, with replaceAll
supporting pattern capture group substitution for its "replacement string", while RegexReplaceProcessorFactory
interprets "replacement string" as being a literal. At a minimum, the RegexReplaceProcessorFactory
Javadoc should explicitly state that the string is a literal that does not support pattern
capture group substitution.
> The relevant code in RegexReplaceProcessorFactory#init:
> {code}
> replacement = Matcher.quoteReplacement(replacementParam.toString());
> {code}
> Possible options for the enhancement:
> 1. Simply skip the quoteReplacement and fully support pattern capture group substitution
with no additional changes. Does have a minor backcompat issue.
> 2. Add an alternative to "replacement", say "nonQuotedReplacement" that is not quoted
as "replacement" is.
> 3. Add an option, say "quotedReplacement" that defaults to "true" for backcompat, but
can be set to "false" to support full replaceAll pattern capture group substitution.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message