lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: RegexReplaceProcessorFactory pattern to detect multiple \n
Date Thu, 07 Feb 2019 14:10:00 GMT
Hi Paul,

Thanks for your reply.

When I use this pattern:
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(\n+\s*){2,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
</processor>

It is working for some sentence within the same content and not working for
some sentences. Please see below for the one that is working and another
that is not working (partially working):

Example 1: The sentence that the above regex pattern is working correctly
*Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
*Index content: *    Dear Sir,  <br><br>I am terminating

Example 2: The sentence that the above regex pattern is partially working
(as you can see, instead of 2 <br>, there are 4 <br>)
*Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
Chu Kang Avenue 4, Singapore
*Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
Choa
Chu Kang Avenue 4, Singapore

Example 3: The sentence that the above regex pattern is partially working
(as you can see, instead of 2 <br>, there are 4 <br>)
*Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n
\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
at 10:07 AM
*Index content: *http://www.concordpri.moe.edu.sg/   <br><br>  <br><br>On
Tue, Dec 18, 2018 at 10:07 AM

We would appreciate your help to see what is wrong?

Thank you.

Regards,
Edwin

On Thu, 7 Feb 2019 at 21:24, <paul.dodd@ub.unibe.ch> wrote:

> You don’t say what happens, just that it is not working. I assume nothing
> is replaced? Perhaps the pattern should be
>
>
>
>    <str name="pattern">"(\n\s*){2,}"</str>
>
>
>
> ??
>
>
>
> Gesendet von Mail<https://go.microsoft.com/fwlink/?LinkId=550986> für
> Windows 10
>
>
>
> Von: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> Gesendet: Donnerstag, 7. Februar 2019 14:08
> An: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> Betreff: RegexReplaceProcessorFactory pattern to detect multiple \n
>
>
>
> Hi,
>
> I am trying to use the RegexReplaceProcessorFactory to remove more than two
> \n with any number of spaces between them (Eg: \n\n, \n \n, \n \n  \n \n),
> and replace it with two <br>.
>
> I use the following regex pattern and it is working when I test it in
> regex101.com. But it is not working when I put it inside the
> RegexReplaceProcessorFactory as below:
>
> <updateRequestProcessorChain name="removeCode">
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">"(\\n\s*){2,}"</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> </processor>
>           </updateRequestProcessorChain>
>
> To explain further about my regex pattern, \s* is instructing the regex to
> match any \n that have space after and {2,} is instructing the regex to
> match 2 or more occurrence of such pattern (\n).
>
> Please kindly let me know what is wrong and how should I do it?
>
> I am using Solr 7.6.0.
>
> Regards,
> Edwin
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message