lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: RegexReplaceProcessorFactory pattern to detect multiple \n
Date Thu, 07 Feb 2019 16:33:38 GMT
Hi Paul,

We have tried it with the whitespace preceding the \n, i.e. <str
name="pattern">(\s*\n){2,}</str>, using the following configuration:

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(\s*\n){2,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
</processor>

However, we are still getting exactly the same results as in the earlier
Examples 1, 2 and 3.
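
For reference, the two patterns tried so far would not be expected to give
byte-identical output on the Example 1 text. The sketch below is an
approximation in Python rather than Solr itself: Solr uses java.util.regex,
and re.ASCII is used here so that \s covers only ASCII whitespace, as Java's
\s does by default. The helper name collapse and the assumption that the text
contains only plain spaces and \n are illustrative.

```python
import re

# re.ASCII restricts \s to ASCII whitespace, like Java's default \s.
LEADING = re.compile(r"(\s*\n){2,}", re.ASCII)   # whitespace before each \n
TRAILING = re.compile(r"(\n\s*){2,}", re.ASCII)  # whitespace after each \n

REPLACEMENT = "<br><br>"


def collapse(pattern, text):
    """Replace every match of pattern in text with two <br> tags."""
    return pattern.sub(REPLACEMENT, text)


# Example 1, assuming it holds only plain spaces and \n characters.
example_1 = "Dear Sir,  \n\n \n \n\n I am terminating"

print(repr(collapse(LEADING, example_1)))
# 'Dear Sir,<br><br> I am terminating'
print(repr(collapse(TRAILING, example_1)))
# 'Dear Sir,  <br><br>I am terminating'
```

Under those assumptions the two patterns leave the leftover spaces on opposite
sides of the <br><br>. Identical index content for both patterns may therefore
mean the updated pattern was not the one applied at index time, although that
is only a guess from the outside.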

As for your point 2, that the data may contain other (non-printing)
characters besides \n, we have found no non-printing characters.
There are only newlines and spaces. You can refer to the original content in
the examples below.
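
One way to double-check for hidden characters is to list the code point of
every character that is neither printable ASCII nor \n. This is a minimal
sketch: the function name find_suspects and the sample string (which contains
a deliberately inserted no-break space) are hypothetical and are not taken
from the actual EML content.

```python
import unicodedata


def find_suspects(text):
    """Return (index, code point, name) for each character that is
    neither printable ASCII (0x20-0x7E) nor a plain \\n."""
    suspects = []
    for index, char in enumerate(text):
        is_plain = char == "\n" or 0x20 <= ord(char) <= 0x7E
        if not is_plain:
            # Control characters have no Unicode name, hence the default.
            name = unicodedata.name(char, "UNNAMED")
            suspects.append((index, f"U+{ord(char):04X}", name))
    return suspects


# Hypothetical sample with a no-break space (U+00A0) between the newlines.
sample = "Psalm 89:17 \n\n \u00a0 \n\n 3 Choa"

for entry in find_suspects(sample):
    print(entry)
# (15, 'U+00A0', 'NO-BREAK SPACE')
```

An empty result for the real field value would support the view that only
spaces and newlines are present.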


Example 1: A sentence where the above regex pattern works correctly
*Original content in EML file:*
Dear Sir,


I am terminating
*Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
*Index content: *    Dear Sir,  <br><br>I am terminating

Example 2: A sentence where the above regex pattern only partially works
(as you can see, instead of 2 <br>, there are 4 <br>)
*Original content in EML file:*

*exalted*

*Psalm 89:17*


3 Choa Chu Kang Avenue 4
*Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
Chu Kang Avenue 4, Singapore
*Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa
Chu Kang Avenue 4, Singapore

Example 3: A sentence where the above regex pattern only partially works
(as you can see, instead of 2 <br>, there are 4 <br>)
*Original content in EML file:*

http://www.concordpri.moe.edu.sg/








On Tue, Dec 18, 2018 at 10:07 AM
*Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n
\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
at 10:07 AM
*Index content: *http://www.concordpri.moe.edu.sg/   <br><br>  <br><br>On
Tue, Dec 18, 2018 at 10:07 AM
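
As a rough check of what a regex engine does with the Example 2 text, the
sketch below runs the earlier (\n\s*){2,} pattern in Python, since the
reported index content has the shape that pattern produces (spaces kept
before the <br><br>, none after). It is an approximation of Solr's
java.util.regex behaviour, with re.ASCII keeping \s to ASCII whitespace as in
Java's default. The second input, containing a no-break space, is purely
hypothetical and only shows one kind of character that would split a run.

```python
import re

# Same pattern as in the earlier configuration; re.ASCII mimics Java's \s.
PATTERN = re.compile(r"(\n\s*){2,}", re.ASCII)


def collapse(text):
    """Replace each run of two or more newlines with two <br> tags."""
    return PATTERN.sub("<br><br>", text)


# Example 2 fragment typed with plain spaces and \n only.
ascii_only = "Psalm 89:17   \n\n   \n\n  3 Choa"

# Hypothetical variant: the middle space is a no-break space (U+00A0).
with_nbsp = "Psalm 89:17   \n\n \u00a0 \n\n  3 Choa"

print(repr(collapse(ascii_only)))
# 'Psalm 89:17   <br><br>3 Choa'
print(repr(collapse(with_nbsp)))
# 'Psalm 89:17   <br><br>\xa0 <br><br>3 Choa'
```

With plain spaces the whole run collapses into a single <br><br>. A character
that \s does not match ends the first match early and produces two
replacements, which resembles the reported index content.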


Appreciate any other ideas or suggestions that you may have.

Thank you.

Regards,
Edwin

On Thu, 7 Feb 2019 at 22:49, <paul.dodd@ub.unibe.ch> wrote:

> Hi Edwin
>
>
>
>   1.  Sorry, the pattern was wrong, the space should precede the \n i.e.
> <str name="pattern">(\s*\n){2,}</str>
>   2.  Perhaps in the data you have other (non printing) characters than \n?
>
>
>
> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> Windows 10
>
>
>
> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> Sent: Thursday, 7 February 2019 15:23
> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>
>
>
> Hi Paul,
>
> We have tried this suggested regex pattern as follows:
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(\n\s*){2,}</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> </processor>
>
> But we still have exactly the same problem in Examples 1, 2 and 3 below.
>
> Example 1: The sentence that the above regex pattern is working correctly
> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> *Index content: *    Dear Sir,  <br><br>I am terminating
>
> Example 2: The sentence that the above regex pattern is partially working
> (as you can see, instead of 2 <br>, there are 4 <br>)
> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
> Chu Kang Avenue 4, Singapore
> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa
> Chu Kang Avenue 4, Singapore
>
> Example 3: The sentence that the above regex pattern is partially working
> (as you can see, instead of 2 <br>, there are 4 <br>)
> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
> \n\n
> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
> at 10:07 AM
> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>  <br><br>On
> Tue, Dec 18, 2018 at 10:07 AM
>
> Any further suggestion?
>
> Thank you.
>
> Regards,
> Edwin
>
> On Thu, 7 Feb 2019 at 22:20, <paul.dodd@ub.unibe.ch> wrote:
>
> > To avoid the «\n+\s*» matching too many \n and then failing on the {2,}
> > part you could try
> >
> >
> >
> > <str name="pattern">(\n\s*){2,}</str>
> >
> >
> >
> > If you also want to match CRLF then
> >
> > <str name="pattern">(\r?\n\s*){2,}</str>
> >
> >
> >
> >
> >
> > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> > Windows 10
> >
> >
> >
> > From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> > Sent: Thursday, 7 February 2019 15:10
> > To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >
> >
> >
> > Hi Paul,
> >
> > Thanks for your reply.
> >
> > When I use this pattern:
> > <processor class="solr.RegexReplaceProcessorFactory">
> >    <str name="fieldName">content</str>
> >    <str name="pattern">(\n+\s*){2,}</str>
> >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > </processor>
> >
> > It works for some sentences within the same content but not for others.
> > Please see below for one sentence that works and two that only
> > partially work:
> >
> > Example 1: The sentence that the above regex pattern is working correctly
> > *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> > *Index content: *    Dear Sir,  <br><br>I am terminating
> >
> > Example 2: The sentence that the above regex pattern is partially working
> > (as you can see, instead of 2 <br>, there are 4 <br>)
> > *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
> > Chu Kang Avenue 4, Singapore
> > *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa
> > Chu Kang Avenue 4, Singapore
> >
> > Example 3: The sentence that the above regex pattern is partially working
> > (as you can see, instead of 2 <br>, there are 4 <br>)
> > *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
> > \n\n
> > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
> > at 10:07 AM
> > *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>  <br><br>On
> > Tue, Dec 18, 2018 at 10:07 AM
> >
> > We would appreciate your help in finding out what is wrong.
> >
> > Thank you.
> >
> > Regards,
> > Edwin
> >
> > On Thu, 7 Feb 2019 at 21:24, <paul.dodd@ub.unibe.ch> wrote:
> >
> > > You don’t say what happens, just that it is not working. I assume
> > > nothing is replaced? Perhaps the pattern should be
> > >
> > >
> > >
> > >    <str name="pattern">"(\n\s*){2,}"</str>
> > >
> > >
> > >
> > > ??
> > >
> > >
> > >
> > > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> > > Windows 10
> > >
> > >
> > >
> > > From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> > > Sent: Thursday, 7 February 2019 14:08
> > > To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> > > Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
> > >
> > >
> > >
> > > Hi,
> > >
> > > I am trying to use the RegexReplaceProcessorFactory to find runs of two
> > > or more \n with any number of spaces between them (e.g. \n\n, \n \n,
> > > \n \n  \n \n) and replace each run with two <br>.
> > >
> > > I use the following regex pattern and it is working when I test it in
> > > regex101.com. But it is not working when I put it inside the
> > > RegexReplaceProcessorFactory as below:
> > >
> > > <updateRequestProcessorChain name="removeCode">
> > > <processor class="solr.RegexReplaceProcessorFactory">
> > >    <str name="fieldName">content</str>
> > >    <str name="pattern">"(\\n\s*){2,}"</str>
> > >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > > </processor>
> > > </updateRequestProcessorChain>
> > >
> > > To explain my regex pattern further: \s* matches any whitespace that
> > > follows a \n, and {2,} matches 2 or more occurrences of that group
> > > (a \n plus its trailing whitespace).
> > >
> > > Please let me know what is wrong and how I should do it.
> > >
> > > I am using Solr 7.6.0.
> > >
> > > Regards,
> > > Edwin
> > >
> >
>
