lucene-solr-user mailing list archives

From <paul.d...@ub.unibe.ch>
Subject RE: RegexReplaceProcessorFactory pattern to detect multiple \n
Date Wed, 20 Feb 2019 08:01:07 GMT
BTW, which Java version are you using?






From: Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
Sent: Wednesday, 20 February 2019 08:13
To: solr-user@lucene.apache.org
Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n



Hi,

Thanks for the reply.

Do you know of any regex online tool that works correctly for Java regex?
I tried to find some, but they are not working properly.

Yes, our plan is to replace more than one \n with <br><br>, and a single \n
with a single <br>.

Regards,
Edwin
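
A minimal sketch for trying the plan locally with java.util.regex instead of an online tool. The class name NewlineToBr and the second (single \n) pattern are assumptions made for illustration; only the first pattern is taken from this thread. It is not Solr code, it just applies the same kind of replacement to a plain string:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Solr) for trying the patterns locally.
final class NewlineToBr {
    // Pattern quoted in this thread: two or more line ends, each optionally
    // preceded by spaces or tabs.
    static final Pattern MULTI = Pattern.compile("([ \\t]*\\r?\\n){2,}");
    // Assumed second pass: any single line end that is still left over.
    static final Pattern SINGLE = Pattern.compile("[ \\t]*\\r?\\n");

    static String convert(String content) {
        // quoteReplacement makes the replacement a literal string, similar in
        // spirit to literalReplacement=true in the processor configuration.
        String twoBreaks = Matcher.quoteReplacement("<br><br>");
        String oneBreak = Matcher.quoteReplacement("<br>");
        // Runs of line ends must be handled first, otherwise the single-\n
        // pass would consume them one at a time.
        String afterMulti = MULTI.matcher(content).replaceAll(twoBreaks);
        return SINGLE.matcher(afterMulti).replaceAll(oneBreak);
    }

    public static void main(String[] args) {
        System.out.println(convert("Dear Sir,  \n\n \n \n\n I am terminating"));
        System.out.println(convert("exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4"));
        System.out.println(convert("line one \nline two"));
    }
}
```

With plain spaces and \n characters as input, each run collapses to one <br><br>. If the same input gives a different result inside Solr, comparing the raw field value with the string used here may help narrow things down.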

On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfranke@gmail.com> wrote:

> Solr uses Java regex matching, so I doubt there is a bug, since it would then
> be in the JDK. Try your pattern in an online regex tool that supports
> Java regex.
>
> I believe you want two regex processor factories:
> one that deals with a single \n and one that deals with more than one \n
>
> > On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> >
> > Hi,
> >
> > We have tried with the following pattern ([ \t]*\r?\n){2,} and
> > configuration:
> >
> > <processor class="solr.RegexReplaceProcessorFactory">
> >   <str name="fieldName">content</str>
> >   <str name="pattern">([ \t]*\r?\n){2,}</str>
> >   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >   <bool name="literalReplacement">true</bool>
> > </processor>
> >
> > However, the issue is still occurring.
> >
> > Is anyone else able to help?
> >
> > Regards,
> > Edwin
> >
> > On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> For your info, this issue is occurring in Solr 7.7.0 as well.
> >>
> >> Regards,
> >> Edwin
> >>
> >> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> Should we report this as a bug in Solr?
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi Paul,
> >>>>
> >>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
> >>>> https://regex101.com/, it gives us the correct result for all the
> >>>> examples (i.e. all of them have only <br><br>, and not more than that,
> >>>> unlike what we are getting in Solr in our earlier examples).
> >>>>
> >>>> Could there be a possibility of a bug in Solr?
> >>>>
> >>>> Regards,
> >>>> Edwin
> >>>>
> >>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Paul,
> >>>>>
> >>>>> We have tried it with the space preceding the \n, i.e.
> >>>>> <str name="pattern">(\s*\n){2,}</str>, with the following configuration:
> >>>>>
> >>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>   <str name="fieldName">content</str>
> >>>>>   <str name="pattern">(\s*\n){2,}</str>
> >>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>>> </processor>
> >>>>>
> >>>>> However, we are also getting the exact same results as the earlier
> >>>>> Example 1, 2 and 3.
> >>>>>
> >>>>> As for your point 2, that the data may contain other (non-printing)
> >>>>> characters than \n, we have found that there are no non-printing
> >>>>> characters. It is just a line break with a space. You can refer to the
> >>>>> original content in the same examples below.
> >>>>>
> >>>>>
> >>>>> Example 1: The sentence that the above regex pattern is working
> >>>>> correctly
> >>>>> *Original content in EML file:*
> >>>>> Dear Sir,
> >>>>>
> >>>>>
> >>>>> I am terminating
> >>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> >>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>>>>
> >>>>> Example 2: The sentence that the above regex pattern is partially
> >>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>> *Original content in EML file:*
> >>>>>
> >>>>> *exalted*
> >>>>>
> >>>>> *Psalm 89:17*
> >>>>>
> >>>>>
> >>>>> 3 Choa Chu Kang Avenue 4
> >>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> >>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
> >>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>>
> >>>>> Example 3: The sentence that the above regex pattern is partially
> >>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>> *Original content in EML file:*
> >>>>>
> >>>>> http://www.concordpri.moe.edu.sg/
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tue, Dec 18, 2018 at 10:07 AM
> >>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
> >>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
> >>>>> Dec 18, 2018 at 10:07 AM
> >>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> >>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>
> >>>>>
> >>>>> Appreciate any other ideas or suggestions that you may have.
> >>>>>
> >>>>> Thank you.
> >>>>>
> >>>>> Regards,
> >>>>> Edwin
> >>>>>
> >>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.dodd@ub.unibe.ch> wrote:
> >>>>>>
> >>>>>> Hi Edwin
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>  1.  Sorry, the pattern was wrong; the space should precede the \n,
> >>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
> >>>>>>  2.  Perhaps in the data you have other (non-printing) characters
> >>>>>> than \n?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> From: Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>>>>> Sent: Thursday, 7 February 2019 15:23
> >>>>>> To: solr-user@lucene.apache.org
> >>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Hi Paul,
> >>>>>>
> >>>>>> We have tried the suggested regex pattern as follows:
> >>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>   <str name="fieldName">content</str>
> >>>>>>   <str name="pattern">(\n\s*){2,}</str>
> >>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>>>> </processor>
> >>>>>>
> >>>>>> But we still have exactly the same problem as in Examples 1, 2 and 3 below.
> >>>>>>
> >>>>>> Example 1: The sentence that the above regex pattern is working
> >>>>>> correctly
> >>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> >>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>>>>>
> >>>>>> Example 2: The sentence that the above regex pattern is partially
> >>>>>> working
> >>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> >>>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
> >>>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>>>
> >>>>>> Example 3: The sentence that the above regex pattern is partially
> >>>>>> working
> >>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n
> >>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
> >>>>>> Dec 18, 2018 at 10:07 AM
> >>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> >>>>>> <br><br>On
> >>>>>> Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>
> >>>>>> Any further suggestion?
> >>>>>>
> >>>>>> Thank you.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Edwin
> >>>>>>
> >>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.dodd@ub.unibe.ch> wrote:
> >>>>>>>
> >>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the {2,}
> >>>>>>> part, you could try
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> <str name="pattern">(\n\s*){2,}</str>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> If you also want to match CRLF then
> >>>>>>>
> >>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> From: Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>>>>>> Sent: Thursday, 7 February 2019 15:10
> >>>>>>> To: solr-user@lucene.apache.org
> >>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Hi Paul,
> >>>>>>>
> >>>>>>> Thanks for your reply.
> >>>>>>>
> >>>>>>> When I use this pattern:
> >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>>   <str name="fieldName">content</str>
> >>>>>>>   <str name="pattern">(\n+\s*){2,}</str>
> >>>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>>>>> </processor>
> >>>>>>>
> >>>>>>> It works for some sentences within the same content but not for
> >>>>>>> others. Please see below for one that works and others that only
> >>>>>>> partially work:
> >>>>>>>
> >>>>>>> Example 1: The sentence that the above regex pattern is working correctly
> >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>>>>>>
> >>>>>>> Example 2: The sentence that the above regex pattern is partially working
> >>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
> >>>>>>> Chu Kang Avenue 4, Singapore
> >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa
> >>>>>>> Chu Kang Avenue 4, Singapore
> >>>>>>>
> >>>>>>> Example 3: The sentence that the above regex pattern is partially working
> >>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n
> >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On
> >>>>>>> Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> >>>>>> <br><br>On
> >>>>>>> Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>>
> >>>>>>> We would appreciate your help in finding out what is wrong.
> >>>>>>>
> >>>>>>> Thank you.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Edwin
> >>>>>>>
> >>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.dodd@ub.unibe.ch> wrote:
> >>>>>>>>
> >>>>>>>> You don’t say what happens, just that it is not working. I assume nothing
> >>>>>>>> is replaced? Perhaps the pattern should be
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>   <str name="pattern">"(\n\s*){2,}"</str>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ??
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> From: Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>>>>>>> Sent: Thursday, 7 February 2019 14:08
> >>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I am trying to use the RegexReplaceProcessorFactory to match two or more
> >>>>>>>> \n with any number of spaces between them (e.g. \n\n, \n \n, \n \n \n \n)
> >>>>>>>> and replace them with two <br>.
> >>>>>>>>
> >>>>>>>> I use the following regex pattern, and it works when I test it on
> >>>>>>>> regex101.com. But it does not work when I put it inside the
> >>>>>>>> RegexReplaceProcessorFactory as below:
> >>>>>>>>
> >>>>>>>> <updateRequestProcessorChain name="removeCode">
> >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>>>   <str name="fieldName">content</str>
> >>>>>>>>   <str name="pattern">"(\\n\s*){2,}"</str>
> >>>>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>>>>>> </processor>
> >>>>>>>> </updateRequestProcessorChain>
> >>>>>>>>
> >>>>>>>> To explain my regex pattern further: \s* lets each \n be followed by
> >>>>>>>> any amount of whitespace, and {2,} tells the regex to match 2 or more
> >>>>>>>> occurrences of that pattern.
> >>>>>>>>
> >>>>>>>> Please let me know what is wrong and how I should do it.
> >>>>>>>>
> >>>>>>>> I am using Solr 7.6.0.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Edwin
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
>
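
One way to follow up on the earlier suggestion in this thread that the data may contain characters other than \n is to print the code points of the raw field value. The sketch below is a hypothetical diagnostic (the class WhitespaceProbe and the sample string are made up); whether such characters are actually involved here is not established in the thread. It relies on the fact that the default \s in java.util.regex covers only [ \t\n\x0B\f\r], while Pattern.UNICODE_CHARACTER_CLASS widens it to the Unicode White_Space property, which includes the no-break space U+00A0:

```java
import java.util.regex.Pattern;

// Hypothetical diagnostic helper (not part of Solr).
final class WhitespaceProbe {
    // Default \s in java.util.regex: space, tab, line feed, vertical tab,
    // form feed and carriage return only.
    static final Pattern DEFAULT_WS = Pattern.compile("\\s");
    // With this flag, \s follows the Unicode White_Space property.
    static final Pattern UNICODE_WS =
            Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

    // Renders every code point as U+XXXX so invisible characters show up.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // True if some character is whitespace by the Unicode definition
    // but is not matched by the default \s.
    static boolean hasWhitespaceMissedByDefault(String s) {
        return s.codePoints().anyMatch(cp -> {
            String ch = new String(Character.toChars(cp));
            return UNICODE_WS.matcher(ch).matches()
                    && !DEFAULT_WS.matcher(ch).matches();
        });
    }

    public static void main(String[] args) {
        // Made-up sample containing a no-break space between two line feeds.
        String sample = "Psalm 89:17 \n\u00A0\n 3";
        System.out.println(describe(sample));
        System.out.println(hasWhitespaceMissedByDefault(sample));
    }
}
```

If the second check returns true for real content, a pattern built on \s or [ \t] would stop at that character, which could split one visual run of blank lines into several matches.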
