lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: RegexReplaceProcessorFactory pattern to detect multiple \n
Date Wed, 20 Feb 2019 08:29:04 GMT
Hi Jörn,

Do you mean the regex is not correct?

We are already using two RegexReplaceProcessorFactory steps, as shown
below, and the output that we get is still the same.

<processor class="solr.RegexReplaceProcessorFactory">
     <str name="fieldName">content</str>
     <str name="pattern">([ \t]*\r?\n){2,}</str>
     <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
     <bool name="literalReplacement">true</bool>
</processor>

<processor class="solr.RegexReplaceProcessorFactory">
     <str name="fieldName">content</str>
     <str name="pattern">([ \t]*\r?\n){1,}</str>
     <str name="replacement">&lt;br&gt;</str>
     <bool name="literalReplacement">true</bool>
</processor>
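
For anyone who wants to reproduce this outside Solr, the two steps above can be approximated in plain Java. This is only a sketch and not Solr's actual code: the names NewlineToBr, convert and convertWide are made up for illustration, and Matcher.quoteReplacement stands in for literalReplacement=true. The \h variant is an assumption about one possible cause, for example a non-breaking space (U+00A0) between the newlines, which neither [ \t] nor Java's default \s matches. Nothing in this thread confirms that the content actually contains such characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NewlineToBr {
    // The two patterns from the processor steps above, as Java string literals.
    static final Pattern MULTI = Pattern.compile("([ \\t]*\\r?\\n){2,}");
    static final Pattern SINGLE = Pattern.compile("([ \\t]*\\r?\\n){1,}");

    // Hypothetical variant: \h (Java 8+) also matches U+00A0 and other
    // horizontal whitespace that [ \t] and the default \s do not cover.
    static final Pattern MULTI_WIDE = Pattern.compile("(\\h*\\r?\\n){2,}");

    /** Applies both replacements in order, like two chained processors. */
    static String convert(String content) {
        String step1 = MULTI.matcher(content)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
        return SINGLE.matcher(step1)
                .replaceAll(Matcher.quoteReplacement("<br>"));
    }

    /** First step only, using the \h-based pattern (an assumption, see above). */
    static String convertWide(String content) {
        return MULTI_WIDE.matcher(content)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // Plain spaces between the newlines: one run, one replacement.
        String plain = "Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang";
        // A non-breaking space between the newlines splits the run in two.
        String nbsp = "Psalm 89:17   \n\n\u00A0\n\n  3 Choa Chu Kang";

        System.out.println(convert(plain));
        System.out.println(convert(nbsp));
        System.out.println(convertWide(nbsp));
    }
}
```

If convert() returns a single <br><br> for a sample copied from the original document while Solr still produces two, comparing the code points between the newlines in the stored content would be a reasonable next check.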

Regards,
Edwin

On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jornfranke@gmail.com> wrote:

> Then you need two RegexReplaceProcessorFactory steps
>
> > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> >
> > Hi,
> >
> > Thanks for the reply.
> >
> > Do you know of any regex online tool that works correctly for Java regex?
> > I tried to find some, but they are not working properly.
> >
> > Yes, our plan is to replace more than one \n with <br><br>, and a single
> > \n with a single <br>.
> >
> > Regards,
> > Edwin
> >
> >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfranke@gmail.com> wrote:
> >>
> >> Solr uses Java regex matching, so I doubt there is a bug; it would then
> >> be in the JDK. Try your pattern in an online regex tool that supports
> >> Java regex.
> >>
> >> I believe you want to have two RegexReplaceProcessorFactory steps:
> >> one that deals with a single \n and one that deals with more than one \n
> >>
> >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
> >>> configuration:
> >>>
> >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>  <str name="fieldName">content</str>
> >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
> >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>  <bool name="literalReplacement">true</bool>
> >>> </processor>
> >>>
> >>> However, the issue is still occurring.
> >>>
> >>> Is anyone else able to help?
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
> >>>>
> >>>> Regards,
> >>>> Edwin
> >>>>
> >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> Should we report this as a bug in Solr?
> >>>>>
> >>>>> Regards,
> >>>>> Edwin
> >>>>>
> >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Paul,
> >>>>>>
> >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
> >>>>>> https://regex101.com/, it gives us the correct result for all the
> >>>>>> examples (i.e. all of them end up with only <br><br>, and not more
> >>>>>> than that, unlike what we are getting in Solr in our earlier examples).
> >>>>>>
> >>>>>> Could there be a possibility of a bug in Solr?
> >>>>>>
> >>>>>> Regards,
> >>>>>> Edwin
> >>>>>>
> >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Paul,
> >>>>>>>
> >>>>>>> We have tried it with the space preceding the \n, i.e. <str
> >>>>>>> name="pattern">(\s*\n){2,}</str>, with the following configuration:
> >>>>>>>
> >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>>  <str name="fieldName">content</str>
> >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
> >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>>>>> </processor>
> >>>>>>>
> >>>>>>> However, we are getting exactly the same results as in the earlier
> >>>>>>> Examples 1, 2 and 3.
> >>>>>>>
> >>>>>>> As for your point 2, that the data may contain other (non-printing)
> >>>>>>> characters than \n, we have found that there are no non-printing
> >>>>>>> characters. It is just a new line with a space. You can refer to the
> >>>>>>> original content in the same examples below.
> >>>>>>>
> >>>>>>>
> >>>>>>> Example 1: A sentence for which the above regex pattern works
> >>>>>>> correctly
> >>>>>>> *Original content in EML file:*
> >>>>>>> Dear Sir,
> >>>>>>>
> >>>>>>>
> >>>>>>> I am terminating
> >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>>>>>>
> >>>>>>> Example 2: A sentence for which the above regex pattern only
> >>>>>>> partially works (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>> *Original content in EML file:*
> >>>>>>>
> >>>>>>> *exalted*
> >>>>>>>
> >>>>>>> *Psalm 89:17*
> >>>>>>>
> >>>>>>>
> >>>>>>> 3 Choa Chu Kang Avenue 4
> >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> >>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
> >>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>>>>
> >>>>>>> Example 3: A sentence for which the above regex pattern only
> >>>>>>> partially works (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>> *Original content in EML file:*
> >>>>>>>
> >>>>>>> http://www.concordpri.moe.edu.sg/
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n  \n
> >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n
> >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>>
> >>>>>>>
> >>>>>>> Appreciate any other ideas or suggestions that you may have.
> >>>>>>>
> >>>>>>> Thank you.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Edwin
> >>>>>>>
> >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.dodd@ub.unibe.ch> wrote:
> >>>>>>>>
> >>>>>>>> Hi Edwin
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 1.  Sorry, the pattern was wrong; the space should precede the \n,
> >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
> >>>>>>>> 2.  Perhaps in the data you have other (non-printing) characters
> >>>>>>>> than \n?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>>>>>>> Windows 10
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> >>>>>>>> Sent: Thursday, 7 February 2019 15:23
> >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Paul,
> >>>>>>>>
> >>>>>>>> We have tried this suggested regex pattern as follows:
> >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>>>  <str name="fieldName">content</str>
> >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
> >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>>>>>> </processor>
> >>>>>>>>
> >>>>>>>> But we still have exactly the same problem with Examples 1, 2 and 3
> >>>>>>>> below.
> >>>>>>>>
> >>>>>>>> Example 1: A sentence for which the above regex pattern works
> >>>>>>>> correctly
> >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>>>>>>>
> >>>>>>>> Example 2: A sentence for which the above regex pattern only
> >>>>>>>> partially works (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> >>>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
> >>>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>>>>>
> >>>>>>>> Example 3: A sentence for which the above regex pattern only
> >>>>>>>> partially works (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n  \n
> >>>>>>>> \n\n
> >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n
> >>>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> >>>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>>>
> >>>>>>>> Any further suggestion?
> >>>>>>>>
> >>>>>>>> Thank you.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Edwin
> >>>>>>>>
> >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.dodd@ub.unibe.ch> wrote:
> >>>>>>>>>
> >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the
> >>>>>>>>> {2,} part you could try
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> If you also want to match CRLF then
> >>>>>>>>>
> >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>>>>>>>> Windows 10
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
> >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Hi Paul,
> >>>>>>>>>
> >>>>>>>>> Thanks for your reply.
> >>>>>>>>>
> >>>>>>>>> When I use this pattern:
> >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>>>>  <str name="fieldName">content</str>
> >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
> >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>>>>>>> </processor>
> >>>>>>>>>
> >>>>>>>>> It is working for some sentences within the same content and not
> >>>>>>>>> working for others. Please see below for one that is working and
> >>>>>>>>> others that are only partially working:
> >>>>>>>>>
> >>>>>>>>> Example 1: A sentence for which the above regex pattern works
> >>>>>>>>> correctly
> >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>>>>>>>>
> >>>>>>>>> Example 2: A sentence for which the above regex pattern only
> >>>>>>>>> partially works (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> >>>>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
> >>>>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>>>>>>
> >>>>>>>>> Example 3: A sentence for which the above regex pattern only
> >>>>>>>>> partially works (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n  \n
> >>>>>>>>> \n\n
> >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n
> >>>>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> >>>>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>>>>
> >>>>>>>>> We would appreciate your help in finding out what is wrong.
> >>>>>>>>>
> >>>>>>>>> Thank you.
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Edwin
> >>>>>>>>>
> >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.dodd@ub.unibe.ch> wrote:
> >>>>>>>>>>
> >>>>>>>>>> You don’t say what happens, just that it is not working. I assume
> >>>>>>>>>> nothing is replaced? Perhaps the pattern should be
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> ??
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>>>>>>>>> Windows 10
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
> >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to match two or
> >>>>>>>>>> more \n with any number of spaces between them (e.g. \n\n, \n \n,
> >>>>>>>>>> \n \n \n \n), and replace them with two <br>.
> >>>>>>>>>>
> >>>>>>>>>> I use the following regex pattern, and it works when I test it on
> >>>>>>>>>> regex101.com. But it does not work when I put it inside the
> >>>>>>>>>> RegexReplaceProcessorFactory as below:
> >>>>>>>>>>
> >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
> >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>>>>>  <str name="fieldName">content</str>
> >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
> >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>>>>>>>> </processor>
> >>>>>>>>>>         </updateRequestProcessorChain>
> >>>>>>>>>>
> >>>>>>>>>> To explain my regex pattern further: \s* tells the regex to match
> >>>>>>>>>> any whitespace that follows a \n, and {2,} tells it to match 2 or
> >>>>>>>>>> more occurrences of that group (a \n plus any trailing whitespace).
> >>>>>>>>>>
> >>>>>>>>>> Please kindly let me know what is wrong and how I should do it.
> >>>>>>>>>>
> >>>>>>>>>> I am using Solr 7.6.0.
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Edwin
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>
>
