lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: RegexReplaceProcessorFactory pattern to detect multiple \n
Date Wed, 20 Feb 2019 08:19:18 GMT
Hi Paul,

I am using Java 1.8.0_201.

Regards,
Edwin
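
Since Solr hands the pattern to java.util.regex, it can be checked on the same JVM instead of an online tester. The following is a minimal sketch, not Solr code: the class name NewlinePatternCheck and the helpers collapse and visualize are hypothetical, and the first sample string is retyped from Example 2 in the quoted thread with plain ASCII spaces. The second sample contains a non-breaking space (U+00A0), which is purely an assumption for illustration; the thread does not confirm that such a character is in the data. Java's default \s and [ \t] do not match U+00A0, so it would split one run of newlines into two matches.

```java
import java.util.regex.Pattern;

public class NewlinePatternCheck {
    // Pattern from the thread: two or more line breaks, each optionally
    // preceded by spaces or tabs.
    static final Pattern MULTI_NEWLINE = Pattern.compile("([ \\t]*\\r?\\n){2,}");
    static final String DOUBLE_BR = "<br><br>";

    // Replaces every run matched by the pattern with two <br> tags.
    static String collapse(String content) {
        return MULTI_NEWLINE.matcher(content).replaceAll(DOUBLE_BR);
    }

    // Renders control and non-ASCII characters as visible escapes so that
    // hidden characters between the line breaks can be spotted.
    static String visualize(String content) {
        StringBuilder out = new StringBuilder();
        for (char c : content.toCharArray()) {
            if (c == '\n') out.append("\\n");
            else if (c == '\r') out.append("\\r");
            else if (c == '\t') out.append("\\t");
            else if (c < 0x20 || c > 0x7e) out.append(String.format("\\u%04x", (int) c));
            else out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Retyped from Example 2 in the thread, ASCII whitespace only.
        String ascii = "exalted  \n \n\n   Psalm 89:17   \n\n \n\n  3 Choa Chu Kang Avenue 4";
        // Hypothetical variant: a non-breaking space (U+00A0) between the runs.
        String withNbsp = "Psalm 89:17\n\n\u00a0\n\n3 Choa Chu Kang Avenue 4";

        System.out.println(visualize(ascii));
        System.out.println(visualize(collapse(ascii)));
        System.out.println(visualize(withNbsp));
        System.out.println(visualize(collapse(withNbsp)));
    }
}
```

Running visualize on the raw field value taken from the failing document, rather than on text pasted from a mail client, would show whether anything other than space, \r and \n sits between the line breaks.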

On Wed, 20 Feb 2019 at 16:01, <paul.dodd@ub.unibe.ch> wrote:

> BTW, which Java Version are you using?
>
>
>
> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> Windows 10
>
>
>
> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> Sent: Wednesday, 20 February 2019 08:13
> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>
>
>
> Hi,
>
> Thanks for the reply.
>
> Do you know of any regex online tool that works correctly for Java regex?
> I tried to find some, but they are not working properly.
>
> Yes, our plan is to replace more than one \n with <br><br>, and single \n
> with single <br>.
>
> Regards,
> Edwin
>
> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfranke@gmail.com> wrote:
>
> > Solr uses Java regex matching, so I doubt there is a bug; if there were,
> > it would be in the JDK. Try your pattern in an online regex tool that
> > supports Java regex.
> >
> > I believe you want two regex processor factories: one that deals with a
> > single \n and one that deals with more than one \n.
> >
> > > On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > We have tried with the following pattern ([ \t]*\r?\n){2,} and
> > > configuration:
> > >
> > > <processor class="solr.RegexReplaceProcessorFactory">
> > >   <str name="fieldName">content</str>
> > >   <str name="pattern">([ \t]*\r?\n){2,}</str>
> > >   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >   <bool name="literalReplacement">true</bool>
> > > </processor>
> > >
> > > However, the issue is still occurring.
> > >
> > > Is anyone else able to help?
> > >
> > > Regards,
> > > Edwin
> > >
> > > On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> > > wrote:
> > >
> > >> Hi,
> > >>
> > >> For your info, this issue is occurring in Solr 7.7.0 as well.
> > >>
> > >> Regards,
> > >> Edwin
> > >>
> > >> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> > >> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> Should we report this as a bug in Solr?
> > >>>
> > >>> Regards,
> > >>> Edwin
> > >>>
> > >>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Hi Paul,
> > >>>>
> > >>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
> > >>>> https://regex101.com/, it gives us the correct result for all the
> > >>>> examples (i.e. all of them have only <br><br>, and not more than that,
> > >>>> unlike what we are getting in Solr in our earlier examples).
> > >>>>
> > >>>> Could there be a possibility of a bug in Solr?
> > >>>>
> > >>>> Regards,
> > >>>> Edwin
> > >>>>
> > >>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi Paul,
> > >>>>>
> > >>>>> We have tried it with the space preceding the \n, i.e.
> > >>>>> <str name="pattern">(\s*\n){2,}</str>, using the following
> > >>>>> configuration:
> > >>>>>
> > >>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>   <str name="fieldName">content</str>
> > >>>>>   <str name="pattern">(\s*\n){2,}</str>
> > >>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>>> </processor>
> > >>>>>
> > >>>>> However, we are getting exactly the same results as in the earlier
> > >>>>> Examples 1, 2 and 3.
> > >>>>>
> > >>>>> As for your point 2, that the data may contain other (non-printing)
> > >>>>> characters than \n: we have found no non-printing characters. It is
> > >>>>> just a new line with a space. You can refer to the original content
> > >>>>> in the same examples below.
> > >>>>>
> > >>>>>
> > >>>>> Example 1: A sentence where the above regex pattern works correctly
> > >>>>> *Original content in EML file:*
> > >>>>> Dear Sir,
> > >>>>>
> > >>>>>
> > >>>>> I am terminating
> > >>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> > >>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> > >>>>>
> > >>>>> Example 2: A sentence where the above regex pattern works only
> > >>>>> partially (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>> *Original content in EML file:*
> > >>>>>
> > >>>>> *exalted*
> > >>>>>
> > >>>>> *Psalm 89:17*
> > >>>>>
> > >>>>>
> > >>>>> 3 Choa Chu Kang Avenue 4
> > >>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n \n\n  3
> > >>>>> Choa Chu Kang Avenue 4, Singapore
> > >>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
> > >>>>> Choa Chu Kang Avenue 4, Singapore
> > >>>>>
> > >>>>> Example 3: A sentence where the above regex pattern works only
> > >>>>> partially (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>> *Original content in EML file:*
> > >>>>>
> > >>>>> http://www.concordpri.moe.edu.sg/
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n  \n\n \n
> > >>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
> > >>>>> Dec 18, 2018 at 10:07 AM
> > >>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> > >>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>
> > >>>>>
> > >>>>> Appreciate any other ideas or suggestions that you may have.
> > >>>>>
> > >>>>> Thank you.
> > >>>>>
> > >>>>> Regards,
> > >>>>> Edwin
> > >>>>>
> > >>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.dodd@ub.unibe.ch> wrote:
> > >>>>>>
> > >>>>>> Hi Edwin
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>  1.  Sorry, the pattern was wrong; the space should precede the \n,
> > >>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
> > >>>>>>  2.  Perhaps the data contains other (non-printing) characters
> > >>>>>> than \n?
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> > >>>>>> Windows 10
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> > >>>>>> Sent: Thursday, 7 February 2019 15:23
> > >>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> > >>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> Hi Paul,
> > >>>>>>
> > >>>>>> We have tried this suggested regex pattern as follows:
> > >>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>>   <str name="fieldName">content</str>
> > >>>>>>   <str name="pattern">(\n\s*){2,}</str>
> > >>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>>>> </processor>
> > >>>>>>
> > >>>>>> But we still have exactly the same problem with Examples 1, 2 and 3
> > >>>>>> below.
> > >>>>>>
> > >>>>>> Example 1: A sentence where the above regex pattern works correctly
> > >>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> > >>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> > >>>>>>
> > >>>>>> Example 2: A sentence where the above regex pattern works only
> > >>>>>> partially (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n \n\n  3
> > >>>>>> Choa Chu Kang Avenue 4, Singapore
> > >>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
> > >>>>>> Choa Chu Kang Avenue 4, Singapore
> > >>>>>>
> > >>>>>> Example 3: A sentence where the above regex pattern works only
> > >>>>>> partially (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n  \n\n \n
> > >>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
> > >>>>>> Dec 18, 2018 at 10:07 AM
> > >>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> > >>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>>
> > >>>>>> Any further suggestions?
> > >>>>>>
> > >>>>>> Thank you.
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> Edwin
> > >>>>>>
> > >>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.dodd@ub.unibe.ch> wrote:
> > >>>>>>>
> > >>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the
> > >>>>>>> {2,} part you could try
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> <str name="pattern">(\n\s*){2,}</str>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> If you also want to match CRLF then
> > >>>>>>>
> > >>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> > >>>>>>> Windows 10
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> > >>>>>>> Sent: Thursday, 7 February 2019 15:10
> > >>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> > >>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Hi Paul,
> > >>>>>>>
> > >>>>>>> Thanks for your reply.
> > >>>>>>>
> > >>>>>>> When I use this pattern:
> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>>>   <str name="fieldName">content</str>
> > >>>>>>>   <str name="pattern">(\n+\s*){2,}</str>
> > >>>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>>>>> </processor>
> > >>>>>>>
> > >>>>>>> It works for some sentences within the same content but not for
> > >>>>>>> others. Please see below for one that is working and others that
> > >>>>>>> are only partially working:
> > >>>>>>>
> > >>>>>>> Example 1: A sentence where the above regex pattern works correctly
> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> > >>>>>>>
> > >>>>>>> Example 2: A sentence where the above regex pattern works only
> > >>>>>>> partially (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n \n\n  3
> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
> > >>>>>>>
> > >>>>>>> Example 3: A sentence where the above regex pattern works only
> > >>>>>>> partially (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n  \n\n \n
> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
> > >>>>>>> Dec 18, 2018 at 10:07 AM
> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>>>
> > >>>>>>> We would appreciate your help in working out what is wrong.
> > >>>>>>>
> > >>>>>>> Thank you.
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Edwin
> > >>>>>>>
> > >>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.dodd@ub.unibe.ch> wrote:
> > >>>>>>>>
> > >>>>>>>> You don’t say what happens, just that it is not working. I assume
> > >>>>>>>> nothing is replaced? Perhaps the pattern should be
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>   <str name="pattern">"(\n\s*){2,}"</str>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> ??
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> > >>>>>>>> Windows 10
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> > >>>>>>>> Sent: Thursday, 7 February 2019 14:08
> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> > >>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Hi,
> > >>>>>>>>
> > >>>>>>>> I am trying to use the RegexReplaceProcessorFactory to remove runs
> > >>>>>>>> of two or more \n with any number of spaces between them (e.g.
> > >>>>>>>> \n\n, \n \n, \n \n \n \n), and replace each run with two <br>.
> > >>>>>>>>
> > >>>>>>>> I use the following regex pattern, and it works when I test it on
> > >>>>>>>> regex101.com. But it does not work when I put it inside the
> > >>>>>>>> RegexReplaceProcessorFactory as below:
> > >>>>>>>>
> > >>>>>>>> <updateRequestProcessorChain name="removeCode">
> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>>>>   <str name="fieldName">content</str>
> > >>>>>>>>   <str name="pattern">"(\\n\s*){2,}"</str>
> > >>>>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>>>>>> </processor>
> > >>>>>>>>          </updateRequestProcessorChain>
> > >>>>>>>>
> > >>>>>>>> To explain my regex pattern further: \s* instructs the regex to
> > >>>>>>>> match any \n that has whitespace after it, and {2,} instructs the
> > >>>>>>>> regex to match 2 or more occurrences of such a pattern (\n).
> > >>>>>>>>
> > >>>>>>>> Please kindly let me know what is wrong and how I should do it.
> > >>>>>>>>
> > >>>>>>>> I am using Solr 7.6.0.
> > >>>>>>>>
> > >>>>>>>> Regards,
> > >>>>>>>> Edwin
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> >
>
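
The plan stated in the thread (two or more \n become <br><br>, a single \n becomes <br>) and the suggestion of two regex processors amount to two replacement passes applied in a fixed order. The following is a minimal sketch of that ordering in plain Java, outside Solr; the class name TwoPassBreaks, the method toBreaks and the single-newline pattern are assumptions made for illustration and do not come from the thread.

```java
import java.util.regex.Pattern;

public class TwoPassBreaks {
    // Pass 1: a run of two or more line breaks, each optionally preceded
    // by spaces or tabs (the pattern tried in the thread).
    static final Pattern MULTI = Pattern.compile("([ \\t]*\\r?\\n){2,}");
    // Pass 2 (assumed): any single line break still left after pass 1.
    static final Pattern SINGLE = Pattern.compile("[ \\t]*\\r?\\n");

    // Runs the multi-newline pass first, then the single-newline pass.
    static String toBreaks(String content) {
        String collapsed = MULTI.matcher(content).replaceAll("<br><br>");
        return SINGLE.matcher(collapsed).replaceAll("<br>");
    }

    public static void main(String[] args) {
        System.out.println(toBreaks("line one\nline two\n\n\nline three"));
    }
}
```

The order matters: if the single-newline pass ran first, every \n would already be a <br> and the multi-newline pass would find nothing to collapse. Processors in a Solr update request processor chain run in the order they are declared, so the multi-newline processor would be listed first.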
