lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: RegexReplaceProcessorFactory pattern to detect multiple \n
Date Wed, 20 Feb 2019 08:58:52 GMT
Hi Paul,

If I execute the second step first, I get only a single <br> in the places
that previously had 2 <br>.
In the places where we originally got 4 <br>, there are now 2 <br> with a
space in between.
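
The effect of the step order can be reproduced outside Solr with plain java.util.regex. The sketch below is an illustration, not Solr code: the class name BreakOrderDemo and the sample text are made up, the two patterns are the ones from the quoted configuration, and Matcher.quoteReplacement stands in for literalReplacement=true. It assumes the input contains only ordinary spaces and LF.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BreakOrderDemo {
    // Patterns copied from the two processors quoted in this thread.
    static final Pattern MULTI = Pattern.compile("([ \\t]*\\r?\\n){2,}");
    static final Pattern SINGLE = Pattern.compile("([ \\t]*\\r?\\n){1,}");

    // Approximates literalReplacement=true: the replacement is taken verbatim.
    static String replace(Pattern p, String input, String replacement) {
        return p.matcher(input).replaceAll(Matcher.quoteReplacement(replacement));
    }

    // Order used in the quoted config: runs of newlines first, single newlines second.
    static String multiThenSingle(String s) {
        return replace(SINGLE, replace(MULTI, s, "<br><br>"), "<br>");
    }

    // Reversed order: the single-newline step already consumes every run.
    static String singleThenMulti(String s) {
        return replace(MULTI, replace(SINGLE, s, "<br>"), "<br><br>");
    }

    public static void main(String[] args) {
        // Hypothetical sample modelled on Example 1, plus one single line break.
        String text = "Dear Sir,  \n\n \n \n\n I am terminating\nmy contract";
        // prints: Dear Sir,<br><br> I am terminating<br>my contract
        System.out.println(multiThenSingle(text));
        // prints: Dear Sir,<br> I am terminating<br>my contract
        System.out.println(singleThenMulti(text));
    }
}
```

Running the multi-newline step first keeps the paragraph break as <br><br>. Running the single-newline step first turns every run of newlines into one <br>, so the {2,} pattern has nothing left to match.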

This only turns the 2 <br> into a single <br>, because the second step
replaces each match with a single <br>.
It does not solve the underlying problem.
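
One way to test the non-printing-character hypothesis raised earlier in the thread is to list any characters in the field that look like whitespace but are not space, tab, CR or LF. The sketch below is hypothetical throughout: the thread never confirms that such a character is present, U+00A0 (no-break space) is only a guess chosen because it renders like a space, and the class and method names are invented. The variant pattern uses \h, the horizontal-whitespace class available in java.util.regex since Java 8.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HiddenWhitespaceDemo {
    // Pattern from the thread: only space and tab may precede each newline.
    static final Pattern ASCII_RUN = Pattern.compile("([ \\t]*\\r?\\n){2,}");
    // Variant (assumption): \h also covers U+00A0 and other Unicode spaces.
    static final Pattern UNICODE_RUN = Pattern.compile("(\\h*\\r?\\n){2,}");

    static String collapse(Pattern p, String input) {
        return p.matcher(input).replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    // Lists code points of space-like or control characters other than
    // space, tab, CR and LF, formatted as U+XXXX.
    static String suspects(String input) {
        StringBuilder sb = new StringBuilder();
        input.codePoints()
             .filter(cp -> cp != ' ' && cp != '\t' && cp != '\r' && cp != '\n')
             .filter(cp -> Character.isSpaceChar(cp) || Character.isWhitespace(cp)
                        || Character.isISOControl(cp))
             .forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Hypothetical input: a no-break space sits between two blank-line runs.
        String text = "Psalm 89:17 \n\n\u00A0\n\n3 Choa Chu Kang";
        System.out.println(suspects(text));              // prints: U+00A0
        System.out.println(collapse(ASCII_RUN, text));   // two <br><br> pairs remain, split by the no-break space
        System.out.println(collapse(UNICODE_RUN, text)); // prints: Psalm 89:17<br><br>3 Choa Chu Kang
    }
}
```

If suspects() returns nothing for the real field content, this hypothesis can be ruled out and the cause lies elsewhere.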

Regards,
Edwin


On Wed, 20 Feb 2019 at 16:41, <paul.dodd@ub.unibe.ch> wrote:

> If the second step is executed first, then you will get the unwanted 4 <br>
>
>
>
> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> Windows 10
>
>
>
> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> Sent: Wednesday, 20 February 2019 09:29
> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>
>
>
> Hi Jörn,
>
> Do you mean the regex is not correct?
>
> We are already using two RegexReplaceProcessorFactory steps, like the ones
> shown below. The output that we get is still the same.
>
> <processor class="solr.RegexReplaceProcessorFactory">
>      <str name="fieldName">content</str>
>      <str name="pattern">([ \t]*\r?\n){2,}</str>
>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>      <bool name="literalReplacement">true</bool>
> </processor>
>
> <processor class="solr.RegexReplaceProcessorFactory">
>      <str name="fieldName">content</str>
>      <str name="pattern">([ \t]*\r?\n){1,}</str>
>      <str name="replacement">&lt;br&gt;</str>
>      <bool name="literalReplacement">true</bool>
> </processor>
>
> Regards,
> Edwin
>
> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jornfranke@gmail.com> wrote:
>
> > Then you need two RegexReplaceProcessorFactory steps
> >
> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > Thanks for the reply.
> > >
> > > Do you know of any online regex tool that works correctly for Java regex?
> > > I tried to find some, but they do not work properly.
> > >
> > > Yes, our plan is to replace more than one \n with <br><br>, and a single \n
> > > with a single <br>.
> > >
> > > Regards,
> > > Edwin
> > >
> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfranke@gmail.com> wrote:
> > >>
> > >> Solr uses Java regex matching, so I doubt there is a bug; if there were, it
> > >> would be in the JDK. Try your pattern in an online regex tool that supports
> > >> Java regex.
> > >>
> > >> I believe you want to have 2 RegexReplaceProcessorFactory steps:
> > >> one that deals with a single \n and one that deals with more than one \n
> > >>
> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
> > >>> configuration:
> > >>>
> > >>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>  <str name="fieldName">content</str>
> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>  <bool name="literalReplacement">true</bool>
> > >>> </processor>
> > >>>
> > >>> However, the issue is still occurring.
> > >>>
> > >>> Is anyone else able to help?
> > >>>
> > >>> Regards,
> > >>> Edwin
> > >>>
> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> > >>>
> > >>>> Hi,
> > >>>>
> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
> > >>>>
> > >>>> Regards,
> > >>>> Edwin
> > >>>>
> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> > >>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> Should we report this as a bug in Solr?
> > >>>>>
> > >>>>> Regards,
> > >>>>> Edwin
> > >>>>>
> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> > >>>>>
> > >>>>>> Hi Paul,
> > >>>>>>
> > >>>>>> Regarding the regex (\n\s*){2,} that we are using: when we try it on
> > >>>>>> https://regex101.com/, it gives us the correct result for all
> > >>>>>> the examples (i.e. all of them have only <br><br>, and not more
> > >>>>>> than that, unlike what we are getting in Solr in our earlier examples).
> > >>>>>>
> > >>>>>> Could there be a possibility of a bug in Solr?
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> Edwin
> > >>>>>>
> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> > >>>>>>
> > >>>>>>> Hi Paul,
> > >>>>>>>
> > >>>>>>> We have tried it with the space preceding the \n, i.e. <str
> > >>>>>>> name="pattern">(\s*\n){2,}</str>, in the following configuration:
> > >>>>>>>
> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>>>  <str name="fieldName">content</str>
> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>>>>> </processor>
> > >>>>>>>
> > >>>>>>> However, we are getting exactly the same results as in the earlier
> > >>>>>>> Examples 1, 2 and 3.
> > >>>>>>>
> > >>>>>>> As for your point 2, that the data may contain other (non-printing)
> > >>>>>>> characters than \n: we have found no non-printing
> > >>>>>>> characters. It is just a new line with a space. You can refer to the
> > >>>>>>> original content in the same examples below.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Example 1: A sentence where the above regex pattern works
> > >>>>>>> correctly
> > >>>>>>> *Original content in EML file:*
> > >>>>>>> Dear Sir,
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> I am terminating
> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> > >>>>>>>
> > >>>>>>> Example 2: A sentence where the above regex pattern partially
> > >>>>>>> works (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>> *Original content in EML file:*
> > >>>>>>>
> > >>>>>>> *exalted*
> > >>>>>>>
> > >>>>>>> *Psalm 89:17*
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> 3 Choa Chu Kang Avenue 4
> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br> <br><br>3
> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
> > >>>>>>>
> > >>>>>>> Example 3: A sentence where the above regex pattern partially
> > >>>>>>> works (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>> *Original content in EML file:*
> > >>>>>>>
> > >>>>>>> http://www.concordpri.moe.edu.sg/
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
> > >>>>>>> 2018 at 10:07 AM
> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> We would appreciate any other ideas or suggestions that you may have.
> > >>>>>>>
> > >>>>>>> Thank you.
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Edwin
> > >>>>>>>
> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.dodd@ub.unibe.ch> wrote:
> > >>>>>>>>
> > >>>>>>>> Hi Edwin
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> 1.  Sorry, the pattern was wrong; the space should precede the \n,
> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
> > >>>>>>>> 2.  Perhaps the data contains other (non-printing) characters
> > >>>>>>>> than \n?
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> > >>>>>>>> Windows 10
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Hi Paul,
> > >>>>>>>>
> > >>>>>>>> We have tried the suggested regex pattern as follows:
> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>>>>  <str name="fieldName">content</str>
> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>>>>>> </processor>
> > >>>>>>>>
> > >>>>>>>> But we still have exactly the same problem with Examples 1, 2 and 3
> > >>>>>>>> below.
> > >>>>>>>>
> > >>>>>>>> Example 1: A sentence where the above regex pattern works
> > >>>>>>>> correctly
> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> > >>>>>>>>
> > >>>>>>>> Example 2: A sentence where the above regex pattern partially works
> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
> > >>>>>>>> Chu Kang Avenue 4, Singapore
> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br> <br><br>3 Choa
> > >>>>>>>> Chu Kang Avenue 4, Singapore
> > >>>>>>>>
> > >>>>>>>> Example 3: A sentence where the above regex pattern partially works
> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n
> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
> > >>>>>>>> at 10:07 AM
> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> > >>>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>>>>
> > >>>>>>>> Any further suggestions?
> > >>>>>>>>
> > >>>>>>>> Thank you.
> > >>>>>>>>
> > >>>>>>>> Regards,
> > >>>>>>>> Edwin
> > >>>>>>>>
> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.dodd@ub.unibe.ch> wrote:
> > >>>>>>>>>
> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the
> > >>>>>>>>> {2,} part you could try
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> If you also want to match CRLF then
> > >>>>>>>>>
> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> > >>>>>>>>> Windows 10
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Hi Paul,
> > >>>>>>>>>
> > >>>>>>>>> Thanks for your reply.
> > >>>>>>>>>
> > >>>>>>>>> When I use this pattern:
> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>>>>>  <str name="fieldName">content</str>
> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>>>>>>> </processor>
> > >>>>>>>>>
> > >>>>>>>>> It works for some sentences within the same content and not
> > >>>>>>>>> for others. Please see below for one that works and
> > >>>>>>>>> others that do not (they work only partially):
> > >>>>>>>>>
> > >>>>>>>>> Example 1: A sentence where the above regex pattern works
> > >>>>>>>>> correctly
> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> > >>>>>>>>>
> > >>>>>>>>> Example 2: A sentence where the above regex pattern partially works
> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
> > >>>>>>>>> Chu Kang Avenue 4, Singapore
> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br> <br><br>3 Choa
> > >>>>>>>>> Chu Kang Avenue 4, Singapore
> > >>>>>>>>>
> > >>>>>>>>> Example 3: A sentence where the above regex pattern partially works
> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n
> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
> > >>>>>>>>> at 10:07 AM
> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> > >>>>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>>>>>
> > >>>>>>>>> We would appreciate your help in finding out what is wrong.
> > >>>>>>>>>
> > >>>>>>>>> Thank you.
> > >>>>>>>>>
> > >>>>>>>>> Regards,
> > >>>>>>>>> Edwin
> > >>>>>>>>>
> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.dodd@ub.unibe.ch> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> You don’t say what happens, just that it is not working. I assume
> > >>>>>>>>>> nothing is replaced? Perhaps the pattern should be
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> ??
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> > >>>>>>>>>> Windows 10
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Hi,
> > >>>>>>>>>>
> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to find two or
> > >>>>>>>>>> more \n with any number of spaces between them (e.g. \n\n, \n \n,
> > >>>>>>>>>> \n \n \n \n) and replace them with two <br>.
> > >>>>>>>>>>
> > >>>>>>>>>> I use the following regex pattern, and it works when I test it on
> > >>>>>>>>>> regex101.com. But it does not work when I put it inside the
> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
> > >>>>>>>>>>
> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>>>>>>  <str name="fieldName">content</str>
> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>>>>>>>> </processor>
> > >>>>>>>>>>         </updateRequestProcessorChain>
> > >>>>>>>>>>
> > >>>>>>>>>> To explain my regex pattern further: \s* tells the regex to
> > >>>>>>>>>> match any \n that is followed by spaces, and {2,} tells the
> > >>>>>>>>>> regex to match 2 or more occurrences of that pattern (\n).
> > >>>>>>>>>>
> > >>>>>>>>>> Please kindly let me know what is wrong and how I should do it.
> > >>>>>>>>>>
> > >>>>>>>>>> I am using Solr 7.6.0.
> > >>>>>>>>>>
> > >>>>>>>>>> Regards,
> > >>>>>>>>>> Edwin
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>
> >
>
