lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: RegexReplaceProcessorFactory pattern to detect multiple \n
Date Mon, 25 Feb 2019 02:28:04 GMT
Hi,

Does anyone else have other suggestions, or has anyone faced the same problem?

Regards,
Edwin
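
One hypothesis that would be consistent with the output quoted below is Paul's second point, that the data contains a character other than \n and ordinary whitespace. The sketch below assumes, purely for illustration, that a non-breaking space (U+00A0) sits between the two newline runs in Example 2; the thread does not confirm this. Java's \s matches only [ \t\n\x0B\f\r] by default, so U+00A0 would split one run into two matches. The sketch is in Python rather than Java; because Python's \s is Unicode-aware, re.ASCII is used to approximate Java's default \s. The sample strings and function names are made up for the example.

```python
import re

# re.ASCII restricts \s to [ \t\n\r\f\v], approximating Java's default \s.
PATTERN = re.compile(r"(\n\s*){2,}", re.ASCII)

# Hypothetical variant that also treats U+00A0 as part of the run.
PATTERN_NBSP = re.compile(r"(\n[\s\u00a0]*){2,}", re.ASCII)


def collapse(text: str) -> str:
    """Replace each run of two or more newlines with <br><br>."""
    return PATTERN.sub("<br><br>", text)


def collapse_nbsp_aware(text: str) -> str:
    """Same as collapse, but U+00A0 no longer interrupts a run."""
    return PATTERN_NBSP.sub("<br><br>", text)


# Only ASCII spaces between the newlines: the whole run is one match.
plain = "Psalm 89:17   \n\n  \n\n  3 Choa Chu Kang"

# Assumed input: a non-breaking space between the two pairs of newlines.
nbsp = "Psalm 89:17   \n\n\u00a0\n\n  3 Choa Chu Kang"

if __name__ == "__main__":
    print(repr(collapse(plain)))
    print(repr(collapse(nbsp)))
    print(repr(collapse_nbsp_aware(nbsp)))
```

With these inputs, collapse(nbsp) yields two <br><br> pairs separated by the non-breaking space, which has the same shape as the indexed content in Examples 2 and 3. Inspecting the code points of the stored field value would confirm or rule out this hypothesis.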

On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
wrote:

> Hi Paul,
>
> If I execute the second step first, I get only a single <br> where we
> previously got 2 <br>.
> Where we originally got 4 <br>, there are now 2 <br> with a space in
> between.
>
> This just changes each pair of <br> into a single <br>, since the second
> step replaces with a single <br>.
> It does not solve the underlying problem.
>
> Regards,
> Edwin
>
>
> On Wed, 20 Feb 2019 at 16:41, <paul.dodd@ub.unibe.ch> wrote:
>
>> If the second step is executed first, then you will get the unwanted 4
>> <br>
>>
>>
>>
>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> Windows 10
>>
>>
>>
>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> Sent: Wednesday, 20 February 2019 09:29
>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>
>>
>>
>> Hi Jörn,
>>
>> Do you mean the regex is not correct?
>>
>> We are already using two RegexReplaceProcessorFactory steps, like the ones
>> shown below. The output that we get is still the same.
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>      <str name="fieldName">content</str>
>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>      <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>      <str name="fieldName">content</str>
>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
>>      <str name="replacement">&lt;br&gt;</str>
>>      <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> Regards,
>> Edwin
>>
>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jornfranke@gmail.com> wrote:
>>
>> > Then you need two RegexReplaceProcessorFactory steps
>> >
>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> > >
>> > > Hi,
>> > >
>> > > Thanks for the reply.
>> > >
>> > > Do you know of any regex online tool that works correctly for Java regex?
>> > > I tried to find some, but they are not working properly.
>> > >
>> > > Yes, our plan is to replace more than one \n with <br><br>, and a single \n
>> > > with a single <br>.
>> > >
>> > > Regards,
>> > > Edwin
>> > >
>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfranke@gmail.com> wrote:
>> > >>
>> > >> Solr uses Java regex matching, so I doubt there is a bug; it would then
>> > >> be in the JDK. Try your pattern in an online regex tool that supports
>> > >> Java regex.
>> > >>
>> > >> I believe you want two regex processor factories:
>> > >> one that deals with a single \n and one that deals with more than one \n.
>> > >>
>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> > >>>
>> > >>> Hi,
>> > >>>
>> > >>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
>> > >>> configuration:
>> > >>>
>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>>  <str name="fieldName">content</str>
>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >>>  <bool name="literalReplacement">true</bool>
>> > >>> </processor>
>> > >>>
>> > >>> However, the issue is still occurring.
>> > >>>
>> > >>> Is anyone else able to help?
>> > >>>
>> > >>> Regards,
>> > >>> Edwin
>> > >>>
>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> > >>>
>> > >>>> Hi,
>> > >>>>
>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
>> > >>>>
>> > >>>> Regards,
>> > >>>> Edwin
>> > >>>>
>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> > >>>>
>> > >>>>> Hi,
>> > >>>>>
>> > >>>>> Should we report this as a bug in Solr?
>> > >>>>>
>> > >>>>> Regards,
>> > >>>>> Edwin
>> > >>>>>
>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> > >>>>>
>> > >>>>>> Hi Paul,
>> > >>>>>>
>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
>> > >>>>>> https://regex101.com/, it gives us the correct result for all the
>> > >>>>>> examples (i.e. all of them have only <br><br>, and not more than that,
>> > >>>>>> unlike what we are getting in Solr in our earlier examples).
>> > >>>>>>
>> > >>>>>> Could there be a possibility of a bug in Solr?
>> > >>>>>>
>> > >>>>>> Regards,
>> > >>>>>> Edwin
>> > >>>>>>
>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> > >>>>>>
>> > >>>>>>> Hi Paul,
>> > >>>>>>>
>> > >>>>>>> We have tried it with the space preceding the \n, i.e. <str
>> > >>>>>>> name="pattern">(\s*\n){2,}</str>, using the following configuration:
>> > >>>>>>>
>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>>>>>>  <str name="fieldName">content</str>
>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
>> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >>>>>>> </processor>
>> > >>>>>>>
>> > >>>>>>> However, we are also getting exactly the same results as in the earlier
>> > >>>>>>> Examples 1, 2 and 3.
>> > >>>>>>>
>> > >>>>>>> As for your point 2, that the data may contain other (non-printing)
>> > >>>>>>> characters than \n, we have found that there are no non-printing
>> > >>>>>>> characters. It is just a new line with a space. You can refer to the
>> > >>>>>>> original content in the same examples below.
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> Example 1: A sentence for which the above regex pattern works correctly
>> > >>>>>>> *Original content in EML file:*
>> > >>>>>>> Dear Sir,
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> I am terminating
>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>> > >>>>>>>
>> > >>>>>>> Example 2: A sentence for which the above regex pattern works only
>> > >>>>>>> partially (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>>>>>> *Original content in EML file:*
>> > >>>>>>>
>> > >>>>>>> *exalted*
>> > >>>>>>>
>> > >>>>>>> *Psalm 89:17*
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> 3 Choa Chu Kang Avenue 4
>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n  \n\n  3 Choa Chu Kang Avenue 4, Singapore
>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>> > >>>>>>>
>> > >>>>>>> Example 3: A sentence for which the above regex pattern works only
>> > >>>>>>> partially (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>>>>>> *Original content in EML file:*
>> > >>>>>>>
>> > >>>>>>> http://www.concordpri.moe.edu.sg/
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n  \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at 10:07 AM
>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> We would appreciate any other ideas or suggestions that you may have.
>> > >>>>>>>
>> > >>>>>>> Thank you.
>> > >>>>>>>
>> > >>>>>>> Regards,
>> > >>>>>>> Edwin
>> > >>>>>>>
>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.dodd@ub.unibe.ch> wrote:
>> > >>>>>>>>
>> > >>>>>>>> Hi Edwin
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> 1.  Sorry, the pattern was wrong; the space should precede the \n,
>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>> > >>>>>>>> 2.  Perhaps in the data you have other (non-printing) characters
>> > >>>>>>>> than \n?
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> > >>>>>>>> Windows 10
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> Hi Paul,
>> > >>>>>>>>
>> > >>>>>>>> We have tried the suggested regex pattern as follows:
>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>>>>>>>  <str name="fieldName">content</str>
>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
>> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >>>>>>>> </processor>
>> > >>>>>>>>
>> > >>>>>>>> But we still have exactly the same problem as in Examples 1, 2 and 3 below.
>> > >>>>>>>>
>> > >>>>>>>> Example 1: A sentence for which the above regex pattern works correctly
>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>> > >>>>>>>>
>> > >>>>>>>> Example 2: A sentence for which the above regex pattern works only
>> > >>>>>>>> partially (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n  \n\n  3 Choa Chu Kang Avenue 4, Singapore
>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>> > >>>>>>>>
>> > >>>>>>>> Example 3: A sentence for which the above regex pattern works only
>> > >>>>>>>> partially (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n  \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at 10:07 AM
>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> > >>>>>>>>
>> > >>>>>>>> Any further suggestions?
>> > >>>>>>>>
>> > >>>>>>>> Thank you.
>> > >>>>>>>>
>> > >>>>>>>> Regards,
>> > >>>>>>>> Edwin
>> > >>>>>>>>
>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.dodd@ub.unibe.ch> wrote:
>> > >>>>>>>>>
>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the {2,}
>> > >>>>>>>>> part you could try
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> If you also want to match CRLF then
>> > >>>>>>>>>
>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> > >>>>>>>>> Windows 10
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> Hi Paul,
>> > >>>>>>>>>
>> > >>>>>>>>> Thanks for your reply.
>> > >>>>>>>>>
>> > >>>>>>>>> When I use this pattern:
>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>>>>>>>>  <str name="fieldName">content</str>
>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
>> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >>>>>>>>> </processor>
>> > >>>>>>>>>
>> > >>>>>>>>> It works for some sentences within the same content and not for
>> > >>>>>>>>> others. Please see below for one that works and others that do not
>> > >>>>>>>>> (they work only partially):
>> > >>>>>>>>>
>> > >>>>>>>>> Example 1: A sentence for which the above regex pattern works correctly
>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>> > >>>>>>>>>
>> > >>>>>>>>> Example 2: A sentence for which the above regex pattern works only
>> > >>>>>>>>> partially (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n  \n\n  3 Choa Chu Kang Avenue 4, Singapore
>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>> > >>>>>>>>>
>> > >>>>>>>>> Example 3: A sentence for which the above regex pattern works only
>> > >>>>>>>>> partially (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n  \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at 10:07 AM
>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> > >>>>>>>>>
>> > >>>>>>>>> We would appreciate your help in seeing what is wrong.
>> > >>>>>>>>>
>> > >>>>>>>>> Thank you.
>> > >>>>>>>>>
>> > >>>>>>>>> Regards,
>> > >>>>>>>>> Edwin
>> > >>>>>>>>>
>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.dodd@ub.unibe.ch> wrote:
>> > >>>>>>>>>>
>> > >>>>>>>>>> You don’t say what happens, just that it is not working. I assume nothing
>> > >>>>>>>>>> is replaced? Perhaps the pattern should be
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>> ??
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> > >>>>>>>>>> Windows 10
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>> Hi,
>> > >>>>>>>>>>
>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to replace two or
>> > >>>>>>>>>> more \n with any number of spaces between them (e.g. \n\n, \n \n,
>> > >>>>>>>>>> \n \n \n \n) with two <br>.
>> > >>>>>>>>>>
>> > >>>>>>>>>> I use the following regex pattern, and it works when I test it on
>> > >>>>>>>>>> regex101.com. But it does not work when I put it inside the
>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
>> > >>>>>>>>>>
>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>>>>>>>>>  <str name="fieldName">content</str>
>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
>> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >>>>>>>>>> </processor>
>> > >>>>>>>>>>         </updateRequestProcessorChain>
>> > >>>>>>>>>>
>> > >>>>>>>>>> To explain my regex pattern further: \s* instructs the regex to match
>> > >>>>>>>>>> any \n that is followed by whitespace, and {2,} instructs the regex to
>> > >>>>>>>>>> match 2 or more occurrences of that pattern (\n).
>> > >>>>>>>>>>
>> > >>>>>>>>>> Please kindly let me know what is wrong and how I should do it.
>> > >>>>>>>>>>
>> > >>>>>>>>>> I am using Solr 7.6.0.
>> > >>>>>>>>>>
>> > >>>>>>>>>> Regards,
>> > >>>>>>>>>> Edwin
>> > >>>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>
>> > >>
>> >
>>
>
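
The effect of the step order discussed earlier in the thread can be reproduced outside Solr. The following is a minimal sketch in Python (not Solr itself) that applies the two quoted patterns in both orders to a made-up sample string; the function names and the sample are hypothetical. For patterns this simple, Python's re module should behave the same way as Java's regex engine.

```python
import re

# The two patterns from the quoted two-step configuration.
MULTI = re.compile(r"([ \t]*\r?\n){2,}")   # two or more line breaks
SINGLE = re.compile(r"([ \t]*\r?\n){1,}")  # one or more line breaks


def multi_then_single(text: str) -> str:
    """Order used in the quoted configuration: runs first, then leftovers."""
    text = MULTI.sub("<br><br>", text)
    return SINGLE.sub("<br>", text)


def single_then_multi(text: str) -> str:
    """Reversed order: the single-break step consumes every run first."""
    text = SINGLE.sub("<br>", text)
    return MULTI.sub("<br><br>", text)


# Made-up sample: one single line break and one run of three.
sample = "line one\nline two\n \n\nline three"

if __name__ == "__main__":
    print(multi_then_single(sample))
    print(single_then_multi(sample))
```

In the reversed order, the one-or-more pattern already consumes each run as a single match, so the second pass finds nothing left to replace and every run ends up as a single <br>. This matches the observation above that swapping the steps only turns 2 <br> into 1 and does not address why a run is split in the first place.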
