lucene-solr-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: RegexReplaceProcessorFactory pattern to detect multiple \n
Date Wed, 20 Feb 2019 08:03:12 GMT
Maybe the tools work properly and it is the regex that does not behave as you expect?
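
One way to check this without an online tool is to run the pattern through java.util.regex directly, since that is the engine Solr uses. The sketch below is only an illustration: the class name NewlineRegexCheck, the method toBreaks, the second single-line-break pattern and the input containing a non-breaking space (U+00A0) are illustrative assumptions, not something confirmed in this thread.

```java
import java.util.regex.Pattern;

class NewlineRegexCheck {
    // Pattern from the thread: two or more line breaks, each optionally
    // preceded by spaces or tabs.
    static final Pattern MULTI = Pattern.compile("([ \\t]*\\r?\\n){2,}");
    // Assumed second step for any line break left over after MULTI.
    static final Pattern SINGLE = Pattern.compile("[ \\t]*\\r?\\n");

    // Applies both replacements in order, as two chained processors would.
    static String toBreaks(String text) {
        String collapsed = MULTI.matcher(text).replaceAll("<br><br>");
        return SINGLE.matcher(collapsed).replaceAll("<br>");
    }

    public static void main(String[] args) {
        // Example 2 from the thread, with ordinary spaces only.
        System.out.println(toBreaks("Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang"));
        // Hypothetical variant: a non-breaking space (U+00A0) inside the run.
        System.out.println(toBreaks("Psalm 89:17   \n\n \u00A0 \n\n  3 Choa Chu Kang"));
        // A single line break becomes a single <br>.
        System.out.println(toBreaks("line one\nline two"));
    }
}
```

With ordinary spaces the whole run between "Psalm 89:17" and "3" collapses into one <br><br>. In the hypothetical input the run is split in two, because neither [ \t] nor Java's default \s matches U+00A0. If the indexed content looks like the second case, it may be worth inspecting the raw characters in the field.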

> On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> 
> Hi,
> 
> Thanks for the reply.
> 
> Do you know of any online regex tool that works correctly for Java regex?
> I tried to find some, but they do not work properly.
> 
> Yes, our plan is to replace more than one \n with <br><br>, and a single \n
> with a single <br>.
> 
> Regards,
> Edwin
> 
>> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfranke@gmail.com> wrote:
>> 
>> Solr uses Java regex matching, so I doubt there is a bug; it would then
>> be in the JDK. Try your pattern in an online regex tool that supports
>> Java regex.
>> 
>> I believe you want two regex processor factories:
>> one that deals with a single \n and one that deals with more than one \n.
>> 
>>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
>>> configuration:
>>> 
>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>  <str name="fieldName">content</str>
>>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>  <bool name="literalReplacement">true</bool>
>>> </processor>
>>> 
>>> However, the issue is still occurring.
>>> 
>>> Is anyone else able to help?
>>> 
>>> Regards,
>>> Edwin
>>> 
>>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> For your info, this issue is occurring in Solr 7.7.0 as well.
>>>> 
>>>> Regards,
>>>> Edwin
>>>> 
>>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> Should we report this as a bug in Solr?
>>>>> 
>>>>> Regards,
>>>>> Edwin
>>>>> 
>>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi Paul,
>>>>>> 
>>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
>>>>>> https://regex101.com/, it gives us the correct result for all
>>>>>> the examples (i.e. all of them have only <br><br>, and not more than
>>>>>> that, unlike what we are getting in Solr in our earlier examples).
>>>>>> 
>>>>>> Could there be a possibility of a bug in Solr?
>>>>>> 
>>>>>> Regards,
>>>>>> Edwin
>>>>>> 
>>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi Paul,
>>>>>>> 
>>>>>>> We have tried it with the space preceding the \n, i.e. <str
>>>>>>> name="pattern">(\s*\n){2,}</str>, with the following configuration:
>>>>>>> 
>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>  <str name="fieldName">content</str>
>>>>>>>  <str name="pattern">(\s*\n){2,}</str>
>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>>> </processor>
>>>>>>> 
>>>>>>> However, we are getting exactly the same results as in the earlier
>>>>>>> Examples 1, 2 and 3.
>>>>>>> 
>>>>>>> As for your point 2, that the data may contain other (non-printing)
>>>>>>> characters than \n, we have found that there are no non-printing
>>>>>>> characters. It is just a new line with a space. You can refer to the
>>>>>>> original content in the same examples below.
>>>>>>> 
>>>>>>> 
>>>>>>> Example 1: The sentence that the above regex pattern is working
>>>>>>> correctly
>>>>>>> *Original content in EML file:*
>>>>>>> Dear Sir,
>>>>>>> 
>>>>>>> 
>>>>>>> I am terminating
>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>>>> 
>>>>>>> Example 2: The sentence that the above regex pattern is partially
>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>> *Original content in EML file:*
>>>>>>> 
>>>>>>> *exalted*
>>>>>>> 
>>>>>>> *Psalm 89:17*
>>>>>>> 
>>>>>>> 
>>>>>>> 3 Choa Chu Kang Avenue 4
>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4, Singapore
>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>>>>>> 
>>>>>>> Example 3: The sentence that the above regex pattern is partially
>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>> *Original content in EML file:*
>>>>>>> 
>>>>>>> http://www.concordpri.moe.edu.sg/
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>>>>>> 2018 at 10:07 AM
>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>>>>> 
>>>>>>> 
>>>>>>> Appreciate any other ideas or suggestions that you may have.
>>>>>>> 
>>>>>>> Thank you.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Edwin
>>>>>>> 
>>>>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.dodd@ub.unibe.ch> wrote:
>>>>>>>> 
>>>>>>>> Hi Edwin
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 1.  Sorry, the pattern was wrong; the space should precede the \n,
>>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>>>>>>>> 2.  Perhaps the data contains other (non-printing) characters
>>>>>>>> than \n?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>>>>>> Windows 10
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>>>>>>> Sent: Thursday, 7 February 2019 15:23
>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Paul,
>>>>>>>> 
>>>>>>>> We have tried this suggested regex pattern as follows:
>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>>  <str name="fieldName">content</str>
>>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>>>> </processor>
>>>>>>>> 
>>>>>>>> But we still have exactly the same problem as in Examples 1, 2 and 3
>>>>>>>> below.
>>>>>>>> 
>>>>>>>> Example 1: The sentence that the above regex pattern is working
>>>>>>>> correctly
>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>>>>> 
>>>>>>>> Example 2: The sentence that the above regex pattern is partially
>>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4, Singapore
>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>>>>>>> 
>>>>>>>> Example 3: The sentence that the above regex pattern is partially
>>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n
>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
>>>>>>>> at 10:07 AM
>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>>>>>> 
>>>>>>>> Any further suggestion?
>>>>>>>> 
>>>>>>>> Thank you.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Edwin
>>>>>>>> 
>>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.dodd@ub.unibe.ch> wrote:
>>>>>>>>> 
>>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the
>>>>>>>>> {2,} part you could try
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> If you also want to match CRLF then
>>>>>>>>> 
>>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>>>>>>> Windows 10
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Hi Paul,
>>>>>>>>> 
>>>>>>>>> Thanks for your reply.
>>>>>>>>> 
>>>>>>>>> When I use this pattern:
>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>>>  <str name="fieldName">content</str>
>>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>>>>> </processor>
>>>>>>>>> 
>>>>>>>>> It works for some sentences within the same content and not for
>>>>>>>>> others. Please see below for one sentence that works and
>>>>>>>>> others that do not (they work only partially):
>>>>>>>>> 
>>>>>>>>> Example 1: The sentence that the above regex pattern is working
>>>>>>>>> correctly
>>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>>>>>> 
>>>>>>>>> Example 2: The sentence that the above regex pattern is partially
>>>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4, Singapore
>>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>>>>>>>> 
>>>>>>>>> Example 3: The sentence that the above regex pattern is partially
>>>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n
>>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
>>>>>>>>> at 10:07 AM
>>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>>>>>>> 
>>>>>>>>> We would appreciate your help to see what is wrong?
>>>>>>>>> 
>>>>>>>>> Thank you.
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> Edwin
>>>>>>>>> 
>>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.dodd@ub.unibe.ch> wrote:
>>>>>>>>>> 
>>>>>>>>>> You don’t say what happens, just that it is not working. I assume
>>>>>>>>>> nothing is replaced? Perhaps the pattern should be
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> ??
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>>>>>>>> Windows 10
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to match two or
>>>>>>>>>> more \n with any number of spaces between them (e.g. \n\n, \n \n,
>>>>>>>>>> \n \n \n \n) and replace them with two <br>.
>>>>>>>>>> 
>>>>>>>>>> I use the following regex pattern, and it works when I test it on
>>>>>>>>>> regex101.com. But it does not work when I put it inside the
>>>>>>>>>> RegexReplaceProcessorFactory as below:
>>>>>>>>>> 
>>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>>>>  <str name="fieldName">content</str>
>>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
>>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>>>>>> </processor>
>>>>>>>>>>         </updateRequestProcessorChain>
>>>>>>>>>> 
>>>>>>>>>> To explain my regex pattern further: \s* instructs the regex to
>>>>>>>>>> match any \n followed by whitespace, and {2,} instructs the regex
>>>>>>>>>> to match 2 or more occurrences of that group (\n).
>>>>>>>>>> 
>>>>>>>>>> Please kindly let me know what is wrong and how I should do it.
>>>>>>>>>> 
>>>>>>>>>> I am using Solr 7.6.0.
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Edwin
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>> 
