lucene-solr-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: RegexReplaceProcessorFactory pattern to detect multiple \n
Date Wed, 20 Feb 2019 06:59:42 GMT
Solr uses Java regex matching, so I doubt there is a bug; it would then be in the JDK. Try
your pattern in an online regex tool that supports Java regex.
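
The same check can also be run directly against java.util.regex instead of an online tool. The following is only a minimal sketch: the class and method names are made up for illustration, the input is Example 2 from the quoted messages (assuming it contains real line feeds and ASCII spaces), and the no-break space variant is purely a hypothetical input, not a diagnosis of the data in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NewlineRegexCheck {
    // Pattern from the thread: a newline plus trailing whitespace, two or more times.
    static final String THREAD_PATTERN = "(\\n\\s*){2,}";
    // Same pattern with the embedded (?U) flag (UNICODE_CHARACTER_CLASS),
    // so that \s also covers Unicode white space such as U+00A0.
    static final String UNICODE_PATTERN = "(?U)(\\n\\s*){2,}";

    // Replace every match of regex in text with a literal "<br><br>".
    static String collapse(String regex, String text) {
        return Pattern.compile(regex).matcher(text)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // Example 2 from the thread, assuming real line feeds and ASCII spaces.
        String plain = "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4, Singapore";
        // Hypothetical variant: a no-break space sits between the blank lines.
        String nbsp = "Psalm 89:17   \n\n\u00A0\n\n  3 Choa";

        System.out.println(collapse(THREAD_PATTERN, plain));
        System.out.println(collapse(THREAD_PATTERN, nbsp));
        System.out.println(collapse(UNICODE_PATTERN, nbsp));
    }
}
```

With plain ASCII whitespace the pattern leaves a single <br><br> per run of blank lines. In the hypothetical variant the default \s does not match U+00A0, so the run is split into two matches and two <br><br> pairs come out; with (?U) it collapses to one again.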

I believe you want two RegexReplaceProcessorFactory processors:
one that deals with a single \n and one that deals with more than one \n.
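
Roughly, the two processors would behave like two replacements applied one after the other. This is just a sketch in plain Java under stated assumptions: the class name is hypothetical, and replacing a lone \n with a single <br> is my assumption, since the thread does not say what a single \n should become.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoPassNewlines {
    // Pass 1: runs of two or more newlines (pattern from the thread).
    static final Pattern MULTI = Pattern.compile("(\\n\\s*){2,}");
    // Pass 2: any newline that pass 1 left behind.
    static final Pattern SINGLE = Pattern.compile("\\n");

    static String apply(String text) {
        String afterMulti = MULTI.matcher(text)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
        // Assumption: a lone newline becomes a single <br>.
        return SINGLE.matcher(afterMulti)
                .replaceAll(Matcher.quoteReplacement("<br>"));
    }

    public static void main(String[] args) {
        System.out.println(apply("a\nb \n \n c"));
    }
}
```

The order matters: if the single-\n replacement ran first, there would be no runs of \n left for the second pattern to find.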

> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> 
> Hi,
> 
> We have tried with the following pattern ([ \t]*\r?\n){2,} and
> configuration:
> 
> <processor class="solr.RegexReplaceProcessorFactory">
>   <str name="fieldName">content</str>
>   <str name="pattern">([ \t]*\r?\n){2,}</str>
>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>   <bool name="literalReplacement">true</bool>
> </processor>
> 
> However, the issue is still occurring.
> 
> Is anyone else able to help?
> 
> Regards,
> Edwin
> 
> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> wrote:
> 
>> Hi,
>> 
>> For your info, this issue is occurring in Solr 7.7.0 as well.
>> 
>> Regards,
>> Edwin
>> 
>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> wrote:
>> 
>>> Hi,
>>> 
>>> Should we report this as a bug in Solr?
>>> 
>>> Regards,
>>> Edwin
>>> 
>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>> wrote:
>>> 
>>>> Hi Paul,
>>>> 
>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
>>>> https://regex101.com/, it gives us the correct result for all
>>>> the examples (i.e. all of them have only <br><br>, and not more than
>>>> that, unlike what we are getting in Solr in our earlier examples).
>>>> 
>>>> Could there be a possibility of a bug in Solr?
>>>> 
>>>> Regards,
>>>> Edwin
>>>> 
>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi Paul,
>>>>> 
>>>>> We have tried it with the space preceding the \n, i.e. <str
>>>>> name="pattern">(\s*\n){2,}</str>, with the following configuration:
>>>>> 
>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>   <str name="fieldName">content</str>
>>>>>   <str name="pattern">(\s*\n){2,}</str>
>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>> </processor>
>>>>> 
>>>>> However, we are getting exactly the same results as in the earlier
>>>>> Examples 1, 2 and 3.
>>>>> 
>>>>> As for your point 2, that the data may contain other (non-printing)
>>>>> characters than \n: we have found no non-printing characters. It is
>>>>> just a newline with a space. You can refer to the original content in
>>>>> the same examples below.
>>>>> 
>>>>> 
>>>>> Example 1: The sentence that the above regex pattern is working
>>>>> correctly
>>>>> *Original content in EML file:*
>>>>> Dear Sir,
>>>>> 
>>>>> 
>>>>> I am terminating
>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>> 
>>>>> Example 2: The sentence that the above regex pattern is partially
>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> *Original content in EML file:*
>>>>> 
>>>>> *exalted*
>>>>> 
>>>>> *Psalm 89:17*
>>>>> 
>>>>> 
>>>>> 3 Choa Chu Kang Avenue 4
>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>>> Choa Chu Kang Avenue 4, Singapore
>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>>>>  <br><br>3
>>>>> Choa Chu Kang Avenue 4, Singapore
>>>>> 
>>>>> Example 3: The sentence that the above regex pattern is partially
>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> *Original content in EML file:*
>>>>> 
>>>>> http://www.concordpri.moe.edu.sg/
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>>>> 2018 at 10:07 AM
>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>>> 
>>>>> 
>>>>> Appreciate any other ideas or suggestions that you may have.
>>>>> 
>>>>> Thank you.
>>>>> 
>>>>> Regards,
>>>>> Edwin
>>>>> 
>>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.dodd@ub.unibe.ch> wrote:
>>>>>> 
>>>>>> Hi Edwin
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>  1.  Sorry, the pattern was wrong; the space should precede the \n
>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>>>>>>  2.  Perhaps in the data you have other (non printing) characters
>>>>>> than \n?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>>>> Windows 10
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>>>>> Sent: Thursday, 7 February 2019 15:23
>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Hi Paul,
>>>>>> 
>>>>>> We have tried the suggested regex pattern as follows:
>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>   <str name="fieldName">content</str>
>>>>>>   <str name="pattern">(\n\s*){2,}</str>
>>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>> </processor>
>>>>>> 
>>>>>> But we still have exactly the same problem as in Examples 1, 2 and 3 below.
>>>>>> 
>>>>>> Example 1: The sentence that the above regex pattern is working
>>>>>> correctly
>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>>> 
>>>>>> Example 2: The sentence that the above regex pattern is partially
>>>>>> working
>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
>>>>>> Chu Kang Avenue 4, Singapore
>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>>>>>  <br><br>3 Choa
>>>>>> Chu Kang Avenue 4, Singapore
>>>>>> 
>>>>>> Example 3: The sentence that the above regex pattern is partially
>>>>>> working
>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n
>>>>>> \n \n\n
>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
>>>>>> at 10:07 AM
>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>>>> <br><br>On
>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>>>>> 
>>>>>> Any further suggestion?
>>>>>> 
>>>>>> Thank you.
>>>>>> 
>>>>>> Regards,
>>>>>> Edwin
>>>>>> 
>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.dodd@ub.unibe.ch> wrote:
>>>>>>> 
>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the {2,}
>>>>>>> part you could try
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> If you also want to match CRLF then
>>>>>>> 
>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>>>>> Windows 10
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>>>>>> Sent: Thursday, 7 February 2019 15:10
>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Hi Paul,
>>>>>>> 
>>>>>>> Thanks for your reply.
>>>>>>> 
>>>>>>> When I use this pattern:
>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>   <str name="fieldName">content</str>
>>>>>>>   <str name="pattern">(\n+\s*){2,}</str>
>>>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>>> </processor>
>>>>>>> 
>>>>>>> It works for some sentences within the same content but not for
>>>>>>> others. Please see below for one that works and others that only
>>>>>>> partially work:
>>>>>>> 
>>>>>>> Example 1: The sentence that the above regex pattern is working
>>>>>>> correctly
>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>>>> 
>>>>>>> Example 2: The sentence that the above regex pattern is partially
>>>>>>> working
>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
>>>>>>> Chu Kang Avenue 4, Singapore
>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>>>>>>  <br><br>3 Choa
>>>>>>> Chu Kang Avenue 4, Singapore
>>>>>>> 
>>>>>>> Example 3: The sentence that the above regex pattern is partially
>>>>>>> working
>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>>>>>>> \n\n
>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
>>>>>>> at 10:07 AM
>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>>>>> <br><br>On
>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>>>>>> 
>>>>>>> We would appreciate your help in finding out what is wrong.
>>>>>>> 
>>>>>>> Thank you.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Edwin
>>>>>>> 
>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.dodd@ub.unibe.ch> wrote:
>>>>>>>> 
>>>>>>>> You don’t say what happens, just that it is not working. I assume nothing
>>>>>>>> is replaced? Perhaps the pattern should be
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>   <str name="pattern">"(\n\s*){2,}"</str>
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ??
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>>>>>> Windows 10
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to detect two or
>>>>>>>> more \n with any number of spaces between them (e.g. \n\n, \n \n,
>>>>>>>> \n \n \n \n) and replace them with two <br>.
>>>>>>>> 
>>>>>>>> I use the following regex pattern, and it works when I test it on
>>>>>>>> regex101.com. But it does not work when I put it inside the
>>>>>>>> RegexReplaceProcessorFactory as below:
>>>>>>>> 
>>>>>>>> <updateRequestProcessorChain name="removeCode">
>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>>   <str name="fieldName">content</str>
>>>>>>>>   <str name="pattern">"(\\n\s*){2,}"</str>
>>>>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>>>> </processor>
>>>>>>>> </updateRequestProcessorChain>
>>>>>>>> 
>>>>>>>> To explain my regex pattern further: \s* tells the regex to match any
>>>>>>>> whitespace that follows a \n, and {2,} tells it to match 2 or more
>>>>>>>> occurrences of that group (\n followed by optional whitespace).
>>>>>>>> 
>>>>>>>> Please kindly let me know what is wrong and how I should do it.
>>>>>>>> 
>>>>>>>> I am using Solr 7.6.0.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Edwin
