lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: RegexReplaceProcessorFactory pattern to detect multiple \n
Date Mon, 11 Feb 2019 16:10:07 GMT
Hi,

Should we report this as a bug in Solr?
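
For anyone who wants to check the pattern outside Solr before a report is
filed, here is a minimal sketch in Python (not Solr code). It assumes that
compiling with re.ASCII approximates the default \s of java.util.regex, which
matches only [ \t\n\x0B\f\r]. The helper names collapse_blank_lines and
hidden_whitespace, and the no-break-space variant of Example 2, are made up
for illustration. It shows that (\n\s*){2,} collapses each run in Example 2
to a single <br><br>, and that a whitespace-like character outside that ASCII
set, if one were present, would split a run in two.

```python
import re

# ASCII-only \s is an assumption standing in for java.util.regex's default \s.
BLANK_RUN = re.compile(r"(\n\s*){2,}", re.ASCII)


def collapse_blank_lines(text):
    """Replace each run of two or more newlines (plus whitespace) with <br><br>."""
    return BLANK_RUN.sub("<br><br>", text)


def hidden_whitespace(text):
    """Return (offset, code point) for whitespace that ASCII-only \\s will not match."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch.isspace() and ch not in " \t\n\r\f\v"
    ]


# Example 2 from this thread, typed in as plain ASCII.
example_2 = (
    "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  "
    "3 Choa Chu Kang Avenue 4, Singapore"
)
print(collapse_blank_lines(example_2))
# exalted  <br><br>Psalm 89:17   <br><br>3 Choa Chu Kang Avenue 4, Singapore

# Hypothetical variant: a no-break space (U+00A0) sits inside the second run.
suspect = example_2.replace("\n\n   \n\n", "\n\n\u00a0\n\n")
print(collapse_blank_lines(suspect))  # the second run is now replaced twice
print(hidden_whitespace(suspect))     # reports the U+00A0 and its offset
```

This does not establish that such a character is present in the indexed data.
It is only one thing that could be ruled out by running hidden_whitespace on
the raw field value.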

Regards,
Edwin

On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
wrote:

> Hi Paul,
>
> Regarding the regex (\n\s*){2,} that we are using: when we try it on
> https://regex101.com/, it gives the correct result for all the examples
> (i.e. each of them ends up with only <br><br>, not more than that, unlike
> what we are getting in Solr in our earlier examples).
>
> Could there be a possibility of a bug in Solr?
>
> Regards,
> Edwin
>
> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> wrote:
>
>> Hi Paul,
>>
>> We have tried it with the space preceding the \n, i.e. <str
>> name="pattern">(\s*\n){2,}</str>, using the following configuration:
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">(\s*\n){2,}</str>
>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> </processor>
>>
>> However, we still get exactly the same results as in the earlier
>> Examples 1, 2 and 3.
>>
>> As for your point 2, that the data may contain other (non-printing)
>> characters besides \n: we have found no non-printing characters.
>> It is just a line break followed by a space. You can refer to the original
>> content in the same examples below.
>>
>>
>> Example 1: A passage where the above regex pattern works correctly
>> *Original content in EML file:*
>> Dear Sir,
>>
>>
>> I am terminating
>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>
>> Example 2: A passage where the above regex pattern only partially works
>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> *Original content in EML file:*
>>
>> *exalted*
>>
>> *Psalm 89:17*
>>
>>
>> 3 Choa Chu Kang Avenue 4
>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
>> Chu Kang Avenue 4, Singapore
>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>  <br><br>3 Choa
>> Chu Kang Avenue 4, Singapore
>>
>> Example 3: A passage where the above regex pattern only partially works
>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> *Original content in EML file:*
>>
>> http://www.concordpri.moe.edu.sg/
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Dec 18, 2018 at 10:07 AM
>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>> 2018 at 10:07 AM
>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>
>>
>> Appreciate any other ideas or suggestions that you may have.
>>
>> Thank you.
>>
>> Regards,
>> Edwin
>>
>> On Thu, 7 Feb 2019 at 22:49, <paul.dodd@ub.unibe.ch> wrote:
>>
>>> Hi Edwin
>>>
>>>
>>>
>>>   1.  Sorry, the pattern was wrong; the space should precede the \n, i.e.
>>> <str name="pattern">(\s*\n){2,}</str>
>>>   2.  Perhaps the data contains other (non-printing) characters besides
>>> \n?
>>>
>>>
>>>
>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>> Windows 10
>>>
>>>
>>>
>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>> Sent: Thursday, 7 February 2019 15:23
>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>
>>>
>>>
>>> Hi Paul,
>>>
>>> We have tried the suggested regex pattern as follows:
>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>    <str name="fieldName">content</str>
>>>    <str name="pattern">(\n\s*){2,}</str>
>>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> </processor>
>>>
>>> But we still have exactly the same problem of Example 1,2 and 3 below.
>>>
>>> Example 1: A passage where the above regex pattern works correctly
>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>
>>> Example 2: A passage where the above regex pattern only partially works
>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
>>> Chu Kang Avenue 4, Singapore
>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>>  <br><br>3 Choa
>>> Chu Kang Avenue 4, Singapore
>>>
>>> Example 3: A passage where the above regex pattern only partially works
>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>> 2018 at 10:07 AM
>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>
>>> Any further suggestion?
>>>
>>> Thank you.
>>>
>>> Regards,
>>> Edwin
>>>
>>> On Thu, 7 Feb 2019 at 22:20, <paul.dodd@ub.unibe.ch> wrote:
>>>
>>> > To avoid the «\n+\s*» matching too many \n and then failing on the {2,}
>>> > part you could try
>>> >
>>> >
>>> >
>>> > <str name="pattern">(\n\s*){2,}</str>
>>> >
>>> >
>>> >
>>> > If you also want to match CRLF then
>>> >
>>> > <str name="pattern">(\r?\n\s*){2,}</str>
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>> > Windows 10
>>> >
>>> >
>>> >
>>> > From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>> > Sent: Thursday, 7 February 2019 15:10
>>> > To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> >
>>> >
>>> >
>>> > Hi Paul,
>>> >
>>> > Thanks for your reply.
>>> >
>>> > When I use this pattern:
>>> > <processor class="solr.RegexReplaceProcessorFactory">
>>> >    <str name="fieldName">content</str>
>>> >    <str name="pattern">(\n+\s*){2,}</str>
>>> >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > </processor>
>>> >
>>> > It works for some passages within the same content and not for
>>> > others. Please see below for one that works and another that does
>>> > not (it only partially works):
>>> >
>>> > Example 1: A passage where the above regex pattern works correctly
>>> > *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>> > *Index content: *    Dear Sir,  <br><br>I am terminating
>>> >
>>> > Example 2: A passage where the above regex pattern only partially works
>>> > (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
>>> > Chu Kang Avenue 4, Singapore
>>> > *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>> >  <br><br>3 Choa
>>> > Chu Kang Avenue 4, Singapore
>>> >
>>> > Example 3: A passage where the above regex pattern only partially works
>>> > (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>>> > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>> > 2018 at 10:07 AM
>>> > *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>> > <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>> >
>>> > We would appreciate your help in working out what is wrong.
>>> >
>>> > Thank you.
>>> >
>>> > Regards,
>>> > Edwin
>>> >
>>> > On Thu, 7 Feb 2019 at 21:24, <paul.dodd@ub.unibe.ch> wrote:
>>> >
>>> > > You don’t say what happens, just that it is not working. I assume
>>> > > nothing is replaced? Perhaps the pattern should be
>>> > >
>>> > >
>>> > >
>>> > >    <str name="pattern">"(\n\s*){2,}"</str>
>>> > >
>>> > >
>>> > >
>>> > > ??
>>> > >
>>> > >
>>> > >
>>> > > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>> > > Windows 10
>>> > >
>>> > >
>>> > >
>>> > > From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>> > > Sent: Thursday, 7 February 2019 14:08
>>> > > To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> > > Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> > >
>>> > >
>>> > >
>>> > > Hi,
>>> > >
>>> > > I am trying to use the RegexReplaceProcessorFactory to remove runs of
>>> > > two or more \n with any number of spaces between them (e.g. \n\n, \n \n,
>>> > > \n \n \n \n), and replace each run with two <br>.
>>> > >
>>> > > I use the following regex pattern and it works when I test it on
>>> > > regex101.com. But it does not work when I put it inside the
>>> > > RegexReplaceProcessorFactory as below:
>>> > >
>>> > > <updateRequestProcessorChain name="removeCode">
>>> > >   <processor class="solr.RegexReplaceProcessorFactory">
>>> > >     <str name="fieldName">content</str>
>>> > >     <str name="pattern">"(\\n\s*){2,}"</str>
>>> > >     <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >   </processor>
>>> > > </updateRequestProcessorChain>
>>> > >
>>> > > To explain my regex pattern further: \n\s* matches a \n followed by
>>> > > any amount of whitespace, and {2,} requires two or more consecutive
>>> > > occurrences of that group.
>>> > >
>>> > > Please let me know what is wrong and how I should do it.
>>> > >
>>> > > I am using Solr 7.6.0.
>>> > >
>>> > > Regards,
>>> > > Edwin
>>> > >
>>> >
>>>
>>
