lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: RegexReplaceProcessorFactory pattern to detect multiple \n
Date Fri, 15 Feb 2019 03:47:09 GMT
Hi,

For your info, this issue is occurring in Solr 7.7.0 as well.

Regards,
Edwin

On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
wrote:

> Hi,
>
> Should we report this as a bug in Solr?
>
> Regards,
> Edwin
>
> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> wrote:
>
>> Hi Paul,
>>
>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
>> https://regex101.com/, it gives the correct result for all the examples
>> (i.e. all of them end up with only <br><br>, and not more than that as we
>> are getting in Solr in our earlier examples).
>>
>> Could there be a possibility of a bug in Solr?
>>
>> Regards,
>> Edwin
>>
>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> wrote:
>>
>>> Hi Paul,
>>>
>>> We have tried it with the space preceding the \n, i.e.
>>> <str name="pattern">(\s*\n){2,}</str>, in the following configuration:
>>>
>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>    <str name="fieldName">content</str>
>>>    <str name="pattern">(\s*\n){2,}</str>
>>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> </processor>
>>>
>>> However, we are getting exactly the same results as in the earlier
>>> Examples 1, 2 and 3.
>>>
>>> As for your point 2, that the data may contain other (non-printing)
>>> characters than \n: we have found no non-printing characters. There are
>>> only newlines and spaces. You can refer to the original content in
>>> the same examples below.
>>>
>>>
>>> Example 1: A sentence where the above regex pattern works correctly
>>> *Original content in EML file:*
>>> Dear Sir,
>>>
>>>
>>> I am terminating
>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>
>>> Example 2: A sentence where the above regex pattern only partially
>>> works (as you can see, there are 4 <br> instead of 2)
>>> *Original content in EML file:*
>>>
>>> *exalted*
>>>
>>> *Psalm 89:17*
>>>
>>>
>>> 3 Choa Chu Kang Avenue 4
>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>> Choa Chu Kang Avenue 4, Singapore
>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>>  <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>>
>>> Example 3: A sentence where the above regex pattern only partially
>>> works (as you can see, there are 4 <br> instead of 2)
>>> *Original content in EML file:*
>>>
>>> http://www.concordpri.moe.edu.sg/
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Dec 18, 2018 at 10:07 AM
>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>> 2018 at 10:07 AM
>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>
>>>
>>> Appreciate any other ideas or suggestions that you may have.
>>>
>>> Thank you.
>>>
>>> Regards,
>>> Edwin
>>>
>>> On Thu, 7 Feb 2019 at 22:49, <paul.dodd@ub.unibe.ch> wrote:
>>>
>>>> Hi Edwin
>>>>
>>>>
>>>>
>>>>   1.  Sorry, the pattern was wrong; the space should precede the \n,
>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>>>>   2.  Perhaps the data contains other (non-printing) characters than
>>>> \n?
>>>>
>>>>
>>>>
>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>> Windows 10
>>>>
>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>>> Sent: Thursday, 7 February 2019 15:23
>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>
>>>>
>>>>
>>>> Hi Paul,
>>>>
>>>> We have tried the suggested regex pattern as follows:
>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>    <str name="fieldName">content</str>
>>>>    <str name="pattern">(\n\s*){2,}</str>
>>>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> </processor>
>>>>
>>>> But we still have exactly the same problem as in Examples 1, 2 and 3 below.
>>>>
>>>> Example 1: A sentence where the above regex pattern works correctly
>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>
>>>> Example 2: A sentence where the above regex pattern only partially
>>>> works (as you can see, there are 4 <br> instead of 2)
>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
>>>> Chu Kang Avenue 4, Singapore
>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>>>  <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>>>
>>>> Example 3: A sentence where the above regex pattern only partially
>>>> works (as you can see, there are 4 <br> instead of 2)
>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>>> 2018 at 10:07 AM
>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>>
>>>> Any further suggestions?
>>>>
>>>> Thank you.
>>>>
>>>> Regards,
>>>> Edwin
>>>>
>>>> On Thu, 7 Feb 2019 at 22:20, <paul.dodd@ub.unibe.ch> wrote:
>>>>
>>>> > To avoid the «\n+\s*» matching too many \n and then failing on the
>>>> > {2,} part you could try
>>>> >
>>>> >
>>>> >
>>>> > <str name="pattern">(\n\s*){2,}</str>
>>>> >
>>>> >
>>>> >
>>>> > If you also want to match CRLF then
>>>> >
>>>> > <str name="pattern">(\r?\n\s*){2,}</str>
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>> > Windows 10
>>>> >
>>>> > From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>>> > Sent: Thursday, 7 February 2019 15:10
>>>> > To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>> >
>>>> >
>>>> >
>>>> > Hi Paul,
>>>> >
>>>> > Thanks for your reply.
>>>> >
>>>> > When I use this pattern:
>>>> > <processor class="solr.RegexReplaceProcessorFactory">
>>>> >    <str name="fieldName">content</str>
>>>> >    <str name="pattern">(\n+\s*){2,}</str>
>>>> >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > </processor>
>>>> >
>>>> > It works for some sentences within the same content but not for
>>>> > others. Please see below for one that works and others that do
>>>> > not (they only partially work):
>>>> >
>>>> > Example 1: A sentence where the above regex pattern works correctly
>>>> > *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>> > *Index content: *    Dear Sir,  <br><br>I am terminating
>>>> >
>>>> > Example 2: A sentence where the above regex pattern only partially
>>>> > works (as you can see, there are 4 <br> instead of 2)
>>>> > *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>> > Choa Chu Kang Avenue 4, Singapore
>>>> > *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>>> >  <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>>> >
>>>> > Example 3: A sentence where the above regex pattern only partially
>>>> > works (as you can see, there are 4 <br> instead of 2)
>>>> > *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>>>> > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>>> > 2018 at 10:07 AM
>>>> > *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>> > <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>> >
>>>> > We would appreciate your help in finding out what is wrong.
>>>> >
>>>> > Thank you.
>>>> >
>>>> > Regards,
>>>> > Edwin
>>>> >
>>>> > On Thu, 7 Feb 2019 at 21:24, <paul.dodd@ub.unibe.ch> wrote:
>>>> >
>>>> > > You don’t say what happens, just that it is not working. I assume
>>>> > > nothing is replaced? Perhaps the pattern should be
>>>> > >
>>>> > >
>>>> > >
>>>> > >    <str name="pattern">"(\n\s*){2,}"</str>
>>>> > >
>>>> > >
>>>> > >
>>>> > > ??
>>>> > >
>>>> > >
>>>> > >
>>>> > > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>> > > Windows 10
>>>> > >
>>>> > > From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>>> > > Sent: Thursday, 7 February 2019 14:08
>>>> > > To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>> > > Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>> > >
>>>> > >
>>>> > >
>>>> > > Hi,
>>>> > >
>>>> > > I am trying to use the RegexReplaceProcessorFactory to replace two or
>>>> > > more \n with any number of spaces between them (e.g. \n\n, \n \n,
>>>> > > \n \n \n \n) with two <br>.
>>>> > >
>>>> > > I use the following regex pattern, and it works when I test it in
>>>> > > regex101.com. But it does not work when I put it inside the
>>>> > > RegexReplaceProcessorFactory as below:
>>>> > >
>>>> > > <updateRequestProcessorChain name="removeCode">
>>>> > > <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >    <str name="fieldName">content</str>
>>>> > >    <str name="pattern">"(\\n\s*){2,}"</str>
>>>> > >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > > </processor>
>>>> > > </updateRequestProcessorChain>
>>>> > >
>>>> > > To explain my regex pattern further: \s* tells the regex to match
>>>> > > any whitespace that follows a \n, and {2,} tells it to match 2 or
>>>> > > more occurrences of that group (a \n followed by optional whitespace).
>>>> > >
>>>> > > Please let me know what is wrong and how I should do it.
>>>> > >
>>>> > > I am using Solr 7.6.0.
>>>> > >
>>>> > > Regards,
>>>> > > Edwin
>>>> > >
>>>> >
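Editorial note on the very first configuration in this thread: the text
inside <str name="pattern"> is plain XML character data, so the surrounding
quotes and the doubled backslash in "(\\n\s*){2,}" are presumably handed to
the regex engine as literal characters to match. The sketch below shows that
effect with plain java.util.regex. It assumes the processor compiles the
string as-is and replaces with a literal replacement; the class and method
names are made up for illustration and are not part of Solr.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Solr.
final class PatternLiteralDemo {
    static final String REPLACEMENT = "<br><br>";

    // Compiles the regex text as given and replaces every match,
    // treating the replacement as a literal string.
    static String apply(String regexText, String input) {
        return Pattern.compile(regexText).matcher(input)
                .replaceAll(Matcher.quoteReplacement(REPLACEMENT));
    }

    public static void main(String[] args) {
        String input = "Dear Sir,  \n\n \n \n\n I am terminating";

        // Regex text exactly as in the first config: quotes and a doubled
        // backslash included. It looks for a quote character followed by
        // a literal backslash and the letter n, so nothing matches here.
        String asFirstPosted = "\"(\\\\n\\s*){2,}\"";

        // Regex text as in the later configs: no quotes, single backslash.
        String asLaterPosted = "(\\n\\s*){2,}";

        System.out.println(apply(asFirstPosted, input).equals(input));
        System.out.println(apply(asLaterPosted, input));
    }
}
```

Under these assumptions the first form leaves the text untouched, which
would fit the original report that nothing was replaced, while the second
form collapses the run in Example 1 to a single <br><br>.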
>>>>
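Editorial note on Examples 2 and 3: run against the text exactly as quoted
(only \n and ordinary spaces), java.util.regex gives a single <br><br> per
run, the same as the regex101 result reported above. One way to get two
<br><br> with something left between them is a character that looks like a
space but is not matched by \s. In Java, \s is ASCII-only by default, so a
no-break space (U+00A0) is one candidate. This is only a hypothesis; the
thread does not establish the cause. The sketch below compares both cases.
The input containing U+00A0 and the NBSP_AWARE pattern are invented for
illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Solr.
final class BlankLineCollapse {
    // Pattern from the thread.
    static final Pattern THREAD_PATTERN = Pattern.compile("(\\n\\s*){2,}");

    // Assumed variant that also accepts U+00A0 between the newlines.
    static final Pattern NBSP_AWARE = Pattern.compile("(\\n[\\s\\xA0]*){2,}");

    static String collapse(Pattern pattern, String input) {
        return pattern.matcher(input)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // Example 2 as quoted in the thread.
        String asQuoted = "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  "
                + "3 Choa Chu Kang Avenue 4, Singapore";

        // Invented variant: the gap between the blank lines is U+00A0.
        String withNbsp = "Psalm 89:17   \n\n\u00A0\n\n  3 Choa Chu Kang Avenue 4";

        System.out.println(collapse(THREAD_PATTERN, asQuoted));
        System.out.println(collapse(THREAD_PATTERN, withNbsp));
        System.out.println(collapse(NBSP_AWARE, withNbsp));
    }
}
```

If the real data does contain such a character, widening the character
class as in NBSP_AWARE (or normalising the text before indexing) would be
one option to try in the <str name="pattern"> value.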
>>>
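Editorial note on Paul's point 2 (other non-printing characters in the
data): characters such as \r or U+00A0 are easy to miss when the text is
viewed in a mail client or browser. A minimal sketch for checking the raw
field value is to list every code point outside printable ASCII. The class
name and the sample string are made up; in practice the input would be the
content as it is actually sent to Solr.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Solr.
final class InvisibleChars {
    // Returns every code point outside printable ASCII (0x20..0x7E),
    // formatted as "U+XXXX", in order of appearance.
    static List<String> find(String text) {
        List<String> found = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            if (cp < 0x20 || cp > 0x7E) {
                found.add(String.format("U+%04X", cp));
            }
        });
        return found;
    }

    public static void main(String[] args) {
        // Invented sample containing LF, CR and a no-break space.
        System.out.println(find("Psalm 89:17 \n\r\n\u00A0\n3 Choa"));
    }
}
```

Anything other than U+000A in the output for the affected passages would
be worth accounting for in the pattern.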
