lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: RegexReplaceProcessorFactory pattern to detect multiple \n
Date Wed, 20 Feb 2019 05:17:34 GMT
Hi,

We have tried with the following pattern ([ \t]*\r?\n){2,} and
configuration:

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">([ \t]*\r?\n){2,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
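
For reference, a minimal Java sketch of what this pattern does to the Example 2 text as transcribed further down in this thread. It assumes the gaps contain only ASCII spaces and \n, and it runs plain java.util.regex outside Solr, so it only approximates what the update processor does. The class and method names are hypothetical, and Matcher.quoteReplacement stands in for literalReplacement=true.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr: applies the pattern from the
// configuration above using plain java.util.regex.
final class CollapseBlankLines {
    // Two or more line ends, each optionally preceded by spaces or tabs
    static final Pattern BLANK_RUN = Pattern.compile("([ \\t]*\\r?\\n){2,}");

    static String collapse(String content) {
        // quoteReplacement makes the replacement literal text
        return BLANK_RUN.matcher(content)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // Example 2 as transcribed in this thread; ASSUMES plain spaces only
        String example2 =
                "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4";
        // prints: exalted<br><br>   Psalm 89:17<br><br>  3 Choa Chu Kang Avenue 4
        System.out.println(collapse(example2));
    }
}
```

Under those assumptions each gap collapses to a single <br><br>, so a result with four <br> would suggest that the indexed text differs from the transcription in some way.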

However, the issue is still occurring.
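
One way to double-check what actually sits between the newlines is to print the code points of the raw field value. The sketch below is a hypothetical diagnostic (the class and method names are made up for illustration). It rests on an unconfirmed assumption: that a character outside Java's default \s, for example U+00A0 NO-BREAK SPACE, is present in a gap. Such a character would split one blank run into two matches, which resembles the four <br> seen in Examples 2 and 3.

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical diagnostic, not part of Solr.
final class WhitespaceProbe {
    // Render each code point as U+XXXX so invisible characters become visible
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    static String collapse(String regex, String content) {
        return Pattern.compile(regex).matcher(content).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // ASSUMPTION: the gap holds a no-break space instead of a plain space
        String suspect = "17\n\n\u00A0\n\n3";
        System.out.println(codePoints(suspect));
        // Default \s does not cover U+00A0, so the run is replaced twice
        System.out.println(collapse("(\\s*\\n){2,}", suspect));
        // (?U) turns on UNICODE_CHARACTER_CLASS, so \s also covers U+00A0
        System.out.println(collapse("(?U)(\\s*\\n){2,}", suspect));
    }
}
```

If a dump of the real field value shows anything other than U+0020, U+0009, U+000D and U+000A inside the gaps, that character would be a candidate explanation for the extra <br><br>.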

Is anyone else able to help?

Regards,
Edwin

On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
wrote:

> Hi,
>
> For your info, this issue is occurring in Solr 7.7.0 as well.
>
> Regards,
> Edwin
>
> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> wrote:
>
>> Hi,
>>
>> Should we report this as a bug in Solr?
>>
>> Regards,
>> Edwin
>>
>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> wrote:
>>
>>> Hi Paul,
>>>
>>> Regarding the regex (\n\s*){2,} that we are using: when we try it on
>>> https://regex101.com/, it gives the correct result for all the
>>> examples (i.e. each of them ends up with only <br><br>, not more than
>>> that, unlike what we are getting in Solr in our earlier examples).
>>>
>>> Could there be a possibility of a bug in Solr?
>>>
>>> Regards,
>>> Edwin
>>>
>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>> wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> We have tried it with the space preceding the \n, i.e.
>>>> <str name="pattern">(\s*\n){2,}</str>, in the following configuration:
>>>>
>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>    <str name="fieldName">content</str>
>>>>    <str name="pattern">(\s*\n){2,}</str>
>>>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> </processor>
>>>>
>>>> However, we are getting exactly the same results as in the earlier
>>>> Examples 1, 2 and 3.
>>>>
>>>> As for your point 2, that the data may contain other (non-printing)
>>>> characters than \n: we have found no non-printing characters. There are
>>>> only line breaks and spaces. You can refer to the original content in
>>>> the same examples below.
>>>>
>>>>
>>>> Example 1: A sentence on which the above regex pattern works
>>>> correctly
>>>> *Original content in EML file:*
>>>> Dear Sir,
>>>>
>>>>
>>>> I am terminating
>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>
>>>> Example 2: A sentence on which the above regex pattern only partially
>>>> works (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> *Original content in EML file:*
>>>>
>>>> *exalted*
>>>>
>>>> *Psalm 89:17*
>>>>
>>>>
>>>> 3 Choa Chu Kang Avenue 4
>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>> Choa Chu Kang Avenue 4, Singapore
>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>>>  <br><br>3
>>>> Choa Chu Kang Avenue 4, Singapore
>>>>
>>>> Example 3: A sentence on which the above regex pattern only partially
>>>> works (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> *Original content in EML file:*
>>>>
>>>> http://www.concordpri.moe.edu.sg/
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>>> 2018 at 10:07 AM
>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>>
>>>>
>>>> Appreciate any other ideas or suggestions that you may have.
>>>>
>>>> Thank you.
>>>>
>>>> Regards,
>>>> Edwin
>>>>
>>>> On Thu, 7 Feb 2019 at 22:49, <paul.dodd@ub.unibe.ch> wrote:
>>>>
>>>>> Hi Edwin
>>>>>
>>>>>
>>>>>
>>>>>   1.  Sorry, the pattern was wrong; the space should precede the \n,
>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>>>>>   2.  Perhaps the data contains other (non-printing) characters
>>>>> than \n?
>>>>>
>>>>>
>>>>>
>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>>> Windows 10
>>>>>
>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>>>> Sent: Thursday, 7 February 2019 15:23
>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>
>>>>>
>>>>>
>>>>> Hi Paul,
>>>>>
>>>>> We have tried this suggested regex pattern as follows:
>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>    <str name="fieldName">content</str>
>>>>>    <str name="pattern">(\n\s*){2,}</str>
>>>>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>> </processor>
>>>>>
>>>>> But we still have exactly the same problem as in Examples 1, 2 and 3 below.
>>>>>
>>>>> Example 1: A sentence on which the above regex pattern works
>>>>> correctly
>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>>
>>>>> Example 2: A sentence on which the above regex pattern only partially
>>>>> works (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>>> Choa
>>>>> Chu Kang Avenue 4, Singapore
>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>>>>  <br><br>3
>>>>> Choa Chu Kang Avenue 4, Singapore
>>>>>
>>>>> Example 3: A sentence on which the above regex pattern only partially
>>>>> works (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n
>>>>> \n \n\n
>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>>>> 2018
>>>>> at 10:07 AM
>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>>> <br><br>On
>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>>>>
>>>>> Any further suggestion?
>>>>>
>>>>> Thank you.
>>>>>
>>>>> Regards,
>>>>> Edwin
>>>>>
>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.dodd@ub.unibe.ch> wrote:
>>>>>
>>>>> > To avoid the «\n+\s*» matching too many \n and then failing on the
>>>>> > {2,} part, you could try
>>>>> >
>>>>> >
>>>>> >
>>>>> > <str name="pattern">(\n\s*){2,}</str>
>>>>> >
>>>>> >
>>>>> >
>>>>> > If you also want to match CRLF then
>>>>> >
>>>>> > <str name="pattern">(\r?\n\s*){2,}</str>
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>>> > Windows 10
>>>>> >
>>>>> > From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>>>> > Sent: Thursday, 7 February 2019 15:10
>>>>> > To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>> >
>>>>> >
>>>>> >
>>>>> > Hi Paul,
>>>>> >
>>>>> > Thanks for your reply.
>>>>> >
>>>>> > When I use this pattern:
>>>>> > <processor class="solr.RegexReplaceProcessorFactory">
>>>>> >    <str name="fieldName">content</str>
>>>>> >    <str name="pattern">(\n+\s*){2,}</str>
>>>>> >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>> > </processor>
>>>>> >
>>>>> > It works for some sentences within the same content but not for
>>>>> > others. Please see below for one that works and others that only
>>>>> > partially work:
>>>>> >
>>>>> > Example 1: A sentence on which the above regex pattern works correctly
>>>>> > *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>>> > *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>> >
>>>>> > Example 2: A sentence on which the above regex pattern only partially
>>>>> > works (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> > *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>>> > Choa Chu Kang Avenue 4, Singapore
>>>>> > *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>>>> >  <br><br>3
>>>>> > Choa Chu Kang Avenue 4, Singapore
>>>>> >
>>>>> > Example 3: A sentence on which the above regex pattern only partially
>>>>> > works (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> > *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n
>>>>> > \n
>>>>> > \n\n
>>>>> > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec
>>>>> > 18, 2018 at 10:07 AM
>>>>> > *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>>> > <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>>> >
>>>>> > We would appreciate your help to see what is wrong?
>>>>> >
>>>>> > Thank you.
>>>>> >
>>>>> > Regards,
>>>>> > Edwin
>>>>> >
>>>>> > On Thu, 7 Feb 2019 at 21:24, <paul.dodd@ub.unibe.ch> wrote:
>>>>> >
>>>>> > > You don’t say what happens, just that it is not working. I assume
>>>>> > > nothing is replaced? Perhaps the pattern should be
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > >    <str name="pattern">"(\n\s*){2,}"</str>
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > > ??
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>>> > > Windows 10
>>>>> > >
>>>>> > > From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>>>> > > Sent: Thursday, 7 February 2019 14:08
>>>>> > > To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>>> > > Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > > Hi,
>>>>> > >
>>>>> > > I am trying to use the RegexReplaceProcessorFactory to find runs of
>>>>> > > two or more \n with any number of spaces between them (e.g. \n\n,
>>>>> > > \n \n, \n \n \n \n) and replace each run with two <br>.
>>>>> > >
>>>>> > > The following regex pattern works when I test it on regex101.com,
>>>>> > > but it does not work when I put it inside the
>>>>> > > RegexReplaceProcessorFactory as below:
>>>>> > >
>>>>> > > <updateRequestProcessorChain name="removeCode">
>>>>> > > <processor class="solr.RegexReplaceProcessorFactory">
>>>>> > >    <str name="fieldName">content</str>
>>>>> > >    <str name="pattern">"(\\n\s*){2,}"</str>
>>>>> > >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>> > > </processor>
>>>>> > >           </updateRequestProcessorChain>
>>>>> > >
>>>>> > > To explain my regex pattern further: \s* tells the regex to match
>>>>> > > any whitespace that follows a \n, and {2,} tells it to match 2 or
>>>>> > > more occurrences of that group (\n followed by optional whitespace).
>>>>> > >
>>>>> > > Please let me know what is wrong and how I should do it.
>>>>> > > I am using Solr 7.6.0.
>>>>> > >
>>>>> > > Regards,
>>>>> > > Edwin
>>>>> > >
>>>>> >
>>>>>
>>>>
