lucene-solr-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: WhitespaceTokenizer to consider incorrectly encoded c2a0?
Date Wed, 08 Oct 2014 14:16:02 GMT
Alexandre - I am sorry if I was not clear: this is about queries, and it all happens at query
time. Yes, we can do the substitution with the regex replace filter, but I would propose
adding this weird exception to WhitespaceTokenizer so Lucene deals with it by itself.

Markus
 
-----Original message-----
> From:Alexandre Rafalovitch <arafalov@gmail.com>
> Sent: Wednesday 8th October 2014 16:12
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: WhitespaceTokenizer to consider incorrectly encoded c2a0?
> 
> Is this a suggestion for JIRA ticket? Or a question on how to solve
> it? If the latter, you could probably stick a RegEx replacement in the
> UpdateRequestProcessor chain and be done with it.
> 
> As to why? I would look for the rest of the MSWord-generated
> artifacts, such as "smart" quotes, extra-long dashes, etc.
> 
> Regards,
>    Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
> 
> 
> On 8 October 2014 09:59, Markus Jelsma <markus.jelsma@openindex.io> wrote:
> > Hi,
> >
> > For some crazy reason, some users somehow manage to substitute a perfectly normal
> > space with a badly encoded non-breaking space. Properly URL encoded, this then becomes %c2%a0,
> > and depending on the encoding you use to view it, you probably see Â followed by a space. For
> > example:
> >
> > Because the Java Character class does not consider c2a0 whitespace (indeed, it is not real
> > whitespace; that is 00a0), the WhitespaceTokenizer won't split on it, but the WordDelimiterFilter
> > still does, somewhat mitigating the problem, as it becomes:
> >
> > HTMLSCF een abonnement
> > WT een abonnement
> > WDF een eenabonnement abonnement
> >
> > Should the WhitespaceTokenizer not include this weird edge case?
> >
> > Cheers,
> > Markus
> 
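
The behaviour discussed in this thread can be reproduced with the JDK alone. The following is only a minimal sketch, assuming that the tokenizer's split decision comes down to `Character.isWhitespace`; the class `NbspQueryNormalizer` and its methods `normalize` and `splitOnJavaWhitespace` are hypothetical names made up for illustration and are not part of Lucene or Solr. The sketch decodes the bytes c2 a0 as UTF-8, which yields U+00A0, checks how `Character.isWhitespace` treats that character (the JDK documentation excludes non-breaking spaces), and shows the effect of replacing it with a plain space before splitting.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class NbspQueryNormalizer {

    // Hypothetical helper (not a Lucene or Solr API): map U+00A0 to a plain space.
    static String normalize(String query) {
        return query.replace('\u00A0', ' ');
    }

    // Stand-in for whitespace tokenization: a token is a maximal run of
    // characters for which Character.isWhitespace returns false.
    static List<String> splitOnJavaWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // Whitespace closes the token collected so far, if any.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The bytes c2 a0 are the UTF-8 encoding of the non-breaking space U+00A0.
        String nbsp = new String(new byte[] {(byte) 0xC2, (byte) 0xA0}, StandardCharsets.UTF_8);
        String query = "een" + nbsp + "abonnement";

        System.out.println(Character.isWhitespace(nbsp.charAt(0)));     // false
        System.out.println(splitOnJavaWhitespace(query).size());        // 1
        System.out.println(splitOnJavaWhitespace(normalize(query)));    // [een, abonnement]
    }
}
```

Under these assumptions the unnormalized query stays a single token, while the normalized one splits into "een" and "abonnement". Whether the replacement belongs in a char filter, an update processor, or the tokenizer itself is exactly the open question of the thread; the sketch takes no position on that.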
