lucene-solr-user mailing list archives

From Alexandre Rafalovitch <arafa...@gmail.com>
Subject Re: WhitespaceTokenizer to consider incorrectly encoded c2a0?
Date Wed, 08 Oct 2014 14:11:27 GMT
Is this a suggestion for a JIRA ticket, or a question on how to solve
it? If the latter, you could probably stick a RegEx replacement in the
UpdateRequestProcessor chain and be done with it.
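
Something along these lines in solrconfig.xml, for example (an untested sketch on my
part; the chain name and field names below are placeholders you would adapt to your
schema, and the chain still has to be referenced from your update handler):

    <updateRequestProcessorChain name="strip-nbsp">
      <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">title</str>
        <str name="fieldName">content</str>
        <!-- \u00A0 is the non-breaking space; replace it with a plain space -->
        <str name="pattern">\u00A0</str>
        <str name="replacement"> </str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>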

As to why? I would look for the rest of the MSWord-generated
artifacts, such as "smart" quotes, extra-long dashes, etc.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On 8 October 2014 09:59, Markus Jelsma <markus.jelsma@openindex.io> wrote:
> Hi,
>
> For some crazy reason, some users somehow manage to substitute a perfectly normal space
> with a badly encoded non-breaking space; URL encoded, this becomes %C2%A0, and depending
> on the encoding you use to view it, you probably see Â followed by a space. For example:
>
> Because c2a0 is not considered whitespace by the Java Character class (indeed, it is not
> even the real character; that is 00a0, which c2 a0 encodes in UTF-8), the
> WhitespaceTokenizer won't split on it, but the WordDelimiterFilter still does, which
> somewhat mitigates the problem, as it becomes:
>
> HTMLSCF een abonnement
> WT een abonnement
> WDF een eenabonnement abonnement
>
> Shouldn't the WhitespaceTokenizer handle this weird edge case as well?
>
> Cheers,
> Markus
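
A quick way to double-check the Character behaviour being described (my own throwaway
snippet; the sample string is made up, not taken from Markus's document):

    // Quick check: U+00A0 (UTF-8 bytes C2 A0) is not whitespace as far as
    // java.lang.Character is concerned, and the WhitespaceTokenizer splits on
    // Character.isWhitespace(), so "een\u00A0abonnement" stays a single token.
    public class NbspCheck {
        public static void main(String[] args) {
            char nbsp = '\u00A0';
            System.out.println(Character.isWhitespace(' '));   // true  -> regular space splits
            System.out.println(Character.isWhitespace(nbsp));  // false -> no split on U+00A0
            System.out.println(Character.isSpaceChar(nbsp));   // true  -> a Unicode space separator, but not "whitespace"
        }
    }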
