lucene-solr-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject WhitespaceTokenizer to consider incorrectly encoded c2a0?
Date Wed, 08 Oct 2014 13:59:31 GMT
Hi,

For some crazy reason, some users manage to substitute a perfectly normal space with
a badly encoded non-breaking space. Properly URL encoded, this becomes %C2%A0, and depending
on the encoding you use to view it, you will probably see Â followed by a space. For example:

Because c2a0 is not considered whitespace (indeed, it is not real whitespace, that is 00a0)
by the Java Character class, the WhitespaceTokenizer won't split on it, but the WordDelimiterFilter
still does, somewhat mitigating the problem as it becomes:

HTMLSCF een abonnement
WT een abonnement
WDF een eenabonnement abonnement
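
The character handling described above can be checked outside Solr. The following is a minimal sketch, not the tokenizer's actual code: the class name NbspCheck and its helper methods are made up for illustration, and it assumes the two bytes in question are C2 A0. It decodes those bytes once as UTF-8 and once as ISO-8859-1, then asks the Java Character class how it classifies the results. One thing worth noting is that Character.isWhitespace explicitly excludes non-breaking spaces, so a correctly decoded U+00A0 is not reported as whitespace either, while Character.isSpaceChar does accept it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class NbspCheck {
    // The two bytes from the message: the UTF-8 encoding of U+00A0.
    static final byte[] NBSP_UTF8 = {(byte) 0xC2, (byte) 0xA0};

    // Correct decoding: yields the single character U+00A0.
    static String decodeUtf8() {
        return new String(NBSP_UTF8, StandardCharsets.UTF_8);
    }

    // Wrong decoding: each byte becomes its own character (U+00C2, U+00A0).
    static String decodeLatin1() {
        return new String(NBSP_UTF8, StandardCharsets.ISO_8859_1);
    }

    // Print the code point and both Character classifications for each char.
    static void describe(String label, String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            System.out.printf("%s: U+%04X isWhitespace=%b isSpaceChar=%b%n",
                    label, (int) c, Character.isWhitespace(c), Character.isSpaceChar(c));
        }
    }

    public static void main(String[] args) {
        describe("UTF-8", decodeUtf8());
        describe("ISO-8859-1", decodeLatin1());
    }
}
```

If the input was mis-decoded as in the second case, the stray U+00C2 is a letter rather than a space, so no whitespace-based check will treat the pair as a separator.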

Should the WhitespaceTokenizer not include this weird edge case? 

Cheers,
Markus
