lucene-java-user mailing list archives

From András Péteri <apet...@b2international.com>
Subject Re: Quiz question: Which Character.isSpaceChar but not isWhitespace?
Date Sun, 01 Nov 2015 23:14:58 GMT
Hi David,

While I agree on the quirkiness, at least it's documented (and the method
is probably kept as-is for backwards compatibility reasons); the first
bullet point of the corresponding Java SE 7 page says: "It is a Unicode
space character [...] but is not also a non-breaking space" [1].
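
To make the quiz answer concrete, here is a minimal sketch that scans every code point using only the JDK's Character methods. The class and method names are made up for illustration. Going by the Javadoc, it should report the three non-breaking spaces U+00A0, U+2007 and U+202F.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Lucene or the JDK.
class SpaceCharQuiz {

    /** Collects every code point that isSpaceChar accepts but isWhitespace rejects. */
    static List<Integer> spaceButNotWhitespace() {
        List<Integer> result = new ArrayList<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.isSpaceChar(cp) && !Character.isWhitespace(cp)) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Print each offending code point with its Unicode name.
        for (int cp : spaceButNotWhitespace()) {
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
        }
    }
}
```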

You can still override the isTokenChar method of WhitespaceTokenizer or
CharTokenizer in a subclass to exclude an extra set of characters from the
allowed range. If you are using Google's Guava library in your project,
they have a character matching predicate class which follows the Unicode
specification more closely [2]; this can also be used in isTokenChar as a
replacement.
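
A rough sketch of the first suggestion, under some assumptions: the predicate below is what I would put in the body of an isTokenChar(int) override in a CharTokenizer subclass, so that anything matched by either isWhitespace or isSpaceChar ends a token. The class name and the tokenize helper are hypothetical and use only the JDK; the helper is just a stand-in to show the effect, not how Lucene tokenizes. I left out the actual subclass because the tokenizer constructors differ between Lucene versions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone class; only the predicate is meant to be reused.
class UnicodeSpaceSplit {

    /** Candidate body for an isTokenChar(int) override: reject both kinds of space. */
    static boolean isTokenChar(int c) {
        return !(Character.isWhitespace(c) || Character.isSpaceChar(c));
    }

    /** Stand-in for a tokenizer: splits text into maximal runs of token characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isTokenChar(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A separator closes the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // U+00A0 and U+202F are non-breaking spaces that isWhitespace alone would miss.
        System.out.println(tokenize("foo\u00A0bar baz\u202Fqux"));
    }
}
```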

[1]
http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isWhitespace(char)
[2]
http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/base/CharMatcher.html#WHITESPACE

On Fri, Oct 30, 2015 at 9:10 PM, david.w.smiley@gmail.com <
david.w.smiley@gmail.com> wrote:

> One would think that all “space characters” are by definition
> “whitespace”.  Not true!:
> http://www.fileformat.info/info/unicode/char/00a0/index.htm
>
> So I’m working on an app where I can no longer use WhitespaceTokenizer
> since I need to check for isSpaceChar *OR* isWhitespace.  Alternatively I
> could use MappingCharFilter, I realize.
>
> This had trickle-down effects, triggered by a user’s search, on a search
> platform I’m working on.  It caused all sorts of head-scratching until we
> discovered what was going on.
>
> Craziness.
>
> ~ David
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
>

-- 
András Péteri
