lucene-java-user mailing list archives

From Michael Sokolov <>
Subject WordDelimiterGraphFilter swallows emojis
Date Tue, 03 Jul 2018 12:00:37 GMT
WDGF (and WordDelimiterFilter) treat emoji as "SUBWORD_DELIM" characters
like punctuation and thus remove them, but we would like to be able to
search for emoji and use this filter for handling dashes, dots and other
intra-word punctuation.

These filters identify non-word and non-digit characters by two mechanisms:
direct lookup in a character table, and fallback to Unicode class. The
character table can't easily be used to handle emoji since it would need to
be populated with the entire Unicode character set in order to reach
emoji-land. On the other hand, if we change the handling of emoji by class,
and say treat them as word-characters, this will also end up pulling in all
the other OTHER_SYMBOL characters as well. Maybe that's OK, but I think
some of these other symbols are more like punctuation (this class is a grab
bag of all kinds of beautiful dingbats like trademark, degree symbols, etc.).
On the other other hand,
how do we even identify emoji? I don't think the Java Character API is
adequate to the task. Perhaps we must incorporate a table.
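To illustrate the problem with classifying by Unicode class, here is a minimal sketch using only the standard java.lang.Character API. The class name EmojiTypeCheck and the sample code points are chosen purely for illustration; the point is that emoji and dingbats such as the trademark sign report the same general category, so the category alone cannot tell them apart.

```java
public class EmojiTypeCheck {
    // True when the code point's Unicode general category is "So" (OTHER_SYMBOL).
    static boolean isOtherSymbol(int codePoint) {
        return Character.getType(codePoint) == Character.OTHER_SYMBOL;
    }

    public static void main(String[] args) {
        // Grinning face, heavy black heart, trademark, degree sign, copyright sign.
        int[] samples = {0x1F600, 0x2764, 0x2122, 0x00B0, 0x00A9};
        for (int cp : samples) {
            System.out.printf("U+%04X otherSymbol=%b%n", cp, isOtherSymbol(cp));
        }
    }
}
```

All five samples fall in OTHER_SYMBOL, which is why changing the handling of that class affects far more than emoji.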

Suppose we come up with a good way to classify emoji; then how should they
be treated in this class? Sometimes they may be embedded in tokens with
other characters: I see people using emoji and other symbols as part of
their names, and sometimes they stand alone (with whitespace separation). I
think one way forward here would be to treat these as a special class akin
to words and numbers, and provide similar options (SPLIT_ON_EMOJI,
CATENATE_EMOJI) as we have for those classes.
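As a rough sketch of what a separate emoji class could look like, the standalone code below splits a token into runs of word characters and emoji. None of this is Lucene code: the class EmojiSplitSketch, the splitOnEmoji parameter (standing in for the proposed SPLIT_ON_EMOJI option), and especially the isEmoji range check are assumptions for illustration only. The range check is deliberately crude and incomplete; a real implementation would need a proper table.

```java
import java.util.ArrayList;
import java.util.List;

public class EmojiSplitSketch {
    static final int DELIM = 0;
    static final int WORD = 1;
    static final int EMOJI = 2;

    // Hypothetical, incomplete emoji test: two common blocks only.
    static boolean isEmoji(int cp) {
        return (cp >= 0x1F300 && cp <= 0x1FAFF) || (cp >= 0x2600 && cp <= 0x27BF);
    }

    static int classOf(int cp) {
        if (isEmoji(cp)) return EMOJI;
        if (Character.isLetterOrDigit(cp)) return WORD;
        return DELIM;
    }

    // Splits on delimiters; emoji form their own parts only when splitOnEmoji is set,
    // otherwise they are treated like word characters.
    static List<String> split(String token, boolean splitOnEmoji) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int currentClass = DELIM;
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            i += Character.charCount(cp);
            int cls = classOf(cp);
            if (cls == EMOJI && !splitOnEmoji) cls = WORD;
            if (cls != currentClass && current.length() > 0) {
                parts.add(current.toString());
                current.setLength(0);
            }
            currentClass = cls;
            if (cls != DELIM) current.appendCodePoint(cp);
        }
        if (current.length() > 0) parts.add(current.toString());
        return parts;
    }
}
```

With this sketch, a token such as "wi-fi" followed by a grinning face yields three parts when splitOnEmoji is true and two parts (the emoji staying attached to "fi") when it is false.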

Or maybe as a convenience, we provide a way to get a table that encodes the
default classifications of all characters up to some given limit, and then
let the caller modify it? That would at least provide an easy way to treat
emoji as letters.
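The convenience-table idea might look something like the following sketch. It is standalone Java, not the filter's actual code: the class DefaultTypeTable, its type constants, and the buildTable and lookup methods are hypothetical stand-ins that only loosely mirror the lookup-then-fallback scheme described above, and it works on whole code points for simplicity.

```java
public class DefaultTypeTable {
    // Hypothetical type codes, used only within this sketch.
    static final byte LOWER = 0x01;
    static final byte UPPER = 0x02;
    static final byte DIGIT = 0x04;
    static final byte SUBWORD_DELIM = 0x08;
    static final byte ALPHA = 0x03; // LOWER | UPPER

    // Default classification derived from the Unicode general category.
    static byte classify(int cp) {
        switch (Character.getType(cp)) {
            case Character.UPPERCASE_LETTER: return UPPER;
            case Character.LOWERCASE_LETTER: return LOWER;
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
                return ALPHA;
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.LETTER_NUMBER:
            case Character.OTHER_NUMBER:
                return DIGIT;
            default:
                return SUBWORD_DELIM;
        }
    }

    // Precomputes the default classification for every code point below limit.
    static byte[] buildTable(int limit) {
        byte[] table = new byte[limit];
        for (int cp = 0; cp < limit; cp++) {
            table[cp] = classify(cp);
        }
        return table;
    }

    // Table lookup with fallback to the category-based default.
    static byte lookup(byte[] table, int cp) {
        return cp < table.length ? table[cp] : classify(cp);
    }
}
```

A caller could then build the table up to, say, 0x20000, overwrite the entries for the code points it considers emoji with ALPHA, and leave everything else at its default.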

Any thoughts?
