lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: WordDelimiterGraphFilter swallows emojis
Date Tue, 03 Jul 2018 13:21:42 GMT
On Tue, Jul 3, 2018 at 8:00 AM, Michael Sokolov <msokolov@gmail.com> wrote:
> WDGF (and WordDelimiterFilter) treat emoji as "SUBWORD_DELIM" characters
> like punctuation and thus remove them, but we would like to be able to
> search for emoji and use this filter for handling dashes, dots and other
> intra-word punctuation.
>
> These filters identify non-word and non-digit characters by two mechanisms:
> direct lookup in a character table, and fallback to Unicode class. The
> character table can't easily be used to handle emoji since it would need to
> be populated with the entire Unicode character set in order to reach
> emoji-land. On the other hand, if we change the handling of emoji by class,
> and say treat them as word-characters, this will also end up pulling in all
> the other OTHER_SYMBOL characters as well. Maybe that's OK, but I think
> some of these other symbols are more like punctuation (this class is a grab
> bag of all kinds of beautiful dingbats like trademark, degrees-symbols, etc
> https://www.compart.com/en/unicode/category/So). On the other other hand,
> how do we even identify emoji? I don't think the Java Character API is
> adequate to the task. Perhaps we must incorporate a table.

There are several Unicode properties for identifying emoji (see e.g. the
Unicode segmentation algorithms, and the tagging function in
ICUTokenizer), but they are not based on general category. Additionally,
an emoji may not be a single character but a sequence, so it's more
involved than what WordDelimiterFilter is really ready for. I also don't
think we should start storing/maintaining Unicode property tables
ourselves. If we want to fix WordDelimiterFilter, it should just depend
on ICU instead.
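
To illustrate the two points above, here is a minimal sketch using only
java.lang.Character (no Lucene or ICU). The class and method names are
made up for this example. It shows that an emoji shares the OTHER_SYMBOL
general category with the trademark and degree signs, and that a "family"
emoji is a sequence of several code points joined by ZERO WIDTH JOINER.

```java
public class EmojiCategoryDemo {
    // Hypothetical helper: names the few general categories used below.
    static String categoryName(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.OTHER_SYMBOL: return "OTHER_SYMBOL";
            case Character.FORMAT:       return "FORMAT";
            default: return "other(" + Character.getType(codePoint) + ")";
        }
    }

    // MAN + ZWJ + WOMAN + ZWJ + GIRL: rendered as one emoji where supported.
    static String familySequence() {
        return new StringBuilder()
                .appendCodePoint(0x1F468).appendCodePoint(0x200D)
                .appendCodePoint(0x1F469).appendCodePoint(0x200D)
                .appendCodePoint(0x1F467)
                .toString();
    }

    public static void main(String[] args) {
        // General category cannot tell an emoji from other symbols.
        System.out.println("U+1F600 grinning face: " + categoryName(0x1F600));
        System.out.println("U+2122 trade mark:     " + categoryName(0x2122));
        System.out.println("U+00B0 degree sign:    " + categoryName(0x00B0));

        // One visible emoji, several code points, even more Java chars.
        String family = familySequence();
        System.out.println("code points: " + family.codePointCount(0, family.length()));
        System.out.println("chars:       " + family.length());
        System.out.println("U+200D joiner:         " + categoryName(0x200D));
    }
}
```

All three symbols report OTHER_SYMBOL, so a rule keyed on that category
would pull in the trademark and degree signs along with the emoji.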

> Suppose we come up with a good way to classify emoji; then how should they
> be treated in this class? Sometimes they may be embedded in tokens with
> other characters: I see people using emoji and other symbols as part of
> their names, and sometimes they stand alone (with whitespace separation). I
> think one way forward here would be to treat these as a special class akin
> to words and numbers, and provide similar options (SPLIT_ON_EMOJI,
> CATENATE_EMOJI) as we have for those classes.
>
> Or maybe as a convenience, we provide a way to get a table that encodes the
> default classifications of all characters up to some given limit, and then
> let the caller modify it? That would at least provide an easy way to treat
> emoji as letters.

There is already a way to provide a table to this thing. But one
bigger issue is that WordDelimiterFilter doesn't operate on Unicode
code points, so I don't think you are going to be able to do what you
want, since most emoji are not in the BMP. WordDelimiterFilter is
really only suitable for categorizing characters in the BMP; it just
doesn't split surrogates.
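
A rough sketch of the BMP problem, again with plain java.lang only (this
is not the filter's actual code, and the class name is invented for
illustration). A supplementary-plane emoji occupies two Java chars, each
of which is classified as SURROGATE when looked at on its own. Only a
code point lookup sees the real character.

```java
public class SurrogateDemo {
    // U+1F600 GRINNING FACE, built from its code point.
    static String grinning() {
        return new String(Character.toChars(0x1F600));
    }

    public static void main(String[] args) {
        String s = grinning();

        // One code point, but two UTF-16 code units (chars).
        System.out.println("chars:       " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));

        // Classifying char by char only ever sees surrogates.
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            System.out.printf("char %d: U+%04X surrogate=%b type==SURROGATE=%b%n",
                    i, (int) c, Character.isSurrogate(c),
                    Character.getType(c) == Character.SURROGATE);
        }

        // Classifying by code point sees the symbol itself.
        int cp = s.codePointAt(0);
        System.out.println("code point type==OTHER_SYMBOL: "
                + (Character.getType(cp) == Character.OTHER_SYMBOL));

        // A table indexed by char can never have a slot for this code point.
        System.out.println("fits in a char index: " + (cp <= Character.MAX_VALUE));
    }
}
```

So any classification driven by individual chars, whether from a table
or from Character.getType(char), has no way to say "this is an emoji"
for characters outside the BMP.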
