lucene-solr-user mailing list archives

From Steve Rowe
Subject Re: UAX29 URL Email Tokenizer not working as expected
Date Tue, 07 May 2019 08:34:35 GMT
Hi Tom,

The documentation is wrong.  The sentence you quoted was inherited from Classic Tokenizer's
description.  UAX 29 URL Email Tokenizer is a specialization of Standard Tokenizer, the 7.2
documentation for which says the following:

    Note that words are split at hyphens.

I've made an issue to fix the Solr ref guide:

If you don't need the UAX#29 word break rules and identification of URLs and emails, you could
switch to Classic Tokenizer, which handles hyphens like you want.
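
A minimal sketch of that switch, assuming a schema field type along these lines (the name "text_classic" is hypothetical, and any filters you already use would go after the tokenizer). Per the Classic Tokenizer description, "ABC-123" should then stay a single token:

```xml
<!-- "text_classic" is a hypothetical field type name, for illustration only -->
<fieldType name="text_classic" class="solr.TextField">
  <analyzer>
    <!-- Classic Tokenizer: splits at hyphens unless the word contains a number -->
    <tokenizer class="solr.ClassicTokenizerFactory"/>
  </analyzer>
</fieldType>
```

A single <analyzer> element applies at both index and query time, so queries for "ABC-123" are tokenized the same way as the indexed text.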

Alternatively, if you want to continue using UAX29 URL Email Tokenizer, you could use a (pre-tokenization)
char filter to convert hyphens to something that won't trigger a word break, and then a (post-tokenization)
token filter to convert back to a hyphen, e.g. something like (untested; "_._" is an example
of a string that is unlikely to occur in your data and which will not trigger a word break[1]):

  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="(\d[A-Za-z]*)-([A-Za-z]*\d)" replacement="$1_._$2"/>
  <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="_\._" replacement="-"/>

(I'm guessing you'll need more than one PatternReplaceCharFilterFactory instance to handle
all permutations.)
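
As a rough illustration of those extra instances (equally untested, and reusing the "_._" placeholder from above), the chain might look like the sketch below. The second and third patterns are assumptions about which permutations matter: letters before the hyphen with a digit after it ("ABC-123"), and a digit before the hyphen with letters after it ("123-ABC").

```xml
<analyzer>
  <!-- digit on both sides of the hyphen, e.g. AB12-CD34 -> AB12_._CD34 -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="(\d[A-Za-z]*)-([A-Za-z]*\d)" replacement="$1_._$2"/>
  <!-- letters before the hyphen, digit after it, e.g. ABC-123 -> ABC_._123 -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="([A-Za-z])-([A-Za-z]*\d)" replacement="$1_._$2"/>
  <!-- digit before the hyphen, letters after it, e.g. 123-ABC -> 123_._ABC -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="(\d[A-Za-z]*)-([A-Za-z])" replacement="$1_._$2"/>
  <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
  <!-- restore the hyphen inside each token after tokenization -->
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="_\._" replacement="-"/>
</analyzer>
```

Purely alphabetic compounds such as "out-of-the-box" match none of these patterns, so they are still split at the hyphens. Because each match consumes the characters around the hyphen, a term with several hyphens (e.g. "A1-B2-C3") may have only some of them replaced in one pass; lookahead/lookbehind patterns would be one way around that.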

FYI the following note from UAX#29 explains why the default word break rules have hyphens
trigger word breaks:

    The correct interpretation of hyphens in the context
    of word boundaries is challenging. It is quite common
    for separate words to be connected with a hyphen:
    “out-of-the-box,” “under-the-table,” “Italian-American,”
    and so on. A significant number are hyphenated names,
    such as “Smith-Hawkins.” When doing a Whole Word Search
    or query, users expect to find the word within those
    hyphens. While there are some cases where they are
    separate words (usually to resolve some ambiguity such
    as “re-sort” as opposed to “resort”), it is better
    overall to keep the hyphen out of the default
    definition. Hyphens include U+002D HYPHEN-MINUS, 
    U+2010 HYPHEN, possibly also U+058A ARMENIAN HYPHEN,


[1] To figure out which chars to use to not trigger a word break, look at rules WB6, WB7,
WB8 & WB9 etc. in UAX#29 - "×" in these rules means "do
not break".  The MidLetter and MidNumLet character sets are your best bet for such chars.

> On May 6, 2019, at 7:22 AM, Tom Van Cuyck wrote:
> Hi,
> The UAX29 URL Email Tokenizer is not working as expected.
> According to the documentation:
> "Words are split
> at hyphens, unless there is a number in the word, in which case the token
> is not split and the numbers and hyphen(s) are preserved."
> So I expect "ABC-123" to remain "ABC-123"
> However the term is split in 2 separate tokens "ABC" and "123".
> Same for "AB12-CD34" --> "AB12" and "CD34" etc...
> Is this behavior to be expected? Or is there a way to get the behavior I
> expect?
> Kind regards, Tom