lucene-solr-user mailing list archives

From Tom Van Cuyck
Subject UAX29 URL Email Tokenizer not working as expected
Date Mon, 06 May 2019 11:22:33 GMT

The UAX29 URL Email Tokenizer is not working as expected.
According to the documentation: "Words are split
at hyphens, unless there is a number in the word, in which case the token
is not split and the numbers and hyphen(s) are preserved."

So I expect "ABC-123" to remain "ABC-123".
However, the term is split into two separate tokens, "ABC" and "123".

Same for "AB12-CD34" --> "AB12" and "CD34" etc...
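
To make the expectation concrete, here is a minimal sketch in plain Python of the rule as quoted from the documentation. It is not Lucene's implementation and does not use any Lucene or Solr API; the function name `tokenize_expected` is hypothetical, and it assumes words are separated by whitespace only.

```python
import re


def tokenize_expected(text):
    """Illustrate the quoted rule: split words at hyphens,
    unless the word contains a digit, in which case keep it whole."""
    tokens = []
    for word in text.split():
        if "-" in word and not re.search(r"\d", word):
            # No digit in the word: split at hyphens, dropping empty parts.
            tokens.extend(part for part in word.split("-") if part)
        else:
            # Word contains a digit (or has no hyphen): keep it as one token.
            tokens.append(word)
    return tokens


print(tokenize_expected("ABC-123 AB12-CD34 well-known"))
# prints ['ABC-123', 'AB12-CD34', 'well', 'known']
```

This is the output I expect for the examples above, as opposed to the split tokens the tokenizer actually produces.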

Is this behavior to be expected? Or is there a way to get the behavior I expect?

Kind regards, Tom



Tom Van Cuyck
Software Engineer
