lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Resolved: (LUCENE-1556) some valid email address characters not correctly recognized
Date Wed, 29 Sep 2010 05:50:33 GMT


Robert Muir resolved LUCENE-1556.

    Fix Version/s: 3.1
       Resolution: Fixed

fixed in LUCENE-2167

> some valid email address characters not correctly recognized
> ------------------------------------------------------------
>                 Key: LUCENE-1556
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>    Affects Versions: 2.4.1
>            Reporter: Paul Nilsson
>            Priority: Trivial
>             Fix For: 3.1, 4.0
> The EMAIL expression in StandardTokenizerImpl.jflex misses some unusual but valid characters
on the left-hand side of the email address. This causes an address to be broken into several
tokens, for example:
> gets broken into "somename" and ""
> husband& gets broken into "husband" and ""
> These seem to be occurring more often. The first seems to be because of an anti-spam
trick you can use with Google (see:
I see the second in several domains, but a disproportionate number are from, so
I expect it's a signup suggestion from the service.
> Perhaps a fix would be to change line 102 of StandardTokenizerImpl.jflex from:
> EMAIL      =  {ALPHANUM} (("."|"-"|"_") {ALPHANUM})* "@" {ALPHANUM} (("."|"-") {ALPHANUM})+
> to 
> EMAIL      =  {ALPHANUM} (("."|"-"|"_"|"+"|"&") {ALPHANUM})* "@" {ALPHANUM} (("."|"-")
> I'm aware that the StandardTokenizer is meant to be a basic implementation rather
than an implementation of the full standard, but it is quite useful in places and hopefully this
would improve it slightly.
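
The effect of the proposed change can be sketched outside JFlex. The following is a rough Python translation of the two EMAIL expressions, not the tokenizer itself. It assumes that {ALPHANUM} can be approximated by ASCII letters and digits, and that the domain part of the proposed expression (cut off in the quoted text) is the same as in the current one. The sample addresses and the helper names email_pattern and matches_whole are hypothetical.

```python
import re

# Assumption: ALPHANUM is approximated as ASCII letters and digits.
# The real JFlex macro also accepts other letters and digits.
ALPHANUM = r"[A-Za-z0-9]+"


def email_pattern(local_separators):
    """Build a regex mirroring the EMAIL rule for the given local-part separators."""
    seps = "|".join(re.escape(s) for s in local_separators)
    return re.compile(
        rf"{ALPHANUM}(?:(?:{seps}){ALPHANUM})*@{ALPHANUM}(?:(?:\.|-){ALPHANUM})+"
    )


# Current rule: only ".", "-" and "_" may join ALPHANUM runs before the "@".
CURRENT_EMAIL = email_pattern([".", "-", "_"])
# Proposed rule: additionally allow "+" and "&" before the "@".
PROPOSED_EMAIL = email_pattern([".", "-", "_", "+", "&"])


def matches_whole(pattern, text):
    """Return True if the entire text is recognized as a single EMAIL token."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    # Hypothetical addresses, chosen only to exercise the two extra characters.
    for address in (
        "user+tag@example.com",
        "alice&bob@example.com",
        "first.last@example.com",
    ):
        print(
            address,
            "current:", matches_whole(CURRENT_EMAIL, address),
            "proposed:", matches_whole(PROPOSED_EMAIL, address),
        )
```

Under these assumptions, an address with "+" or "&" on the left-hand side is rejected as a whole by the current expression and accepted by the proposed one, while an address that uses only "." is accepted by both.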
