lucene-java-user mailing list archives

From Peyman Faratin <pey...@robustlinks.com>
Subject Re: StandardTokenizer
Date Fri, 30 Sep 2011 13:47:18 GMT
Thank you, Ian.

On Sep 30, 2011, at 4:19 AM, Ian Lea wrote:

> This all changed with the 3.1 release.  See
> http://lucene.apache.org/java/3_1_0/changes/Changes.html#3.1.0.api_changes,
> number 18.
> 
> You can get the old behaviour with StandardAnalyzer by passing
> Version.LUCENE_30, or you could look at UAX29URLEmailTokenizer which should
> pick up the email component, although probably not the apostrophe.
> 
> 
> --
> Ian.
> 
> 
> On Thu, Sep 29, 2011 at 7:51 PM, Peyman Faratin <peyman@robustlinks.com> wrote:
>> Hi
>> 
>> I have a sentence
>> 
>> "i'll email you at x@abc.com"
>> 
>> and I am looking at the tokens a StandardAnalyzer (which uses the
>> StandardTokenizer) produces
>> 
>> 1: [i'll:0->4:<ALPHANUM>]
>> 2: [email:5->10:<ALPHANUM>]
>> 3: [you:11->14:<ALPHANUM>]
>> 5: [x:18->19:<ALPHANUM>]
>> 6: [abc.com:20->27:<ALPHANUM>]
>> 
>> I am using the following constructor
>> 
>>    new StandardAnalyzer(Version.LUCENE_32),
>> 
>> My question is:
>> 
>> 1- shouldn't we be seeing a token x@abc.com (since that is the grammar
>> of StandardAnalyzer)? and
>> 
>> 2- shouldn't the token type be "email" for abc.com and "apostrophe" for "i'll"?
>> 
>> thank you
>> 
>> Peyman
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
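For readers of the archive, the difference discussed above (a word-break tokenizer that splits at "@" versus an email-aware one that keeps the address whole) can be illustrated with a toy sketch in plain Java. This is not Lucene code and does not reproduce its grammar: the class name EmailTokenSketch, both regular expressions, and the type labels are assumptions made up for illustration. It also does no stop-word removal, so "at" shows up here, whereas it is absent from the output in the original question (most likely dropped by the analyzer's stop filter, which would explain the jump from position 3 to 5).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy illustration only; hypothetical names, not part of Lucene. */
class EmailTokenSketch {

    // Simplified "word" rule: runs of letters/digits, optionally joined by a
    // single apostrophe or dot, so "i'll" and "abc.com" stay whole.
    // '@' is never part of a word, so an address is split around it.
    static final Pattern WORD =
            Pattern.compile("[A-Za-z0-9]+(?:['.][A-Za-z0-9]+)*");

    // Simplified email rule tried first, falling back to the word rule.
    static final Pattern EMAIL_OR_WORD = Pattern.compile(
            "[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+|" + WORD.pattern());

    /** Returns tokens formatted as term:start->end:type, like the output in the question. */
    static List<String> tokenize(Pattern rule, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = rule.matcher(text);
        while (m.find()) {
            // Label a token as an email only if it contains '@'.
            String type = m.group().indexOf('@') >= 0 ? "<EMAIL>" : "<ALPHANUM>";
            tokens.add(m.group() + ":" + m.start() + "->" + m.end() + ":" + type);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "i'll email you at x@abc.com";
        System.out.println("word rule only:   " + tokenize(WORD, text));
        System.out.println("email rule first: " + tokenize(EMAIL_OR_WORD, text));
    }
}
```

With the word rule alone the address comes out as x:18->19 and abc.com:20->27, matching the offsets reported in the question; with the email rule tried first, the last token is x@abc.com:18->27:<EMAIL>. In both cases "i'll" is an ordinary <ALPHANUM> token, in line with the remark that the apostrophe type would probably not come back.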



