"Michael D. Curtin" <mike@curtin.com> wrote on 07/06/2007 13:30:28:
> > I think it splits by hyphens unless the no-hyphen
> > part has digits, so:
> > np-pandock-a7
> > becomes
> > np
> > pandock-a7
> > This is for the indexing part.
>
> Wow! Do you know the thinking behind that, i.e. why a number in a
> hyphenated expression prevents the split?
I actually asked myself the same question before the previous
post. The javadocs for StandardAnalyzer only state the obvious
(a grammar-based tokenizer constructed with JavaCC....), and the
wiki page AnalysisParalysis doesn't explain much of the logic
behind it either.
From the StandardAnalyzer JavaCC grammar:
// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
<NUM: (<ALPHANUM> <P> <HAS_DIGIT> .... etc.
<#P: ("_"|"-"|"/"|"."|",") >
My understanding of this: a non-whitespace sequence is broken
at any of these 5 chars
_ - / . ,
unless the part that follows has a digit, in which case
it is assumed to be (part of) a serial no., model no., etc.
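To make that reading concrete, here is a rough sketch in plain Java. It is
not the actual JavaCC tokenizer and does not call Lucene at all; the class
and method names (HyphenSplitSketch, split) are made up for illustration,
and it only implements the simplified rule as I understand it, so the real
grammar may well behave differently on other inputs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: models the simplified rule described
// above, not the real StandardAnalyzer grammar.
class HyphenSplitSketch {
    // The five characters listed in the <#P> production.
    private static final String P = "_-/.,";

    private static boolean hasDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) return true;
        }
        return false;
    }

    // Splits one whitespace-free term: break at a P character unless the
    // segment that follows it contains a digit.
    static List<String> split(String term) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < term.length()) {
            char c = term.charAt(i);
            if (P.indexOf(c) < 0) {
                current.append(c);
                i++;
                continue;
            }
            // Locate the segment between this P character and the next one.
            int end = i + 1;
            while (end < term.length() && P.indexOf(term.charAt(end)) < 0) end++;
            String next = term.substring(i + 1, end);
            if (hasDigit(next) && current.length() > 0) {
                // Following segment has a digit: keep it attached.
                current.append(c).append(next);
            } else {
                // No digit: emit what we have and start a new token.
                if (current.length() > 0) tokens.add(current.toString());
                current.setLength(0);
                current.append(next);
            }
            i = end;
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("np-pandock-a7")); // [np, pandock-a7]
    }
}
```

With the example from earlier in the thread, "pandock" has no digit, so the
first hyphen splits, while "a7" has one, so the second hyphen is kept.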
Seems we can improve the documentation here.