lucene-java-user mailing list archives

From "Erick Erickson" <>
Subject Re: np-pandock search problem (again, with more detail)
Date Thu, 07 Jun 2007 22:30:31 GMT
Actually, my mind kind of overloaded when I read the following from
the (2.1)  javadoc....

   - Splits words at punctuation characters, removing punctuation.
   However, a dot that's not followed by whitespace is considered part of a
   token.
   - Splits words at hyphens, unless there's a number in the token, in
   which case the whole token is interpreted as a product number and is not
   split.
   - Recognizes email addresses and internet hostnames as one token.

All of which is just fine and I'm glad someone else wrote the grammar,
but I'm finding more and more that I'm constructing my own
analyzers/tokenizers instead since I'm in a specialized space, often
massaging the input stream outside the analyzer. For instance, is
O'Hara best tokenized as one, two, or three tokens? In genealogy,
it's best tokenized as ohara, which none of the standard analyzers
would treat "properly". As in "just the way I want it to be treated" <G>....
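To make the "massaging the input stream outside the analyzer" idea concrete, here is a minimal sketch in plain Java. It deliberately uses no Lucene API; the class name SurnameFolder and the method fold are hypothetical names invented for this illustration, and the rule it applies (drop apostrophes, then lowercase) is only an assumption about what a genealogy index might want.

```java
// Hypothetical helper, not part of Lucene: folds a surname before it
// reaches whatever analyzer is in use.
final class SurnameFolder {
    private SurnameFolder() {}

    /** Drops ASCII and typographic apostrophes, then lowercases the result. */
    static String fold(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            // Skip the plain apostrophe and the right single quotation mark.
            if (c != '\'' && c != '\u2019') {
                sb.append(c);
            }
        }
        // Locale.ROOT keeps lowercasing independent of the JVM's default locale.
        return sb.toString().toLowerCase(java.util.Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(fold("O'Hara"));
    }
}
```

Applying the same fold to both the indexed text and the query text keeps the two sides consistent, whichever analyzer runs afterwards.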


On 6/7/07, Michael D. Curtin <> wrote:
> Doron Cohen wrote:
> > From the StandardAnalyzer javacc grammar :
> >   // floating point, serial, model numbers, ip addresses, etc.
> >   // every other segment must have at least one digit
> >   <NUM: (<ALPHANUM> <P> <HAS_DIGIT> .... etc.
> >   <#P: ("_"|"-"|"/"|"."|",") >
> > My understanding of this: a non-whitespace sequence is broken
> > at any of these 5 chars
> >    _  -  /  .  ,
> > unless the part that follows has a digit, in which case
> > it is assumed to be (part of) a serial no., model, etc.
> Weird.  The definition seems to allow expressions of the form
> A-B-C-D-E-..., where
> -   "-" can be one of the five characters you mentioned
> -   the A, B, C, ... are alphanumeric pseudo-words
> -   A, C, E, ... or B, D, F, ... must have digits, i.e. alternating
>      digit components
> So "A-1-B-2" and "1-A-2-B" would be kept as single tokens, but "A-B-1-2"
> would not.  Seems more than a little hokey, but I suppose it's been
> working for a long time, for the most part.
> --MDC
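
The alternating-digit reading described above can be sketched as a small whole-string check in plain Java. This models only that reading of the rule, not the actual javacc grammar or the tokenizer's longest-match behavior; the class name AlternatingDigitRule and its methods are hypothetical names invented for this illustration.

```java
// Hypothetical model of the "alternating digit components" reading.
// It is not the Lucene grammar and uses no Lucene API.
final class AlternatingDigitRule {
    private AlternatingDigitRule() {}

    /**
     * True when the text consists of two or more alphanumeric segments joined
     * by one of _ - / . , and either every even-positioned segment or every
     * odd-positioned segment contains a digit.
     */
    static boolean matches(String text) {
        // Limit -1 keeps trailing empty segments so "A-1-" is rejected.
        String[] parts = text.split("[_\\-/.,]", -1);
        if (parts.length < 2) {
            return false;
        }
        boolean evenHaveDigits = true;
        boolean oddHaveDigits = true;
        for (int i = 0; i < parts.length; i++) {
            if (!isAlphanumeric(parts[i])) {
                return false;
            }
            boolean digit = hasDigit(parts[i]);
            if (i % 2 == 0) {
                evenHaveDigits &= digit;
            } else {
                oddHaveDigits &= digit;
            }
        }
        return evenHaveDigits || oddHaveDigits;
    }

    private static boolean isAlphanumeric(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isLetterOrDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    private static boolean hasDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"A-1-B-2", "1-A-2-B", "A-B-1-2"}) {
            System.out.println(s + " -> " + matches(s));
        }
    }
}
```

Under this model, "A-1-B-2" and "1-A-2-B" match and "A-B-1-2" does not, which agrees with the three examples in the quoted message.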
