lucene-java-user mailing list archives

Subject Re: Changing the Punctuation definition for StandardAnalyzer
Date Thu, 20 Dec 2007 21:32:16 GMT

I should have mentioned earlier that I am using Lucene 1.9.1.

In fact I had already located the grammar in StandardTokenizer.jj (I just
wasn't sure whether that was the one you were talking about) and had
commented out the EMAIL entries from all the following files:

But evidently the tokenizer still expected the email addresses to match one
of the other TOKEN types, and since they matched none of them, it threw a
ParseException.

Now what puzzles me is this: I don't see the '@' sign (Unicode value 0040)
included in "LETTER" or any other definition, so why is it not splitting
the words? It certainly isn't, which is why the tokenizer expects the email
address to be defined as a TYPE. My understanding, from looking at the
code, is that any character not defined in the grammar would act as a
splitter, since it does not contribute to any TOKEN definition.
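
To make that expectation concrete, here is a minimal sketch of the behavior
I had in mind. It is a toy model only, not Lucene's StandardTokenizer or
anything generated by JavaCC; the class name UndefinedCharSplitter and the
isTokenChar helper are made up for illustration, with
Character.isLetterOrDigit standing in for the grammar's LETTER and DIGIT
definitions. Whether the generated tokenizer actually behaves like this
depends on how the grammar handles input that matches no TOKEN.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the expectation described above. NOT Lucene code.
final class UndefinedCharSplitter {

    // Hypothetical stand-in for the grammar's LETTER / DIGIT classes.
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c);
    }

    // Any character that is not a token character ends the current token
    // and is itself discarded, i.e. it acts purely as a splitter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Under this model, tokenize("user@example.com") would give the three tokens
"user", "example" and "com", because '@' and '.' belong to no token class.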

Please let me know what I am missing.


> On 20 Dec 2007 at 20:21, wrote:
>> I would rather modify the lexer grammar. But where exactly is it
>> defined? After a quick look, it seems like
>> may be where it is being done.
> It can be generated with the Ant build.
> --
> karl
