lucene-java-user mailing list archives

From "MOYSE Gilles (Cetelem)" <>
Subject RE: Indexing UTF-8 and lexical errors
Date Tue, 14 Oct 2003 13:14:25 GMT

You should edit the StandardTokenizer.jj file. It contains all the
definitions used to generate the StandardTokenizer class.
At the end of the StandardTokenizer.jj file, you'll find the definition of
the LETTER token, which lists all the accepted letters in Unicode. If you
want a table of the different Unicode ranges, go there :
In the LETTER token definition in the .jj file, Unicode characters are coded
as ranges (like "\u0030"-"\u0039") or as single elements (like "\u00f1").
Adding the Arabic Unicode range in this part may solve your problem (add a
line like "\u0600"-"\u06FF", since 0600-06FF is the range for Arabic).
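To make the shape of that edit concrete, here is a rough sketch of what the LETTER production looks like in JavaCC grammar syntax. The ranges shown are illustrative, not the full list from the shipped grammar, and the exact token name may differ; only the last line is the Arabic addition:

```
TOKEN : {
  < #LETTER:
      [
        "\u0041"-"\u005a",   // A-Z, a range
        "\u0061"-"\u007a",   // a-z, a range
        "\u00f1",            // a single element
        "\u0600"-"\u06ff"    // added: the Arabic block
      ]
  >
}
```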

Once modified, go to the root of your Lucene installation and recompile the
StandardTokenizer.jj file with:
	ant compile
It should regenerate the Java files (and even compile them, if I remember
well).

Good Luck

Gilles Moyse

-----Original Message-----
From: []
Sent: Tuesday, 14 October 2003 12:07
To:
Subject: Indexing UTF-8 and lexical errors

I am trying to use Lucene to index UTF-8 encoded HTML files with content in
various languages. So far I always receive a message

"Parse Aborted: Lexical error at line 146, column 79.
Encountered: "\u2013" (8211), after : "" "

when trying to index files with Arabic words. I am aware that
tokenizing/analyzing/stemming non-Latin characters has some issues,
but for me tokenizing would be enough. And that should work with Arabic,
Russian, etc., shouldn't it?

So, what steps do I have to take to make Lucene index non-Latin
languages/characters encoded in UTF-8?
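One point worth checking alongside the tokenizer grammar: the Reader handed to Lucene must actually decode the bytes as UTF-8. A self-contained sketch (not from this thread; the class and file names are made up for illustration) showing an explicit UTF-8 decoder instead of a FileReader, which would use the platform default encoding and mangle multi-byte characters before the tokenizer ever sees them:

```java
import java.io.*;

public class Utf8Read {
    // Read an entire file as text, decoding the bytes explicitly as UTF-8.
    public static String readUtf8(File f) throws IOException {
        StringBuilder sb = new StringBuilder();
        // InputStreamReader with an explicit charset, not FileReader
        try (Reader r = new InputStreamReader(new FileInputStream(f), "UTF-8")) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write two Arabic letters (4 bytes in UTF-8) to a temp file...
        File tmp = File.createTempFile("utf8demo", ".txt");
        tmp.deleteOnExit();
        try (Writer w = new OutputStreamWriter(new FileOutputStream(tmp), "UTF-8")) {
            w.write("\u0627\u0628");
        }
        // ...and read them back as 2 characters, not 4 misdecoded bytes.
        System.out.println(readUtf8(tmp).length());
    }
}
```

A Reader built this way can then be passed on to whatever analyzer does the indexing.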

Thank you very much,
