lucene-java-user mailing list archives

From: Matthias Krueger
Subject: Indexing UTF-8 and lexical errors
Date: Tue, 14 Oct 2003 10:06:42 GMT

I am trying to use Lucene to index UTF-8 encoded HTML files with content
in various languages. So far I always receive the message

"Parse Aborted: Lexical error at line 146, column 79.
Encountered: "\u2013" (8211), after : "" "

when trying to index files with Arabic words. I am aware that
tokenizing, analyzing and stemming non-Latin characters has some issues,
but for me tokenizing alone would be enough. And that should work with
Arabic, Russian etc., shouldn't it?

So, what steps do I have to take to make Lucene index non-Latin
languages/characters encoded in UTF-8?

Thank you very much,
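
A note on the error message quoted above: the code point 8211 (U+2013) is
the en dash, a punctuation character, not an Arabic letter. The sketch below
is not Lucene code and is not claimed to be the fix for this error. It only
illustrates, in modern Java, decoding a byte stream explicitly as UTF-8
instead of relying on the platform default, so that the text handed to a
parser or analyzer contains the intended characters. The names
Utf8ReadSketch and readUtf8 are hypothetical, and StandardCharsets assumes
Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class Utf8ReadSketch {

    // Hypothetical helper: read the whole stream, decoding it as UTF-8
    // explicitly rather than with the platform default charset.
    public static String readUtf8(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // The three bytes E2 80 93 are the UTF-8 encoding of U+2013 (en dash).
        byte[] bytes = {(byte) 0xE2, (byte) 0x80, (byte) 0x93};
        String text = readUtf8(new ByteArrayInputStream(bytes));
        // prints "1 char(s), code point 8211"
        System.out.println(text.length() + " char(s), code point " + text.codePointAt(0));
    }
}
```

The decoded string could then be passed to whatever component does the
tokenizing. Whether that component accepts every character it receives is a
separate question from the decoding step shown here.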
