lucene-dev mailing list archives

Subject [Jakarta Lucene Wiki] Updated: IndexingOtherLanguages
Date Thu, 08 Jul 2004 13:30:01 GMT
   Date: 2004-07-08T06:30:01
   Editor: <>
   Wiki: Jakarta Lucene Wiki
   Page: IndexingOtherLanguages

   no comment

Change Log:

@@ -10,7 +10,7 @@
  1. Know the encoding of the documents you wish to index.  Java assumes the native encoding
when reading in files unless you tell it otherwise.  To create a Reader that supports reading
in other encodings, see [InputStreamReader].
I find it easiest to convert all of my files to UTF-8 before indexing,
and then I read them in by doing:[[BR]]
     `Reader reader = new InputStreamReader(new FileInputStream("path to file"), "UTF-8");`
-Note:  The demo supplied with Lucene does not support UTF-8 out of the box.  You will have
to modify it.
  2. Identify the Analyzer you will use or write your own if none exists.  There are many
great analyzers available that will index a wide variety of languages.  See [Sandbox]
for some.  Otherwise, look around the web.  If you are writing your own, consider
donating it to the Lucene Sandbox so that others can benefit from your brilliance.  See item
3. below for what is needed in a custom analyzer.
      'Put example of writing an Analyzer here'
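
As a rough illustration of item 1, the sketch below reads a whole file through an InputStreamReader with the encoding stated explicitly, so the platform default never comes into play. The class name `Utf8ReadSketch` and the method `readUtf8` are hypothetical names invented for this example and are not part of Lucene or the wiki page. It also assumes Java 7 or later for `StandardCharsets` and `java.nio.file`; on older JVMs the string form `"UTF-8"` shown in item 1 does the same job.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
public class Utf8ReadSketch {

    // Reads the entire file as text, decoding its bytes as UTF-8
    // instead of relying on the platform's default encoding.
    public static String readUtf8(Path path) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(path), StandardCharsets.UTF_8))) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Non-ASCII sample text, written with Unicode escapes so the source
        // file itself does not depend on any particular encoding.
        String original = "Gr\u00fc\u00dfe, \u4e16\u754c";
        Path tmp = Files.createTempFile("utf8-sketch", ".txt");
        try {
            Files.write(tmp, original.getBytes(StandardCharsets.UTF_8));
            String roundTripped = readUtf8(tmp);
            System.out.println(roundTripped.equals(original)); // prints "true"
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

In an indexing program the Reader would normally be handed to Lucene rather than drained into a String; the String here only makes the round trip easy to check.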
