lucene-dev mailing list archives

From Murray Altheim
Subject Re: encoding of german analyzer source files
Date Fri, 26 Nov 2004 22:29:18 GMT
Andi Vajda wrote:
>>I can tell the NetBeans-IDE the encoding of every single source file. But the 
>>problem is that I might not know which the correct encoding is. In case of 
>>Lucene it is quite clear because it is mentioned in the build.xml file. But 
>>what is the situation if someone sends you a stemmer class for example for 
>>Swahili and you do not know in which encoding the author wrote the source. 
>>Then you can try lots of encodings until the java compiler will be satisfied 
>>with it. And even then you might not be sure that you used the right 
>>Therefore it would be great if all Java programmers would agree on the same 
>>encoding of source files (let it be UTF-8, ISO-8859-1 or something really
> Actually, the reason for the change to utf-8 was that for Lucene to compile on 
> Windows with gcj (mingw), the encoding better be utf-8 because of the typical 
> absence of iconv facility there. Therefore, it would be safe to assume the 
> swahili stemmer source to also be encoded in utf-8.
> Andi..


It may seem pretty safe to assume from practice, but from the Java
programmer's point of view, it's still not. The Swahili file could
just as well be in UTF-8 or UTF-16, little-endian or big-endian, or
in some other encoding we don't even know about. A minor point I was
trying to make is that, absent some external mechanism, there's
really *no way* to know the encoding of a file. You can sniff the
first few bytes, which is what the XML 1.0 spec recommends (you can
see how they do it there), but acting on such a guess may lead to
program failure if the guess is incorrect.

   Extensible Markup Language (XML) 1.0 (Third Edition)
   Appendix F Autodetection of Character Encodings

The suggestions there are pretty usable for files that have nothing
to do with XML.
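
As a rough illustration of the byte sniffing mentioned above, here is a
minimal sketch in Java. The class name BomSniffer and the method sniff
are made up for this example; they are not part of Lucene or the JDK.
It only recognizes a leading byte-order mark, which is a small subset
of what Appendix F describes, and a null result means "no mark found",
not "this file is ASCII" or anything else.

```java
/** Hypothetical helper: guesses an encoding name from a leading byte-order mark. */
final class BomSniffer {

    /**
     * Returns the encoding name implied by a byte-order mark at the start
     * of head, or null if no known mark is present.
     */
    static String sniff(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        // Check UTF-32 before UTF-16: FF FE 00 00 also begins with the UTF-16LE mark.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null; // no mark: the encoding is still unknown
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

One would pass in the first four bytes of the source file and treat the
answer as a hint only. Most UTF-8 and ISO-8859-1 source files carry no
mark at all, and FF FE 00 00 could also be UTF-16LE text that starts
with a NUL character, so this narrows the guess without removing it.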

I don't know how many people on this list are familiar with
O'Reilly's "CJKV Information Processing" (with the puffer fish on
the cover), which opened up my eyes to a new world. After reading
it I got a terrible fright and couldn't sleep for weeks.

   "CJKV Information Processing: Chinese, Japanese, Korean
      & Vietnamese Computing", by Ken Lunde, O'Reilly Publishing.


Murray Altheim          
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK

   [International Committee of the Red Cross director] Kraehenbuhl
   pointed out that complying with international humanitarian law
   was "an obligation, not an option", for all sides of the conflict.
   "If these rules or any other applicable rules of international
   humanitarian law are violated, the persons responsible must be
   held accountable for their actions," he said. -- BBC News

  "In my judgment, this new paradigm [the War on Terror] renders
   obsolete Geneva's strict limitations on questioning of enemy
   prisoners and renders quaint some of its provisions [...]
   Your determination [that the Geneva Conventions] does not apply
   would create a reasonable basis in law that [the War Crimes Act]
   does not apply, which would provide a solid defense to any future
   prosecution." -- Alberto Gonzales, appointed US Attorney General,
   and likely Supreme Court nominee, in a memo to George W. Bush
