lucene-dev mailing list archives

From Murray Altheim
Subject Re: encoding of german analyzer source files
Date Fri, 26 Nov 2004 13:13:06 GMT
Stefan Wachter wrote:
> Hi Daniel,
> I am using NetBeans 3.6 which certainly is unicode aware. Yet, NetBeans 
> seems not to detect that the source files of Lucene are UTF-8 encoded 
> automatically. I guess that it uses the platform specific default 
> encoding which is ISO-8859-1 for my Linux operating system.

In Linux you can set the default encoding at the platform level,
at the user level, and for individual applications. You're not
forced to stay within ISO-8859-1. Think about it this way: if you
were, a multi-user system like Linux could only ever support one
encoding, which it plainly doesn't. This sounds more like a NetBeans
problem than an OS problem. I don't use NetBeans, but there must be
a way to indicate the encoding beyond what your particular user
settings are.
Otherwise, English programmers couldn't develop non-English programs,
which is hard to believe.

> I think what Java lacks is a means to indicate the encoding of source 
> files (e.g. <?java encoding="ISO-8859-1"?> in a XMLish way). The 
> encoding has to be fed into the system from the outside. What else could 
> be the reason for having an encoding switch to the java compiler? 
> Therefore I think it is best to have Java source files to be plain ASCII.

Java has quite a lot of localization features built into the
language. Yes, the encoding has to be specified, just as one
would have to tell any processor how to decode any given set
of bytes. Java itself is Unicode aware for anything dealing
with characters. For dealing with byte streams the encoding
has to be specified. Here's a good article on the subject:
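
To make the byte-stream point concrete, here is a minimal sketch.
It is not taken from the Lucene sources; the class and method names
are made up for illustration, and java.nio.charset.StandardCharsets
assumes Java 7 or later. It decodes the same four bytes twice, once
as UTF-8 and once as ISO-8859-1:

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Hypothetical helpers: decode the same bytes under two encodings.
    static String decodeUtf8(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    static String decodeLatin1(byte[] b) {
        return new String(b, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The German word "fuer" with a u-umlaut, encoded as UTF-8:
        // the umlaut (U+00FC) occupies two bytes, 0xC3 0xBC.
        byte[] bytes = {0x66, (byte) 0xC3, (byte) 0xBC, 0x72};

        // Three characters when the bytes are read as UTF-8 ...
        System.out.println(decodeUtf8(bytes).length());   // prints 3
        // ... but four when the same bytes are read as ISO-8859-1,
        // because each byte is then taken as one character.
        System.out.println(decodeLatin1(bytes).length()); // prints 4
    }
}
```

The bytes are identical in both calls; only the declared encoding
differs. An editor that assumes the platform default is in the same
position as the second call.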

As for crippling files by forcing them into plain ASCII, why
would we want to step back 20 years in computer science? It's
been a long-fought battle to get to where we are now, and the
desires of a few people to be able to look at a file in ASCII
are far outweighed by the rest of the world, whose languages
don't fit into that straitjacket. As was mentioned, it would
make the code a great deal harder to both read and manage.

I remember looking at a desktop publishing application
developed at StoneHand in 1996 that had Arabic, Gujarati,
Japanese, Chinese, English, and Hebrew on the screen at the
same time and thinking damn! pretty impressive! We now have
that kind of thing in our browsers and think little of it.
I'd hate to step back to pre-1996 again.

We should all be using Unicode-aware tools. It's what the rest
of the world is doing, even in the Anglocentric US. For an
international project like Lucene, there's no good reason to
step back in time to ASCII. There are many programmers using
the Lucene source code who have no problem with Unicode, and
it would not be in their interest to suddenly be reading
numeric character escapes rather than normally readable text.
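
As a rough illustration of what ASCII-only source would look like,
here is a small sketch that rewrites every non-ASCII character as a
Java-style backslash-u escape. The class and method names are
hypothetical, and the sample word is my own choice rather than
anything from the analyzer sources:

```java
public class AsciiEscapeDemo {
    // Replace each char outside 7-bit ASCII with a backslash-u escape
    // followed by four hex digits, the form Java source accepts.
    static String toAsciiEscapes(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // German "Groesse" (size) with o-umlaut and sharp s. It is
        // written with escapes here only so that this sketch compiles
        // identically under any source encoding.
        String word = "Gr\u00f6\u00dfe";
        System.out.println(toAsciiEscapes(word));
    }
}
```

A five-letter word turns into fifteen characters of ASCII, two
thirds of them escape codes. Multiply that by every stop word and
stemming rule in an analyzer and the readability cost is obvious.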


Murray Altheim          
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK

    [International Committee of the Red Cross director] Kraehenbuhl
    pointed out that complying with international humanitarian law
    was "an obligation, not an option", for all sides of the conflict.
    "If these rules or any other applicable rules of international
    humanitarian law are violated, the persons responsible must be
    held accountable for their actions," he said. -- BBC News

   "In my judgment, this new paradigm [the War on Terror] renders
    obsolete Geneva's strict limitations on questioning of enemy
    prisoners and renders quaint some of its provisions [...]
    Your determination [that the Geneva Conventions] does not apply
    would create a reasonable basis in law that [the War Crimes Act]
    does not apply, which would provide a solid defense to any future
    prosecution." -- Alberto Gonzales, appointed US Attorney General,
    and likely Supreme Court nominee, in a memo to George W. Bush
