lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Øie <>
Subject RE: scandinavian characters.
Date Tue, 27 Nov 2001 14:20:58 GMT
that is a good step on the way, I added;

| <#_ALPHANUM_CHAR: [ "a"-"z", "A"-"Z", "0"-"9", "\u0080"-"\uFFFE" ] >

into the "QueryParser.jj" and now it doesn't create any exceptions on øæå,
but it's still translated into &auml; ?!?

the strange thing is that the cvs version actually already has this into
it's code.. perhaps I should try a full rebuild from the cvs version...

could you send me your "QueryParser.jj" so i could have a look at it?

btw: thanks for the tips!

mvh karl øie

-----Original Message-----
From: Jonas Bechlund []
Sent: 27. november 2001 13:52
To: 'Lucene Users List'
Subject: RE: scandinavian characters.

Hi Karl,

It is a little bit tricky - but when you get the idea it is not that bad...

I had the same problem with the danish characters. I made changes TOKEN
definition in the "Token Definitions" section of the file "QueryParser.jj"
and that actually solved the problem. One minor detail is that you have to
rebuild the jar file with ANT. (See build.txt for instructions)

I guess that solves your problem,
/ Jonas

-----Original Message-----
From: Karl Øie []
Sent: 27 November 2001 13:01
To: Lucene Users List
Subject: RE: scandinavian characters.

there must be something seriously broken with the queryparse code.

if a query starts with ø/æ/å (&oslash;, &oaelig;, &aring;) then an exception
in the queryparser occurs.

org.apache.lucene.queryParser.TokenMgrError: Lexical error at line 1, column
1.  Encountered: "\u00c3" (195), after : ""
	at org.apache.lucene.queryParser.QueryParser.jj_ntk(Unknown Source)
	at org.apache.lucene.queryParser.QueryParser.Modifiers(Unknown
	at org.apache.lucene.queryParser.QueryParser.Query(Unknown Source)
	at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
	at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)

but if the query contains ø/æ/å (&oslash;, &oaelig;, &aring;) then it is
translated wrongly into the swedish/german &auml; regardless of what
character it was.

if someone could point me to where to start I could try to find the problem
because I guess it is errorous unicode translation...

mvh karl

>no it's even stranger than that, i have decoded the querystring, the
>is that it seems like something is changed on the way in. if i search for
>"fjøs" (fj&oslash;s) i get the swedish "fjä" (fj&Auml;). Where &oslash;
>changed to &Auml; and 's' is removed.
>is the querystring translated some where?
>mvh karl øie
>  -----Original Message-----
>  From: David Bonilla []
>  Sent: 27. november 2001 10:43
>  To: Lucene Users List;
>  Subject: Re: scandinavian characters.
>  Hi Karl !!!
>  I´m spanish and I have a lot of problems programming with our not english
>characters. I use LUCENE with spanish accents and it works fine...
>  Have you tried to use the and
>your fields to index ?
>  Best Regards from Spain !
>  __________________________
>  David Bonilla Fuertes
>  Profesor Waksman, 8, 6º B
>  28036 Madrid
>  Tel.: (+34) 914 577 747
>  Móvil: 656 62 83 92
>  Fax: (+34) 914 586 176
>  __________________________

To unsubscribe, e-mail:
For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

To unsubscribe, e-mail:   <>
For additional commands, e-mail: <>

View raw message