lucene-solr-user mailing list archives

From WHIRLYCOTT <p...@whirlycott.com>
Subject Re: Cyrillic characters
Date Tue, 18 Jul 2006 22:09:00 GMT
Crap, you're right.  I have a well-tested application that uses
UTF-8 everywhere possible, and I just tested it with some Russian text.
Solr coughs up this exception:

Jul 18, 2006 6:00:05 PM org.apache.solr.core.SolrException log
SEVERE: java.lang.ArrayIndexOutOfBoundsException: 1
         at org.apache.solr.search.QueryParsing.parseSort(QueryParsing.java:141)
         at org.apache.solr.request.StandardRequestHandler.handleRequest(StandardRequestHandler.java:96)
         at org.apache.solr.core.SolrCore.execute(SolrCore.java:592)
         at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:94)
         at javax.servlet.http.HttpServlet.service(HttpServlet.java:596)
         at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
         at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
         at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473)
         at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:568)
         at org.mortbay.http.HttpContext.handle(HttpContext.java:1530)
         at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:633)
         at org.mortbay.http.HttpContext.handle(HttpContext.java:1482)
         at org.mortbay.http.HttpServer.service(HttpServer.java:909)
         at org.mortbay.http.HttpConnection.service(HttpConnection.java:820)
         at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:986)
         at org.mortbay.http.HttpConnection.handle(HttpConnection.java:837)
         at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:245)
         at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
         at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

You're going directly against Solr/Jetty, right?  Not proxied or  
mod_rewrite'd through to Apache?

Solr isn't properly decoding the data being received by the servlet.
I think that I can fix this using some of the tricks that I've
learned in building my site.  More later.
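
To illustrate the kind of failure described here, a minimal Python sketch (not Solr code) follows. It assumes, without confirmation from this thread, that the servlet container falls back to ISO-8859-1 when it percent-decodes the query string; the helper name decode_query is made up for illustration.

```python
from urllib.parse import unquote

# UTF-8 percent-encoded form of the sample query quoted in this thread.
RAW_QUERY = "%D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0"


def decode_query(raw: str, charset: str) -> str:
    """Percent-decode `raw` and interpret the resulting bytes as `charset`.

    Hypothetical helper: it only mimics what a server does when it picks
    a character set for incoming request parameters.
    """
    return unquote(raw, encoding=charset, errors="strict")


# Decoding with the charset the client actually used gives 6 characters.
as_utf8 = decode_query(RAW_QUERY, "utf-8")

# ASSUMPTION: the server decodes with ISO-8859-1 instead. Every byte then
# becomes its own character, so the 12 UTF-8 bytes turn into 12 characters.
as_latin1 = decode_query(RAW_QUERY, "iso-8859-1")

print(len(as_utf8))    # 6
print(len(as_latin1))  # 12

# The bytes are intact, so re-encoding and decoding as UTF-8 recovers the text.
repaired = as_latin1.encode("iso-8859-1").decode("utf-8")
print(repaired == as_utf8)  # True
```

If this is what is happening, the fix belongs on the server side: the request has to be decoded as UTF-8 before any parameter is read.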

How much testing have people done using UTF-8 data on Solr?

phil.



On Jul 18, 2006, at 5:53 PM, Tricia Williams wrote:

> Hi all,
>
>    I'm trying to adapt our old Cocoon/Lucene-based web search
> application to one that is more Solr-ish.  Our old web app could
> handle queries containing Cyrillic characters.  With the packaged
> example admin interface, entering a query with a string of Cyrillic
> characters causes a java.lang.ArrayIndexOutOfBoundsException.  I've
> also noted that the URL built from the search form is not UTF-8
> encoded.  If I try to work around this by inserting a UTF-8 encoded
> string in the q= parameter myself, the values are interpreted
> incorrectly, so that approach doesn't work either.  My sample query
> is: Канада (the English word _Canada_ translated into Russian), or
> %D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0 (UTF-8), or
> %26%231050%3B%26%231072%3B%26%231085%3B%26%231072%3B%26%231076%3B%26%231072%3B
> (Solr URL encoding)
>
>    I would appreciate any advice or suggestions that would allow me
> to search for Cyrillic text in Solr.  If anyone knows why Solr
> behaves as it does with the strange encoding, a brief explanation
> of what causes this behaviour, and of what the encoding is
> (Unicode?), would be helpful.  If anyone else has forced Solr to
> accept UTF-8 encoded q= parameters with success, I would love to
> know how you did it.
>
> Thanks in advance!
> Tricia
>
> ps.  I am using Mozilla Firefox as my main browser, which leads to
> the behaviour I reported above.  IE 6.0 works fine for Cyrillic,
> although there is still a strange but different encoding
> (%CA%E0%ED%E0%E4%E0 for the same query as before).
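
The three encoded forms quoted above can be decoded side by side. This is a minimal Python sketch. Reading the "Solr URL encoding" as percent-encoded HTML numeric character references, and the IE 6.0 form as windows-1251, is inferred from the bytes themselves and is not stated anywhere in the thread.

```python
from html import unescape
from urllib.parse import unquote

# The three query strings from the message, rejoined where the mail
# client wrapped them.
UTF8_FORM = "%D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0"
ENTITY_FORM = (
    "%26%231050%3B%26%231072%3B%26%231085%3B"
    "%26%231072%3B%26%231076%3B%26%231072%3B"
)
IE_FORM = "%CA%E0%ED%E0%E4%E0"

# 1. Plain UTF-8 bytes, percent-encoded.
utf8_text = unquote(UTF8_FORM, encoding="utf-8")

# 2. Percent-decoding yields only ASCII: "&#1050;&#1072;...", i.e. one
#    HTML numeric character reference per letter. A second step turns
#    the references into characters.
entity_markup = unquote(ENTITY_FORM)
entity_text = unescape(entity_markup)

# 3. ASSUMPTION: single-byte windows-1251, one byte per Cyrillic letter.
ie_text = unquote(IE_FORM, encoding="windows-1251")

print(entity_markup)             # &#1050;&#1072;&#1085;&#1072;&#1076;&#1072;
print(utf8_text == entity_text)  # True
print(utf8_text == ie_text)      # True
print(entity_markup.count(";"))  # 6
```

All three decode to the same six-letter word, so the browsers differ only in how they encode the form value. Note that the second form puts six literal semicolons into the q= value once it is percent-decoded. Whether those semicolons are what sends the request into QueryParsing.parseSort in the trace is a guess, not something confirmed in this thread.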


--
                                    Whirlycott
                                    Philip Jacob
                                    phil@whirlycott.com
                                    http://www.whirlycott.com/phil/


