lucene-dev mailing list archives

From "Bernd Fehling (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-2381) The included jetty server does not support UTF-8
Date Tue, 08 Mar 2011 15:01:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13003980#comment-13003980 ]

Bernd Fehling commented on SOLR-2381:
-------------------------------------

Robert, unfortunately I wasn't able to build a reproducible test, so I decided to debug it
on my server.
The bug is in Jetty and has been fixed in jetty-7.3.1.v20110307.
Because I started debugging over the weekend, I used the older jetty-7.3.0, which still
contains the bug. I located the bug and realized today that it had just been fixed in the
new version released yesterday.

Nevertheless, here is the description, because I went through all the bits and bytes.
In jetty-7, the jetty-server module contains org.eclipse.jetty.server.HttpWriter (HttpWriter.java).
This is the output writer, which extends Writer and does the UTF-8 encoding.
The buffer arrives with a size of 8192 bytes and is encoded by HttpWriter in chunks of
512 bytes.
Inside Java the text is UTF-16, and each code unit is read as an integer. If the character
is above the BMP, it is stored as a surrogate pair: the first surrogate is read first, and
the next integer after it.
Example: 55349 (dec) and 56320 (dec) are converted to 119808 (dec), which is U+1D400.
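
The arithmetic behind that example can be checked with a few lines of Java. This is only
an illustration of the standard UTF-16 surrogate-pair formula, not code taken from Jetty;
the class and method names (SurrogateDemo, toCodePoint) are made up for this sketch.

```java
// Illustration only: SurrogateDemo and toCodePoint are hypothetical names,
// not part of Jetty or Solr.
class SurrogateDemo {

    // Combine a UTF-16 high/low surrogate pair into one Unicode code point.
    static int toCodePoint(int high, int low) {
        // High surrogates start at 0xD800, low surrogates at 0xDC00.
        // Each carries 10 bits; the result is offset by 0x10000.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int cp = toCodePoint(55349, 56320);
        // Prints: 119808 = U+1D400
        System.out.println(cp + " = U+" + Integer.toHexString(cp).toUpperCase());
        // The JDK helper gives the same result for the same pair.
        System.out.println(Character.toCodePoint((char) 55349, (char) 56320) == cp);
    }
}
```

55349 is 0xD835 and 56320 is 0xDC00, so the result is 0x10000 + (0x35 << 10) + 0 = 0x1D400.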

Remember that the buffer has a size of 512 bytes. But what if the counter is at 510 and a
Unicode character above the BMP comes up? The solution is to write the current buffer to
the output, reset it, and start over with an empty buffer. And here is/was the bug.
The "surrogate reminder" was cleared too early, in the wrong place, and got lost.
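
To make the buffer-boundary case concrete, here is a minimal sketch of a chunked UTF-8
encoder that keeps the pending surrogate across the flush. It is not the Jetty HttpWriter
source; the class name, field names and structure are assumptions for illustration, and
it assumes well-formed UTF-16 input (no unpaired surrogates).

```java
import java.io.ByteArrayOutputStream;

// Sketch only: ChunkedUtf8Sketch is a hypothetical class, not Jetty code.
// Assumes well-formed UTF-16 input (every high surrogate is followed by a low one).
class ChunkedUtf8Sketch {
    static final int BUFFER_SIZE = 512; // chunk size described in the text

    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[BUFFER_SIZE];
        int count = 0;
        int pendingHigh = 0; // remembered high surrogate, 0 if none

        for (int i = 0; i < s.length(); i++) {
            int code = s.charAt(i);
            if (pendingHigh != 0) {
                // Second half of a pair: combine, and only now clear the state.
                code = Character.toCodePoint((char) pendingHigh, (char) code);
                pendingHigh = 0;
            } else if (Character.isHighSurrogate((char) code)) {
                pendingHigh = code; // remember it and wait for the low surrogate
                continue;
            }

            int needed = code < 0x80 ? 1 : code < 0x800 ? 2 : code < 0x10000 ? 3 : 4;
            if (count + needed > BUFFER_SIZE) {
                // Not enough room (e.g. counter at 510, 4 bytes needed):
                // flush and start over with an empty buffer. The combined code
                // point in `code` must survive this reset.
                out.write(buf, 0, count);
                count = 0;
            }

            if (needed == 1) {
                buf[count++] = (byte) code;
            } else if (needed == 2) {
                buf[count++] = (byte) (0xC0 | (code >> 6));
                buf[count++] = (byte) (0x80 | (code & 0x3F));
            } else if (needed == 3) {
                buf[count++] = (byte) (0xE0 | (code >> 12));
                buf[count++] = (byte) (0x80 | ((code >> 6) & 0x3F));
                buf[count++] = (byte) (0x80 | (code & 0x3F));
            } else {
                buf[count++] = (byte) (0xF0 | (code >> 18));
                buf[count++] = (byte) (0x80 | ((code >> 12) & 0x3F));
                buf[count++] = (byte) (0x80 | ((code >> 6) & 0x3F));
                buf[count++] = (byte) (0x80 | (code & 0x3F));
            }
        }
        out.write(buf, 0, count); // final partial chunk
        return out.toByteArray();
    }
}
```

The point of the sketch is the ordering: the surrogate state is cleared only after the
pair has been combined, and the flush at the chunk boundary touches nothing but the byte
counter. A string of 510 ASCII characters followed by U+1D400 exercises exactly that path.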

If I find an svn repository with the jetty-6.1.26 sources, I will look into that one as well.
Otherwise, use jetty-7.3.1.v20110307, which is fixed.

Maybe we should set up an XML page for testing that contains more than 512 characters
above the BMP in a row, encoded as UTF-8?
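
Such a page could be generated along these lines. This is just a sketch: the class name,
the <doc> element and the choice of U+1D400 as the repeated character are assumptions,
not an existing Solr test resource.

```java
// Sketch only: AboveBmpTestPage and the <doc> element are hypothetical.
class AboveBmpTestPage {

    // Build an XML document whose body is `count` consecutive copies of
    // U+1D400, a character above the BMP (4 bytes each in UTF-8).
    static String build(int count) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<doc>");
        for (int i = 0; i < count; i++) {
            sb.appendCodePoint(0x1D400);
        }
        sb.append("</doc>\n");
        return sb.toString();
    }
}
```

With a count of 600, for example, the run of supplementary characters is 2400 bytes long
in UTF-8, so it crosses several 512-byte chunk boundaries.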


> The included jetty server does not support UTF-8
> ------------------------------------------------
>
>                 Key: SOLR-2381
>                 URL: https://issues.apache.org/jira/browse/SOLR-2381
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Blocker
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2381.patch, SOLR-ServletOutputWriter.patch, jetty-6.1.26-patched-JETTY-1340.jar, jetty-util-6.1.26-patched-JETTY-1340.jar
>
>
> Some background here: http://www.lucidimagination.com/search/document/6babe83bd4a98b64/which_unicode_version_is_supported_with_lucene
> Some possible solutions:
> * wait and see if we get resolution on http://jira.codehaus.org/browse/JETTY-1340. To be honest, I am not even sure where jetty is being maintained (there is a separate jetty project at eclipse.org with another bugtracker, but the older releases are at codehaus).
> * include a patched version of jetty with correct utf-8, using that patch.
> * remove jetty and include a different container instead.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

