abdera-user mailing list archives

From Chris Berry <chriswbe...@gmail.com>
Subject Re: Invalid byte 2 of 3-byte UTF-8 sequence.
Date Wed, 05 Sep 2007 19:14:13 GMT

We figured it out. AFAICT, both my issue and Herbert's are the same.
I believe this is a bug in Abdera.

There are actually two issues:

First, Abdera uses HttpClient's


in order to obtain a raw stream of bytes for Woodstox (which is the
correct thing to do for performance).

But Woodstox does NOT assume UTF-8.  So it fails when parsing valid  
UTF-8 characters.

The fix is to change the following line in AbstractClientResponse:

   public <T extends Element> Document<T> getDocument(Parser parser,
       ParserOptions options) throws ParseException {
     try {
       // Document<T> doc = parser.parse(getInputStream(), base, options);
       Document<T> doc = parser.parse(getReader(), base, options);

And to add the following method to AbstractClientResponse:

   public java.io.Reader getReader() throws java.io.IOException {
     String header = getHeader("Content-Type");

     String type = "UTF-8"; // default to UTF-8
     java.util.regex.Matcher matcher = java.util.regex.Pattern.compile 
     if (matcher.matches()) {
       type = matcher.group(1);
       System.out.println("@@@@@@@@@@@@@@@@@@@@@@ type = " + type); // debug output
     }

     return new java.io.InputStreamReader(getInputStream(), type);
   }

Although, there is likely a cleaner way to get the "charset" param in  
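
For illustration only, here is a minimal standalone sketch of the same idea: pull the "charset" parameter out of a Content-Type value and fall back to UTF-8 when it is absent or unknown. The class name, method names, and the regular expression are my own assumptions, not Abdera code, and StandardCharsets needs Java 7 or later.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Abdera or HttpClient.
final class CharsetSketch {

    // Assumed pattern: finds charset=value, optionally quoted, ignoring case.
    private static final Pattern CHARSET = Pattern.compile(
        "charset\\s*=\\s*\"?([^\\s;\"]+)\"?", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the Content-Type value, or UTF-8 by default.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Malformed or unsupported charset name: use the default below.
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Wraps the raw byte stream in a Reader that decodes with that charset.
    static Reader readerFor(InputStream in, String contentType) {
        return new InputStreamReader(in, charsetOf(contentType));
    }
}
```

Using find() rather than matches() means the pattern does not have to describe the whole header value, only the charset parameter.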

Second, Abdera is NOT adding the "charset" parameter (e.g.
"; charset=utf-8") to the Content-Type HTTP header of the response.

So a fix might be to change the following line in BaseResponseContext:

   public BaseResponseContext(T base, boolean chunked) {
     this.base = base;
     this.chunked = chunked;
     try {

       //  setContentType(getContentType().toString());
       setContentType(getContentType().toString() + "; charset=utf-8");

     } catch (Exception e) {}

Although there are likely better ways/places to accomplish this  
within Abdera.
Perhaps I need to set this in my SpringAbderaServlet?
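
Wherever it ends up living, the append should probably be guarded so that a Content-Type which already carries a charset is left alone. A rough sketch of that check follows; the class and method names are hypothetical and nothing here is Abdera API.

```java
import java.util.Locale;

// Hypothetical helper, not part of Abdera.
final class ContentTypeSketch {

    // Appends "; charset=utf-8" unless a charset parameter is already present.
    static String withUtf8Charset(String contentType) {
        if (contentType == null || contentType.trim().isEmpty()) {
            return contentType; // nothing sensible to append to
        }
        // The first segment is the media type itself; the rest are parameters.
        for (String param : contentType.split(";")) {
            if (param.trim().toLowerCase(Locale.ROOT).startsWith("charset=")) {
                return contentType; // keep the existing charset untouched
            }
        }
        return contentType.trim() + "; charset=utf-8";
    }
}
```

With this, the changed line above would become something like setContentType(ContentTypeSketch.withUtf8Charset(getContentType().toString())), assuming getContentType() never returns null at that point.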

I will add these details to the JIRA as well.
-- Chris 

On Sep 5, 2007, at 11:53 AM, James M Snell wrote:

> Hmmm... how odd.  Ok, let me explore a bit further.
> - James
> herbert wrote:
>> Hi!
>> I've already tried that before.
>> Using the escape sequence \u00e4 also does *not* work.
>> Herbert

S'all good  ---   chriswberry at gmail dot com
