tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: AutoDetectParser is not parsing UTF-16 content types
Date Thu, 30 Aug 2012 21:39:16 GMT

On Aug 29, 2012, at 8:55am, chraj007 wrote:

> Hello,
>   Im trying to parse a file whose content type is UTF-16. Im unable to
> parse the document using the following code. Please Help me.
>       ContentHandler textHandler = new BodyContentHandler();
>        TeeContentHandler teeHandler 		= 	 new
> TeeContentHandler(textHandler);
>        parser.parse(input, teeHandler, metadata, context);      

Note that you don't need to use a TeeContentHandler here. With only one handler, you can pass
textHandler directly to parser.parse().

>        String tt = textHandler.toString();
> //to print the text
> byte[] converttoBytes = tt.getBytes("UTF-16");
>        String string = new String(converttoBytes, "utf-8");

The above code won't do what I think you're hoping it will do.

The call to getBytes("UTF-16") will return the contents of the tt string as bytes encoded using UTF-16.

The second call says to generate a string from bytes that are character data encoded using
UTF-8 (which isn't true, since you just encoded them as UTF-16), so the resulting string will be garbled.
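
To make the mismatch concrete, here is a minimal sketch that uses only the JDK. The class and method names are made up for illustration; mismatched() repeats the two calls from the quoted code, and matched() decodes with the same charset that was used for encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class name, for illustration only.
class EncodingRoundTrip {

    // Same pattern as the quoted code: encode as UTF-16, then decode as UTF-8.
    static String mismatched(String text) {
        byte[] utf16Bytes = text.getBytes(StandardCharsets.UTF_16);
        return new String(utf16Bytes, StandardCharsets.UTF_8);
    }

    // Encode and decode with the same charset, which round-trips cleanly.
    static String matched(String text) {
        byte[] utf16Bytes = text.getBytes(StandardCharsets.UTF_16);
        return new String(utf16Bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String text = "Hello";
        System.out.println(text.equals(mismatched(text))); // prints false
        System.out.println(text.equals(matched(text)));    // prints true
    }
}
```

Since tt is already a Java String, there is no need to re-encode it before printing; System.out.println(tt) is enough.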

>       System.out.println(string);
> but its printing along with all html tags.

I'm unclear on what you mean by this.

But as Jukka noted in his response, the issue is that you have a document which is encoded
as UTF-8, but the HTML declares <meta http-equiv="Content-Type" content="text/html; charset=UTF-16">.

Currently Tika treats this meta tag charset as the truth. See https://issues.apache.org/jira/browse/TIKA-539
for a discussion on this issue.
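
As a rough illustration of why a wrong charset declaration matters, the sketch below decodes UTF-8 bytes once with the declared charset (UTF-16) and once with the actual one (UTF-8). It uses only the JDK and is not Tika's detection code; the class name, method name and sample markup are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, for illustration only.
class MislabelledCharset {

    // Decode raw bytes using whatever charset the caller believes applies.
    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "<p>Hello</p>";
        // The bytes on disk are really UTF-8.
        byte[] actualBytes = original.getBytes(StandardCharsets.UTF_8);

        // Trusting a charset=UTF-16 declaration pairs up the bytes into unrelated characters.
        String trusted = decodeAs(actualBytes, StandardCharsets.UTF_16);
        // Decoding with the real encoding recovers the text.
        String correct = decodeAs(actualBytes, StandardCharsets.UTF_8);

        System.out.println(original.equals(trusted)); // prints false
        System.out.println(original.equals(correct)); // prints true
    }
}
```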


-- Ken

Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
