tika-dev mailing list archives

From Dominique Béjean (JIRA) <j...@apache.org>
Subject [jira] Issue Comment Edited: (TIKA-517) java.io.UnsupportedEncodingException with Russian, Chinese, ... document
Date Tue, 02 Nov 2010 12:04:25 GMT

    [ https://issues.apache.org/jira/browse/TIKA-517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927367#action_12927367 ]

Dominique Béjean edited comment on TIKA-517 at 11/2/10 8:02 AM:
----------------------------------------------------------------

Hi,

Thank you for these replies.

To provide a sample of my code, I ran some tests, and I can no longer reproduce the
issue.

My code looks like this:

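import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.xml.serialize.BaseMarkupSerializer;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.TextSerializer;

InputStream input = new FileInputStream("russian.pdf");
String contentType = "application/pdf";
String outputEncoding = "UTF-8";

// Register the parser in the context so embedded documents are parsed too
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
context.set(Parser.class, parser);

Metadata metadata = new Metadata();
metadata.add("stream_content_type", contentType);
StringWriter writer = new StringWriter();
// The Xerces text serializer receives the SAX events and writes plain text
BaseMarkupSerializer serializer = new TextSerializer();
serializer.setOutputCharStream(writer);
serializer.setOutputFormat(new OutputFormat("text", outputEncoding, true));
try {
    parser.parse(input, serializer, metadata, context);
} finally {
    input.close(); // always release the file handle
}
writer.close();

String content = writer.toString();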
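
The trace in the issue description reports a java.io.UnsupportedEncodingException with an
empty message, raised from BaseMarkupSerializer.startDocument. One possibility worth ruling
out (an assumption, not something confirmed in this thread) is that the encoding name handed
to OutputFormat was empty or not supported by the JVM in the failing runs. A minimal sketch
of such a guard, using only JDK classes; EncodingCheck and resolveEncoding are hypothetical
names and are not part of Tika or Xerces:

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

class EncodingCheck {

    // Hypothetical helper: returns the requested encoding name if the JVM
    // supports it, otherwise the fallback supplied by the caller.
    static String resolveEncoding(String requested, String fallback) {
        if (requested == null || requested.trim().isEmpty()) {
            return fallback; // nothing usable was supplied
        }
        String name = requested.trim();
        try {
            return Charset.isSupported(name) ? name : fallback;
        } catch (IllegalCharsetNameException e) {
            return fallback; // the name contains characters illegal in a charset name
        }
    }

    public static void main(String[] args) {
        // Print what would be handed to the serializer for a few inputs
        System.out.println(resolveEncoding("UTF-8", "UTF-8"));
        System.out.println(resolveEncoding("", "UTF-8"));
        System.out.println(resolveEncoding(null, "UTF-8"));
    }
}
```

The value returned by resolveEncoding could then be passed as the second argument of the
OutputFormat constructor in place of the raw encoding string.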


If I reproduce the problem later, I will provide details.

Dominique

  
> java.io.UnsupportedEncodingException with Russian, Chinese, ... document
> ------------------------------------------------------------------------
>
>                 Key: TIKA-517
>                 URL: https://issues.apache.org/jira/browse/TIKA-517
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Macosx, Java 6, Eclipse
>            Reporter: Dominique Béjean
>            Assignee: Ken Krugler
>
> When I try to extract text from PDF or DOC documents in Russian, Chinese, Korean, Serbian,
> ..., I get an error about an unsupported encoding.
> org.xml.sax.SAXException: java.io.UnsupportedEncodingException: 
> 	at org.apache.xml.serialize.BaseMarkupSerializer.startDocument(Unknown Source)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
> 	at org.apache.tika.sax.XHTMLContentHandler.startDocument(XHTMLContentHandler.java:93)
> 	...
> It works fine with English or other ISO-8859-1 languages.
> PDFBox extracts the text correctly, so I assume the problem is not in the libraries used
> for text extraction from the various formats, but in a later step.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

