[ https://issues.apache.org/jira/browse/TIKA-517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927367#action_12927367 ]
Dominique Béjean edited comment on TIKA-517 at 11/2/10 8:02 AM:
----------------------------------------------------------------
Hi,
Thank you for these replies.
While preparing a sample of my code, I ran some tests and I can no longer reproduce the issue.
My code looks like this:
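import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.xml.serialize.BaseMarkupSerializer;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.TextSerializer;

InputStream input = new FileInputStream("russian.pdf");
String contentType = "application/pdf";
String outputEncoding = "UTF-8";
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
context.set(Parser.class, parser);
Metadata metadata = new Metadata();
metadata.add("stream_content_type", contentType);
StringWriter writer = new StringWriter();
// The Xerces text serializer acts as the SAX ContentHandler for the parser
BaseMarkupSerializer serializer = new TextSerializer();
serializer.setOutputCharStream(writer);
serializer.setOutputFormat(new OutputFormat("text", outputEncoding, true));
parser.parse(input, serializer, metadata, context);
writer.close();
String content = writer.toString();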
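One thing that can be ruled out quickly is whether the JVM itself accepts the encoding name that gets passed to OutputFormat. The following is only a minimal sketch using java.nio.charset from the standard library; the class name EncodingCheck and the method isUsableEncoding are hypothetical, and it does not reproduce what the serializer does internally.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

class EncodingCheck {
    // Hypothetical helper: returns true if the JVM knows the named charset
    // and can encode text with it.
    static boolean isUsableEncoding(String name) {
        if (name == null || name.isEmpty()) {
            return false; // a missing or empty name can never be resolved
        }
        try {
            return Charset.isSupported(name) && Charset.forName(name).canEncode();
        } catch (IllegalCharsetNameException e) {
            return false; // the name contains characters that are not legal in a charset name
        }
    }

    public static void main(String[] args) {
        String outputEncoding = "UTF-8";
        System.out.println(outputEncoding + " usable: " + isUsableEncoding(outputEncoding));
        System.out.println("(empty) usable: " + isUsableEncoding(""));
    }
}
```

If the check returns false for the value of outputEncoding, the encoding name itself is worth looking at before anything else.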
If I reproduce the problem later, I will provide details.
Dominique
> java.io.UnsupportedEncodingException with Russian, Chinese, ... document
> ------------------------------------------------------------------------
>
> Key: TIKA-517
> URL: https://issues.apache.org/jira/browse/TIKA-517
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.7
> Environment: Macosx, Java 6, Eclipse
> Reporter: Dominique Béjean
> Assignee: Ken Krugler
>
> When I try to extract text from a PDF or DOC document in Russian, Chinese, Korean, Serbian, ..., I get an error about an unsupported encoding.
> org.xml.sax.SAXException: java.io.UnsupportedEncodingException:
> at org.apache.xml.serialize.BaseMarkupSerializer.startDocument(Unknown Source)
> at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
> at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
> at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
> at org.apache.tika.sax.XHTMLContentHandler.startDocument(XHTMLContentHandler.java:93)
> ...
> It works fine with English or ISO-8859-1 languages.
> PDFBox extracts the text correctly, so I assume the problem is not in the libraries used for text extraction from the various formats, but somewhere after that step.