tika-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-469) The Parser is not correctly outputting Arabic text documents
Date Wed, 16 Feb 2011 16:53:24 GMT

    [ https://issues.apache.org/jira/browse/TIKA-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995382#comment-12995382 ]

Ken Krugler commented on TIKA-469:

Hi Robert - do you have an example of an HTML file?

I'm asking because if an HTML document is encoded as UTF-8, the only reasons I can think of
for the character encoding to be messed up are (a) the HTML meta tag uses an encoding name
that isn't supported by Java, or (b) there is no charset specified in the response header
or the HTML meta tags, and the algorithmic detection of the character encoding is also failing.
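
To make the two cases concrete, here is a minimal sketch in plain Java of how a declared
charset name might be checked before use. It is not Tika's actual code: the class
CharsetCheck and the method resolve are hypothetical names, and the fallback argument only
stands in for whatever algorithmic detection would otherwise produce.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Tika.
final class CharsetCheck {

    // Returns the Charset for a declared name, or the fallback when the name is
    // missing, syntactically illegal, or not supported by this JVM.
    static Charset resolve(String declaredName, Charset fallback) {
        if (declaredName == null || declaredName.trim().isEmpty()) {
            return fallback; // case (b): no charset declared anywhere
        }
        String name = declaredName.trim();
        try {
            // case (a): the name is legal but this JVM may not know it
            return Charset.isSupported(name) ? Charset.forName(name) : fallback;
        } catch (IllegalCharsetNameException e) {
            return fallback; // case (a): the name is not even a legal charset name
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(resolve("no-such-charset", StandardCharsets.UTF_8));
        System.out.println(resolve(null, StandardCharsets.UTF_8));
    }
}
```

If the declared name is rejected and the fallback is also wrong, the bytes get decoded with
the wrong charset, which is one way UTF-8 text can end up garbled.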


-- Ken

> The Parser is not correctly outputting Arabic text documents
> ------------------------------------------------------------
>                 Key: TIKA-469
>                 URL: https://issues.apache.org/jira/browse/TIKA-469
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Windows XP
>            Reporter: Robert Cullen
>         Attachments: TEST_WORD.doc, fever_factsheet_arabic.pdf
> The parser is not preserving the character encoding when parsing documents in Arabic
> UTF-8, specifically with .pdf and .doc.  The resulting character output is undecipherable
> or just question-mark symbols.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

