tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1080) Arabic characters under windows
Date Thu, 07 Feb 2013 13:09:13 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573461#comment-13573461 ]

Nick Burch commented on TIKA-1080:

Windows is bad for this. I strongly suspect that either your terminal is configured with
a different encoding from the default one given to Java (you'll almost certainly want to
be using UTF-8 for both), or your terminal is configured with a font that can't render
Arabic characters.
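
As a rough way to check the first of those possibilities, the sketch below prints the
charset the JVM has picked as its default and shows what happens to Arabic text when it
is encoded with a charset that has no Arabic letters. The class name EncodingCheck and
the roundTrip helper are made up for illustration and are not part of Tika; the sample
word is just a placeholder. Characters that cannot be mapped are replaced with '?' on
encoding, which is consistent with the question marks described in the report.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only; not part of Tika.
public class EncodingCheck {

    // Encode the text with the given charset and decode it again, which mimics
    // text being written out with that charset and then read back by a terminal.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // (usually '?') for any character it cannot represent.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Five Arabic letters, written as escapes so the source file encoding
        // does not matter.
        String arabic = "\u0645\u0631\u062D\u0628\u0627";

        // What the JVM believes the platform encoding is.
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        // A charset without Arabic letters loses every character.
        System.out.println("US-ASCII round trip: "
                + roundTrip(arabic, StandardCharsets.US_ASCII));

        // UTF-8 can represent the text, so the round trip is lossless.
        System.out.println("UTF-8 round trip preserved text: "
                + roundTrip(arabic, StandardCharsets.UTF_8).equals(arabic));
    }
}
```

If the default charset reported here is not UTF-8, or differs from what the terminal is
using, that mismatch is the first thing to fix.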

I've just tried with your file on Linux, where the terminal encoding and JVM encoding are
both set to UTF-8, and it renders just fine for me with --text.
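
For anyone writing their own Java code around this, here is a minimal sketch of producing
output that is UTF-8 regardless of the JVM default, assuming the terminal (or whatever
reads the output) is also set to UTF-8. The class name Utf8Output and the utf8Stream
helper are hypothetical and only for illustration; this is not how the Tika command line
itself is implemented.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper class for illustration only; not part of Tika.
public class Utf8Output {

    // Wrap an output stream in a PrintStream that always encodes text as UTF-8,
    // instead of relying on the JVM's platform default charset.
    static PrintStream utf8Stream(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        // Five Arabic letters, written as escapes so the source file encoding
        // does not matter.
        out.println("\u0645\u0631\u062D\u0628\u0627");
    }
}
```

This only helps if the consumer decodes the bytes as UTF-8 and has a font with Arabic
glyphs; otherwise the output will still look wrong even though the bytes are correct.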
> Arabic characters under windows
> -------------------------------
>                 Key: TIKA-1080
>                 URL: https://issues.apache.org/jira/browse/TIKA-1080
>             Project: Tika
>          Issue Type: Bug
>          Components: parser, server
>    Affects Versions: 1.3
>         Environment: Windows 2003 or Windows 2008
>            Reporter: Alberto Ornaghi
>         Attachments: arabic.docx
>
> If tika is executed under windows the text mode (--text) is failing to extract arabic
> chars and outputs only question marks. The same behaviour occurs if tika is executed as a
> server. The issue is not present in the GUI, only commandline. The issue is not present if
> the output is html.

