tika-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1080) Arabic characters under windows
Date Thu, 07 Feb 2013 15:27:13 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573587#comment-13573587 ]

Uwe Schindler commented on TIKA-1080:

In any case, I don't think the TIKA server should depend on the default locale/charset!
A server-side application like the TIKA server should always parse input and send output
in a well-defined charset (e.g., UTF-8). Otherwise, clients of the TIKA server app need to
know the default encoding on the server.
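
To illustrate the point, here is a minimal sketch (not TIKA code) of how the choice of charset alone can turn Arabic text into question marks. The class and method names are hypothetical, and ISO-8859-1 is only a stand-in for whatever Western default charset a Windows server might have.

```java
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of TIKA.
final class CharsetDemo {
    // Encode the text to bytes and decode it again with the same charset.
    // Characters the charset cannot represent are replaced during encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Five Arabic letters, written as escapes so the source file encoding does not matter.
        String arabic = "\u0645\u0631\u062D\u0628\u0627";

        // Assumption: ISO-8859-1 plays the role of a Western platform default.
        String lossy = roundTrip(arabic, StandardCharsets.ISO_8859_1);
        // An explicit, well-defined charset preserves the text.
        String safe = roundTrip(arabic, StandardCharsets.UTF_8);

        // Write the output with an explicit charset instead of the platform default.
        PrintWriter out = new PrintWriter(
                new OutputStreamWriter(System.out, StandardCharsets.UTF_8), true);
        out.println("ISO-8859-1 round trip: " + lossy);
        out.println("UTF-8 round trip:      " + safe);
    }
}
```

With ISO-8859-1 every Arabic letter comes back as a question mark, while the UTF-8 round trip returns the original string unchanged.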

I would recommend making TIKA completely locale/charset/... independent. In Lucene we have
an Ant plugin that checks the bytecode of all our classes to make sure they don't use any
methods that depend on default charsets. Recently (this week), I forked this plugin into a
separate Google Code project, and it is now usable from Ant or Maven: https://code.google.com/p/forbidden-apis/
and the corresponding blog entry: http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html
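
As a rough sketch of the kind of call such a check is meant to catch (the exact set of flagged methods is defined by the plugin, not by this example), compare a locale-sensitive String method with its explicit counterpart. The class and method names below are hypothetical.

```java
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical demo class; not part of TIKA or Lucene.
final class LocaleDemo {
    // Locale-independent lowercasing: the result is the same on every machine.
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Lowercasing under a given locale; this is what the no-argument
    // toLowerCase() effectively does with the JVM's default locale.
    static String lowerIn(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        PrintWriter out = new PrintWriter(
                new OutputStreamWriter(System.out, StandardCharsets.UTF_8), true);
        // Under Turkish rules, capital I lowercases to a dotless i (U+0131).
        out.println("Locale.ROOT: " + lowerRoot("TITLE"));
        out.println("Turkish:     " + lowerIn("TITLE", turkish));
        out.println("Equal? " + lowerRoot("TITLE").equals(lowerIn("TITLE", turkish)));
    }
}
```

Code that compares such strings against fixed keys works on one server and fails on another, which is why passing the locale or charset explicitly is the safer habit.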

I would strongly suggest adding the Maven Mojo (available via Maven Central). The webpage
still needs to be updated with usage instructions, but I can provide a patch.

> Arabic characters under windows
> -------------------------------
>                 Key: TIKA-1080
>                 URL: https://issues.apache.org/jira/browse/TIKA-1080
>             Project: Tika
>          Issue Type: Bug
>          Components: parser, server
>    Affects Versions: 1.3
>         Environment: Windows 2003 or Windows 2008
>            Reporter: Alberto Ornaghi
>         Attachments: arabic.docx
> If Tika is executed under Windows, the text mode (--text) fails to extract Arabic
> characters and outputs only question marks. The same behaviour occurs if Tika is executed
> as a server. The issue is not present in the GUI, only on the command line. The issue is
> not present if the output is HTML.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
