[ https://issues.apache.org/jira/browse/TIKA-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573587#comment-13573587 ]

Uwe Schindler commented on TIKA-1080:
-------------------------------------

In any case, I don't think the TIKA server should depend on the default locale/charset! A server-side application like the TIKA server should always parse input and send output in a well-defined charset (e.g., UTF-8). Otherwise, clients of the TIKA server app need to know the default encoding on the server.

I would recommend making TIKA completely independent of locale, charset, and similar platform defaults. In Lucene we have an ANT plugin that checks the bytecode of all our classes to make sure they don't use any methods that depend on default charsets. Recently (this week), I forked this plugin into a separate Google Code project, and it is now usable from ANT or Maven: https://code.google.com/p/forbidden-apis/ and the corresponding blog entry: http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html

I would strongly suggest adding the Maven Mojo (available via Maven Central). The webpage still needs to be updated with usage instructions, but I can provide a patch.

> Arabic characters under windows
> -------------------------------
>
>                 Key: TIKA-1080
>                 URL: https://issues.apache.org/jira/browse/TIKA-1080
>             Project: Tika
>          Issue Type: Bug
>          Components: parser, server
>    Affects Versions: 1.3
>         Environment: Windows 2003 or Windows 2008
>            Reporter: Alberto Ornaghi
>         Attachments: arabic.docx
>
>
> If Tika is executed under Windows, text mode (--text) fails to extract Arabic characters and outputs only question marks.
> The same behaviour occurs if Tika is executed as a server.
> The issue is not present in the GUI, only on the command line.
> The issue is not present if the output is HTML.
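
To illustrate the point in the comment about not relying on the default charset, here is a minimal sketch in Java. It is not taken from the Tika code base; the class and method names (ExplicitCharsetExample, writeUtf8, readUtf8, lossyLatin1) are hypothetical and exist only for this example. It assumes Java 7 or later for StandardCharsets. The first two helpers pass UTF-8 explicitly, so the result is the same on every platform. The third shows one plausible way question marks can appear: encoding Arabic text with a charset that cannot represent it. Whether this is the actual cause of TIKA-1080 is an assumption, not something the report confirms.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetExample {

    // Hypothetical helper: encodes text with an explicitly chosen charset
    // (UTF-8) instead of whatever the platform default happens to be.
    public static byte[] writeUtf8(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
        return out.toByteArray();
    }

    // Hypothetical helper: decodes bytes with the same explicit charset.
    public static String readUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: round-trips text through ISO-8859-1, which has no
    // Arabic letters. String.getBytes replaces each unmappable character
    // with the charset's replacement byte, which is '?' for ISO-8859-1.
    public static String lossyLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        // Five Arabic letters, written as escapes so that the source file
        // encoding does not matter.
        String arabic = "\u0645\u0631\u062D\u0628\u0627";

        byte[] encoded = writeUtf8(arabic);
        System.out.println(readUtf8(encoded).equals(arabic)); // round trip is lossless
        System.out.println(lossyLatin1(arabic));              // only question marks
    }
}
```

Code that instead calls new OutputStreamWriter(out), new String(bytes), or text.getBytes() without a charset argument picks up the platform default, which is the kind of call the bytecode check described in the comment is meant to flag.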