[ https://issues.apache.org/jira/browse/TIKA-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573493#comment-13573493 ]

Alberto Ornaghi commented on TIKA-1080:
---------------------------------------

I know Windows is problematic, but I don't think this is a problem with the terminal. That is why I ran my test over the network with the --server option. The strange part is that it works in HTML format but not in text mode. My "receiving" terminal was on OS X, which fully supports UTF-8.

Unfortunately, in our project we are forced to run Tika under Windows, and I'm trying to extract Arabic characters as plain text. My current workaround is to get the HTML and then strip the HTML entities, but that is not optimal.

Why does the text-only output differ from the HTML output?

> Arabic characters under windows
> -------------------------------
>
>                 Key: TIKA-1080
>                 URL: https://issues.apache.org/jira/browse/TIKA-1080
>             Project: Tika
>          Issue Type: Bug
>          Components: parser, server
>    Affects Versions: 1.3
>         Environment: Windows 2003 or Windows 2008
>            Reporter: Alberto Ornaghi
>         Attachments: arabic.docx
>
>
> If Tika is executed under Windows, the text mode (--text) fails to extract Arabic characters and outputs only question marks. The same behaviour occurs if Tika is executed as a server. The issue is not present in the GUI, only on the command line. The issue is not present if the output is HTML.
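
For illustration only, here is a minimal Java sketch of one plausible source of the question marks: when a Java string is encoded with a charset that has no mapping for Arabic characters, each unmappable character is replaced with '?'. This is an assumption about the symptom, not a confirmed diagnosis of TIKA-1080. The class name EncodingDemo and the method roundTrip are hypothetical, and ISO-8859-1 merely stands in for whatever non-Unicode default charset a JVM on Windows might pick up.

```java
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Encode the text to bytes with the given charset, then decode it again.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // ('?' for ISO-8859-1) for every character it cannot map.
    static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) throws Exception {
        // Five Arabic letters, written as escapes so the source file
        // encoding does not matter.
        String arabic = "\u0645\u0631\u062D\u0628\u0627";

        // Wrap stdout with an explicit UTF-8 encoding instead of relying
        // on the platform default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");

        out.println("default charset: " + Charset.defaultCharset());
        out.println("ISO-8859-1 round trip: "
                + roundTrip(arabic, StandardCharsets.ISO_8859_1));
        out.println("UTF-8 round trip:      "
                + roundTrip(arabic, StandardCharsets.UTF_8));
    }
}
```

If the text output path encodes with the platform default charset while the HTML path does not, that would be consistent with the behaviour described above, but I have not verified this against the Tika sources.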