tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (TIKA-867) UTF-8 encoding does not work on windows
Date Fri, 18 May 2012 17:23:08 GMT

     [ https://issues.apache.org/jira/browse/TIKA-867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Jukka Zitting resolved TIKA-867.

    Resolution: Not A Problem

The reason we did TIKA-324 is that the default platform encoding reported by the JVM on
Mac OS X differs from the encoding the console actually uses.

AFAICT this is not the case on Windows. The reason you're seeing a problem here is that you're
explicitly parsing the output as UTF-8 instead of using the default platform encoding. Using
UTF-8 is fine, but as you say, you'll then want to explicitly tell Tika with the -e option
to encode its output as UTF-8 instead of using the default encoding.
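
A minimal sketch of the mismatch described above, not Tika code. It assumes
ISO-8859-1 as a stand-in for a non-UTF-8 platform default; the class and
method names are hypothetical. The bytes are produced in one encoding and
then decoded either with the same encoding or with UTF-8, the way the
reporter's new String(buffer, 0, len, Charset.forName("UTF-8")) call does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only illustrates the encode/decode mismatch.
class EncodingMismatchDemo {

    // Simulates a child process writing text in the given output encoding.
    static byte[] encode(String text, Charset outputEncoding) {
        return text.getBytes(outputEncoding);
    }

    // Simulates the caller decoding the captured output buffer.
    static String decode(byte[] buffer, Charset assumedEncoding) {
        return new String(buffer, 0, buffer.length, assumedEncoding);
    }

    public static void main(String[] args) {
        // Assumption: ISO-8859-1 stands in for the platform default encoding.
        Charset platformDefault = StandardCharsets.ISO_8859_1;
        String original = "W\u00e4hrung"; // "Währung"

        byte[] buffer = encode(original, platformDefault);

        // Decoding with the encoding that was used for writing round-trips.
        String matched = decode(buffer, platformDefault);

        // Decoding the same bytes as UTF-8 hits a malformed sequence, which
        // Java replaces with U+FFFD, so the umlaut is lost.
        String mismatched = decode(buffer, StandardCharsets.UTF_8);

        System.out.println(matched.equals(original));    // round-trip intact
        System.out.println(mismatched.equals(original)); // umlaut was lost
    }
}
```

Either side of the pipe can be changed to make the two encodings agree:
decode with the platform default (Charset.defaultCharset()), or keep UTF-8
in the caller and have Tika write UTF-8 with the -e option.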

To see the difference, and why Tika uses the platform default encoding instead of always
UTF-8 for the --text output, try running {{java -jar tika-app-1.1.jar test.doc}}
in a Windows command prompt on a document that contains non-ASCII content. By default the
output is correct, but if you explicitly set the encoding to UTF-8, the command prompt
window garbles the output.
> UTF-8 encoding does not work on windows
> ---------------------------------------
>                 Key: TIKA-867
>                 URL: https://issues.apache.org/jira/browse/TIKA-867
>             Project: Tika
>          Issue Type: Bug
>          Components: cli
>    Affects Versions: 1.0
>         Environment: Windows 7 Enterprise (Java 1.6.0_31) and MAC OS X 10.7.3 (Java 1.6.0_30)
>            Reporter: Wolfgang Außerlechner
>         Attachments: TIKA-867.patch
> When calling tika as command line tool from within java and parsing the output buffer
> with UTF-8 (e.g. new String(buffer, 0, len, Charset.forName("UTF-8"));) behaviour on windows
> is different than on mac os.
> On windows the encoding seems to be wrong (Währung vs. W?hrung). Other tools like exiftool
> work as expected.
