tika-dev mailing list archives

From "david lemon (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-1001) tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset
Date Wed, 03 Oct 2012 21:33:07 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

david lemon updated TIKA-1001:
------------------------------

    Description: 
attached document extracts correctly in Tika 1.1
attached document extracts incorrectly in tika 1.2.

The difference appears to be that tika 1.1 honors the http meta content-type tag which specifies
the charset as iso-8859-6, and correctly converts the output to UTF-8.
tika 1.2 appears to ignore the charset specified in the meta tag.
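
To make the expected behaviour concrete, here is a minimal sketch of what honoring the meta tag amounts to: read the charset declared in the markup, decode the bytes with it, and emit UTF-8. This is an illustration only, not Tika's code. The sample page, the regular expression, and the helper names `sniff_meta_charset` and `to_utf8` are all assumptions made for the example, and the attached badarabic.html is not reproduced here.

```python
import re

# Hypothetical sample page (not the attached badarabic.html): the body
# bytes are ISO-8859-6 and the meta tag declares that charset.
ARABIC_TEXT = "مرحبا"
HTML_BYTES = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=iso-8859-6"></head><body>'
    + ARABIC_TEXT.encode("iso-8859-6")
    + b"</body></html>"
)

# Simplified pattern for a charset declaration; real detectors are stricter.
META_CHARSET = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE)


def sniff_meta_charset(raw, default="utf-8"):
    """Return the charset named in the first 8 KiB of markup, or the default."""
    match = META_CHARSET.search(raw[:8192])
    return match.group(1).decode("ascii").lower() if match else default


def to_utf8(raw):
    """Decode with the declared charset, then re-encode as UTF-8."""
    return raw.decode(sniff_meta_charset(raw)).encode("utf-8")


if __name__ == "__main__":
    # Honoring the declaration recovers the Arabic text.
    print(to_utf8(HTML_BYTES).decode("utf-8"))
    # Ignoring it (here: assuming Latin-1) yields unreadable output instead.
    print(HTML_BYTES.decode("latin-1"))
```

Decoding the same bytes with any other single-byte charset succeeds without error but produces the wrong characters, which is consistent with garbage output rather than a parse failure.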

Some noodling seems to indicate that the problem is the charset.

it doesn't matter what mode tika is used in (server, app mode, etc.); even if content-type is
specified with a charset, the output is still garbage.


  was:
attached document extracts correctly in Tika 1.1
attached document extracts incorrectly in tika 1.2.

The difference appears to be that tika 1.1 honors the http meta content-type tag which specifies
the charset as iso-8859-6, and correctly converts the output to UTF-8.
tika 1.2 appears to ignore the charset specified in the meta tag.

Some noodling seems to indicate that the problem is the charset.


    
> tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset
> ---------------------------------------------------------------------------------
>
>                 Key: TIKA-1001
>                 URL: https://issues.apache.org/jira/browse/TIKA-1001
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2
>            Reporter: david lemon
>         Attachments: badarabic.html
>
>
> attached document extracts correctly in Tika 1.1
> attached document extracts incorrectly in tika 1.2.
> The difference appears to be that tika 1.1 honors the http meta content-type tag which specifies the charset as iso-8859-6, and correctly converts the output to UTF-8.
> tika 1.2 appears to ignore the charset specified in the meta tag.
> Some noodling seems to indicate that the problem is the charset.
> it doesn't matter what mode tika is used in (server, app mode, etc.); even if content-type is specified with a charset, the output is still garbage.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
