tika-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2475) discrepancy between CharsetDetector APIs
Date Tue, 10 Oct 2017 21:15:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199367#comment-16199367 ]

ASF GitHub Bot commented on TIKA-2475:

seanstory opened a new pull request #210: fix for TIKA-2475 contributed by seanstory
URL: https://github.com/apache/tika/pull/210
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:

> discrepancy between CharsetDetector APIs
> ----------------------------------------
>                 Key: TIKA-2475
>                 URL: https://issues.apache.org/jira/browse/TIKA-2475
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.14, 1.15, 1.16
>         Environment: Mac OSX 10.12.6, Java 1.8.0_111
>            Reporter: Sean Story
>         Attachments: multi-language.txt
> h3. Problem
> I ran into this while using CharsetDetector to detect the charset of email attachments
> when the mail client doesn't specify one. This worked for us in Tika 1.10, but after a
> recent upgrade to 1.14 the behavior changed. I've attached a sample file (multi-language.txt)
> whose charset is ISO-8859-1, and which was detected as such with Tika 1.10. After we updated
> our Tika dependency, this sample data (a mix of English, Portuguese, and Spanish) was
> output as a lot of junk Chinese characters. On inspection, this turned out to be because
> our usage of the newer Tika version detected the file as UTF-16LE instead of ISO-8859-1.
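To make the symptom concrete, here is a minimal Java sketch of why single-byte text turns into "junk Chinese characters" when it is decoded as UTF-16LE. It does not use Tika at all; the class name `MisdecodeDemo` and the sample word are illustrative assumptions, not taken from the attached file. Each pair of Latin bytes is read as one 16-bit code unit, and pairs of lowercase ASCII letters fall inside the CJK Unified Ideographs block.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika or of the reporter's code.
public class MisdecodeDemo {

    // Encode text as ISO-8859-1, then decode those same bytes as UTF-16LE,
    // which is what happens downstream when the charset is misdetected.
    static String misdecode(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // Eight ASCII letters (identical bytes in ISO-8859-1) become four
        // 16-bit code units, low byte first: 'e' (0x65) + 's' (0x73) -> U+7365.
        String garbled = misdecode("espanhol");
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

An even-length input is used on purpose: with an odd number of bytes the final byte cannot form a complete UTF-16 code unit.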
> Below is a Spock test that demonstrates the issue:
> {noformat}
>     def "test charset detection on multilingual file"(){
>         setup:
>         def file = new File("src/test/resources/data/multi-language.txt")
>         when: "using the InputStream api"
>         def detector = new CharsetDetector()
>         detector.setText(file.newInputStream())
>         def fileCharSet = detector.detect()
>         then: "successfully detects the charset"
>         fileCharSet.name.startsWith("ISO")
>         when: "using the byte[] api, and munging the input"
>         detector = new CharsetDetector()
>         detector.setText(file.newInputStream().bytes)
>         detector.MungeInput()
>         fileCharSet = detector.detect()
>         then: "successfully detects the charset"
>         fileCharSet.name.startsWith("ISO")
>         when: "using the byte[] api alone"
>         detector = new CharsetDetector()
>         detector.setText(file.newInputStream().bytes)
>         fileCharSet = detector.detect()
>         then: "this will fail - detects UTF-16LE instead"
>         fileCharSet.name.startsWith("ISO")
>     }
> {noformat}
> As the test above shows, I believe the issue is that CharsetDetector's {{setText()}}
> overloads do not delegate to one another: the {{InputStream}} overload calls
> {{MungeInput()}}, and the {{byte[]}} overload does not.

This message was sent by Atlassian JIRA