tika-dev mailing list archives

From "Shuai Liu (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-1437) encoding issue in AutoDetectReader
Date Sun, 05 Oct 2014 20:19:35 GMT
Shuai Liu created TIKA-1437:

             Summary: encoding issue in AutoDetectReader
                 Key: TIKA-1437
                 URL: https://issues.apache.org/jira/browse/TIKA-1437
             Project: Tika
          Issue Type: Bug
          Components: detector, parser
    Affects Versions: 1.6
         Environment: Windows 8
            Reporter: Shuai Liu
            Priority: Critical

We are having an encoding problem with Tika's AutoDetectReader.
We use AutoDetectReader to read a stream and extract string values by calling AutoDetectReader's readLine().
We found that the encoding problem occurs in UniversalEncodingDetector, which AutoDetectReader
calls when reading the input stream passed as one of the arguments to our
TSVParser's parse method.
We use AutoDetectReader in our parser and believed it could auto-detect the correct
encoding of the input stream passed to it, but we are seeing several garbled characters
in the output and converted files from our parser. We found that the encoding
problem occurs in UniversalEncodingDetector, which returns UTF-8, so AutoDetectReader
reads the stream as UTF-8, which is the incorrect encoding; and the correct encoding is

I am attaching a screenshot of what I am talking about. The following is a raw tsv file;
you can see that the hex code E9 appears as a character between "M" and "xico". I believe it is an ‘e’,
but in a different encoding/language.
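
To illustrate the symptom without Tika: a lone 0xE9 byte followed by an ASCII letter is not valid UTF-8, so a UTF-8 decoder substitutes a replacement character, whereas ISO-8859-1 maps 0xE9 to 'é'. The sketch below uses only the JDK. The class name and sample bytes are hypothetical (modeled on the screenshot), and ISO-8859-1 is only an assumed candidate, since the actual encoding of the file is not established here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika or of the attached test program.
final class E9Demo {
    // Assumed sample: 'M', the byte 0xE9, then "xico", as in the screenshot.
    static final byte[] SAMPLE = {'M', (byte) 0xE9, 'x', 'i', 'c', 'o'};

    // Lenient decode: malformed input is replaced rather than reported.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Strict check: returns false if the bytes are not well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("valid UTF-8?  " + isValidUtf8(SAMPLE));
        System.out.println("as UTF-8:      " + decode(SAMPLE, StandardCharsets.UTF_8));
        System.out.println("as ISO-8859-1: " + decode(SAMPLE, StandardCharsets.ISO_8859_1));
    }
}
```

A strict check like isValidUtf8 is one cheap way to notice that a UTF-8 guess cannot be right for a given file, independent of which detector produced the guess.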

The problem is that the AutoDetectReader is decoding and reading the chars with incorrect
BTW, we were able to work around this problem with CharsetDetector, which for the moment seems
to produce a valid encoding that we can use to read the tsv file properly.

However, the problem is that we cannot use AutoDetectReader; we had to create our own TSVAutoDetectReader
that uses CharsetDetector in its detect method. The AutoDetectReader class is
hard for us to extend: many of its methods are private,
so we cannot manually set the encoding or override the existing implementation for detecting
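
As a rough illustration of the kind of flexibility we are asking for, here is a minimal JDK-only sketch of a line reader that accepts an explicit charset override and otherwise falls back when the bytes are not well-formed UTF-8. This is not our TSVAutoDetectReader and does not use Tika or CharsetDetector; the class name, method names, and the choice of fallback charset are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; a sketch of "manual override, else validate, else fall back".
final class ExplicitCharsetLines {

    // Use the override if given; otherwise UTF-8 when the data is
    // well-formed UTF-8, and the caller-supplied fallback when it is not.
    static Charset chooseCharset(byte[] data, Charset override, Charset fallback) {
        if (override != null) {
            return override;
        }
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return fallback;
        }
    }

    // Decode the whole buffer with the chosen charset and split it into lines.
    static List<String> decodeLines(byte[] data, Charset override, Charset fallback) {
        List<String> lines = new ArrayList<>();
        String text = new String(data, chooseCharset(data, override, fallback));
        if (text.isEmpty()) {
            return lines;
        }
        for (String line : text.split("\r?\n")) {
            lines.add(line);
        }
        return lines;
    }

    // Buffer the stream fully so the bytes can be validated before decoding.
    static List<String> readLines(InputStream in, Charset override, Charset fallback)
            throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return decodeLines(buffer.toByteArray(), override, fallback);
    }
}
```

Buffering the whole stream is only reasonable for small files; it is done here so that validation sees every byte rather than a prefix. The fallback is a guess supplied by the caller, not a detected encoding.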

In addition, I am not confident about CharsetDetector either, as I am seeing different
encodings produced by CharsetDetector and AutoDetectReader for different tsv files. But for
now we can live with CharsetDetector, as CharsetDetector is solving the current encoding

Finally, I would like to share my test program (PFA: EncodingProblem.java). It
reads a directory of tsv files and, for each file, displays the encodings
reported by AutoDetectReader, UniversalEncodingDetector (which is called
by AutoDetectReader) and CharsetDetector, so you can see the difference: they
produce different encodings for some tsv files.

This message was sent by Atlassian JIRA
