tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1437) encoding issue in AutoDetectReader
Date Mon, 06 Oct 2014 14:05:33 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160312#comment-14160312 ]

Tim Allison commented on TIKA-1437:
-----------------------------------

No encoding detector will be perfect.  

Are you sure that the encoding of the attached file is not UTF-8?  Internet Explorer "guesses"
ISO-8859-1, which is clearly not right.  When I tell IE to use UTF-8, the accented characters
are correctly displayed.
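
A quick way to check this without relying on a browser is to decode the bytes strictly as UTF-8 and see whether any sequence is rejected. The following is only a sketch using the JDK; the class and method names (Utf8Check, isValidUtf8) are made up for illustration and are not part of Tika. A clean strict decode does not prove the file is UTF-8, but ISO-8859-1 text with accented characters will almost always fail it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika.
class Utf8Check {
    /** Returns true if the bytes decode as UTF-8 with no malformed sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);        // e-acute as 0xC3 0xA9
        byte[] latin1 = "\u00e9".getBytes(StandardCharsets.ISO_8859_1); // e-acute as 0xE9
        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin1)); // false
    }
}
```

To test a real file, the bytes could be read with java.nio.file.Files.readAllBytes and passed to the same method.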

You can set which encoding detectors to use in the services file:

tika-parsers/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector

IIRC, the current algorithm runs HtmlEncodingDetector, then (if no encoding is found) UniversalEncodingDetector,
then (if no encoding is found) Icu4jEncodingDetector.

You can reorder these components or enable your own via the services file.
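
To illustrate why the order in the services file matters, here is a rough sketch of a "first detector that returns an answer wins" chain. It is not Tika's actual code: the Detector interface, the DetectorChain class and the toy detectors are hypothetical stand-ins, and only the try-in-order-until-non-null behaviour is taken from the description above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not Tika's implementation.
class DetectorChain {
    /** Stand-in for a detector: returns null when it cannot decide. */
    interface Detector {
        Charset detect(byte[] head);
    }

    /** Tries detectors in list order and returns the first non-null answer. */
    static Charset detect(List<Detector> chain, byte[] head, Charset fallback) {
        for (Detector d : chain) {
            Charset cs = d.detect(head);
            if (cs != null) {
                return cs;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Toy detector: only answers when a UTF-8 byte order mark is present.
        Detector bom = head -> head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF
                ? StandardCharsets.UTF_8 : null;
        // Toy detector: always answers ISO-8859-1.
        Detector alwaysLatin1 = head -> StandardCharsets.ISO_8859_1;

        byte[] noBom = {(byte) 0xE9};
        // bom declines, so the next detector in the list decides.
        System.out.println(detect(Arrays.asList(bom, alwaysLatin1), noBom, StandardCharsets.US_ASCII)); // ISO-8859-1
    }
}
```

With this shape, a detector placed earlier in the list overrides everything after it whenever it returns a result, which is why reordering the entries changes the detected encoding.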

I agree that we should better document how to do this.  

Finally, are you sure that you should be sharing the attached file with the world?

> encoding issue in AutoDetectReader
> ----------------------------------
>
>                 Key: TIKA-1437
>                 URL: https://issues.apache.org/jira/browse/TIKA-1437
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 1.6
>         Environment: Windows 8
>            Reporter: Shuai Liu
>            Priority: Critical
>         Attachments: EncodingProblem.java, computrabajo-ar-20121108.tsv, e9.jpg, ef.jpg
>
>
> We are having an encoding problem with Tika's AutoDetectReader.
> We use AutoDetectReader to read a stream and extract string values by calling
AutoDetectReader.readLine(). The encoding problem occurs in UniversalEncodingDetector,
which AutoDetectReader calls when reading the input stream passed as one of the
arguments to our TSVParser’s parse method.
> We use AutoDetectReader in our parser and believed it could auto-detect the
correct encoding of the input stream passed to it, but we are seeing several garbled
characters in the converted files that our parser outputs. We found that the
encoding problem occurs in UniversalEncodingDetector, which returns UTF-8, so
AutoDetectReader reads the stream as UTF-8. That is incorrect; the correct
encoding is ISO-8859-1.
> I am attaching screenshots of the character differences we see between the input tsv
file and the converted/output file. They are e9.jpg and ef.jpg; please read the description
for details.
> The problem is that AutoDetectReader decodes and reads the characters with the incorrect
encoding.
> BTW, we were able to work around this problem with CharsetDetector, which for the moment
seems to produce a valid encoding that we can use to read the tsv file properly.
> However, the problem is that we cannot use AutoDetectReader; we had to create our own
TSVAutoDetectReader that uses CharsetDetector in its detect method. The AutoDetectReader class
is hard for us to extend: many of its methods are private, so we cannot manually set the
encoding or override the existing encoding detection implementation.
> In addition, I am not confident about CharsetDetector either, as I am seeing different
encodings produced by CharsetDetector and AutoDetectReader for different tsv files. But for
now we can live with CharsetDetector, as it solves the current encoding
problem.
> Finally, I am also attaching my test program (EncodingProblem.java), which reads
an input tsv directory and lists, for each tsv file in it, the encodings
produced by AutoDetectReader, UniversalEncodingDetector (which is
called by AutoDetectReader) and CharsetDetector, so you can see the difference:
they produce different encodings for some tsv files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
