tika-dev mailing list archives

From "Keith R. Bennett (JIRA)" <j...@apache.org>
Subject [jira] Created: (TIKA-40) Tika needs to support diverse character encodings.
Date Mon, 01 Oct 2007 23:02:50 GMT
Tika needs to support diverse character encodings.
--------------------------------------------------

                 Key: TIKA-40
                 URL: https://issues.apache.org/jira/browse/TIKA-40
             Project: Tika
          Issue Type: New Feature
          Components: general
    Affects Versions: 0.1-incubator
            Reporter: Keith R. Bennett
             Fix For: 0.1-incubator


Currently, the text parser implementation uses the Java runtime's default encoding when it
instantiates a Reader for the input stream passed to it, so text in any other encoding may
be decoded incorrectly.  We need to support other encodings as well.

It would be helpful if the parse method accepted an encoding as a parameter.
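
As a rough illustration of the request above, here is a minimal sketch of reading a stream
with a caller-supplied encoding.  The class name EncodingAwareTextReader and the method
readAll are hypothetical and are not part of Tika; only the java.io and java.nio.charset
calls are real JDK APIs.  Passing null falls back to the runtime default, which is the
current behaviour described in this issue.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper, not Tika code: decodes a stream with an explicit charset.
final class EncodingAwareTextReader {

    // Reads the whole stream as text. If charset is null, the JVM default is used,
    // which mirrors the behaviour the issue wants to move away from.
    static String readAll(InputStream in, Charset charset) throws IOException {
        Charset effective = (charset != null) ? charset : Charset.defaultCharset();
        StringBuilder text = new StringBuilder();
        Reader reader = new InputStreamReader(in, effective);
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            text.append(buffer, 0, n);
        }
        return text.toString();
    }
}
```

A caller that knows the document is Latin-1 could then pass StandardCharsets.ISO_8859_1
instead of relying on whatever default the host platform happens to use.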

Ideally, Tika would also provide the ability to determine the encoding automatically based
on the data stream.  (Unicode files may have byte order marks (http://unicode.org/faq/utf_bom.html#BOM),
but I don't know if other encodings can be inferred from content.)
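
The byte order mark check mentioned above could be sketched roughly as follows.  BomSniffer
and detect are hypothetical names, not Tika APIs.  The sketch only recognises the Unicode
BOM signatures and returns null for everything else; it does not attempt to infer
non-Unicode encodings, which would need some form of content heuristics.

```java
// Hypothetical helper, not Tika code: maps a leading byte order mark to a charset name.
final class BomSniffer {

    // Returns the charset name implied by a BOM at the start of the data,
    // or null if no known BOM is present.
    static String detect(byte[] prefix) {
        // The 4-byte UTF-32 marks are checked first because the UTF-32LE mark
        // (FF FE 00 00) begins with the UTF-16LE mark (FF FE). A UTF-16LE file
        // whose first character is U+0000 would be misread here; this sketch
        // accepts that ambiguity.
        if (startsWith(prefix, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(prefix, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(prefix, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(prefix, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(prefix, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the first bytes of data, as unsigned values, against the signature.
    private static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }
}
```

A parser could read the first four bytes of the stream, call detect, and fall back to a
caller-supplied or default encoding when the result is null.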


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

