tika-dev mailing list archives

From "Chris Lott (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-688) Enhance content-type detector to recognize almost plain text
Date Tue, 09 Aug 2011 18:56:27 GMT
Enhance content-type detector to recognize almost plain text
------------------------------------------------------------

                 Key: TIKA-688
                 URL: https://issues.apache.org/jira/browse/TIKA-688
             Project: Tika
          Issue Type: Improvement
          Components: mime
    Affects Versions: 0.9
            Reporter: Chris Lott
            Priority: Minor
             Fix For: 1.0


I am using Tika to convert a collection of documents that includes files named something.txt.
I use the Tika#parse(InputStream) interface, which auto-detects the content type. The files
are almost plain text: each has a scattering of control characters in it. On these files the
reader given to me by the Tika#parse() method immediately returns null. After some
experimentation I found that a single Ctrl-K character early in the file is enough to make
the MIME type detector give up and label the file application/octet-stream.

Please consider adding a recognizer for such files. It would be great if Tika could clean
them up by dropping the control characters. I note that if I drop the same file into the
Tika GUI, or invoke Tika on the command line, it does well, and I think this behavior is
obtained by using the file name as a hint. I should probably be using a different Tika
method and will try to figure that out next. Thanks for listening.
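
To make the request concrete, here is a minimal sketch of the kind of tolerant check and
cleanup I have in mind. It is not Tika code and does not reflect how Tika's detector works
internally. The class name AlmostPlainText, the 2% threshold, and the rule that any NUL byte
means binary are all assumptions made for illustration. It uses only the Java standard
library.

```java
/** Hypothetical helper, not part of Tika: a tolerant "almost plain text" check. */
final class AlmostPlainText {

    /** Assumed cutoff: up to 2% stray control bytes still counts as text. */
    static final double MAX_CONTROL_RATIO = 0.02;

    /** True for C0 control codes other than tab, LF, and CR, and for DEL. */
    static boolean isStrayControl(int b) {
        if (b == '\t' || b == '\n' || b == '\r') {
            return false;
        }
        return b < 0x20 || b == 0x7F;
    }

    /** Inspects at most the first {@code limit} bytes of {@code prefix}. */
    static boolean looksLikeText(byte[] prefix, int limit) {
        int n = Math.min(prefix.length, limit);
        if (n == 0) {
            return false; // nothing to inspect
        }
        int stray = 0;
        for (int i = 0; i < n; i++) {
            int b = prefix[i] & 0xFF;
            if (b == 0) {
                return false; // assumption: a NUL byte means binary data
            }
            if (isStrayControl(b)) {
                stray++;
            }
        }
        // Bytes >= 0x80 are treated as text (e.g. UTF-8 or Latin-1 content).
        return (double) stray / n <= MAX_CONTROL_RATIO;
    }

    /** Drops stray control characters from text that has already been decoded. */
    static String stripControls(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!isStrayControl(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

A caller would read the first few hundred bytes of the stream, pass them to
AlmostPlainText.looksLikeText(bytes, 512), and if that returns true, decode the file as text
and run it through stripControls. The NUL rule would misclassify UTF-16 text, so a real
recognizer would need to handle that case separately.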

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
