tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2554) Subtypes for common text formats currently included in text/plain
Date Thu, 25 Jan 2018 14:04:00 GMT
Nick Burch created TIKA-2554:

             Summary: Subtypes for common text formats currently included in text/plain
                 Key: TIKA-2554
                 URL: https://issues.apache.org/jira/browse/TIKA-2554
             Project: Tika
          Issue Type: Improvement
          Components: mime
    Affects Versions: 1.17
            Reporter: Nick Burch

Currently, we have a very large number of file extension globs all feeding into the {{text/plain}}
mimetype. This includes not only variations on actual plain text, but also lots of other text-based
formats (e.g. config or makefile/autoconf files). This list dates back quite a while (TIKA-85
seems to have added most of them).

While this simplifies things in Tika, it has the downside of making it very tricky for people
to add custom parsers for these text-based formats (e.g. [https://stackoverflow.com/questions/48411421/define-a-mime-type-for-txt-files-for-tika],
where the poster wants to handle .cfg files differently from other .txt files).

Because of how {{AutoDetectParser}} works, if we create some new {{text/}} subtypes which extend
{{text/plain}} then, as long as there's no more specific parser defined for them, there won't be any
change in parsing behaviour. The only change would be for detection, where a more specific
type would be returned.
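
As a rough illustration of the fallback described above, here is a toy model of the lookup. It is not Tika's actual code and uses no Tika classes; the subtype names ({{text/x-config}}, {{text/x-makefile}}), the parser labels and the {{resolveParser}} helper are all made up for the example. The idea it assumes is simply: look for a parser registered for the detected type, and if there is none, walk up to the parent type.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: NOT Tika's AutoDetectParser, just the parent-fallback idea.
class SubtypeFallbackSketch {
    // child type -> parent type (hypothetical subtype names)
    static final Map<String, String> PARENT = Map.of(
            "text/x-config", "text/plain",
            "text/x-makefile", "text/plain");

    // Walk from the detected type up through its parents until some
    // registered parser is found; returns null if nothing matches.
    static String resolveParser(String type, Map<String, String> parsers) {
        for (String t = type; t != null; t = PARENT.get(t)) {
            String parser = parsers.get(t);
            if (parser != null) {
                return parser;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> parsers = new HashMap<>();
        parsers.put("text/plain", "plain-text parser");

        // No specific parser registered: the subtype falls back to its parent.
        System.out.println(resolveParser("text/x-config", parsers));

        // A user registers a custom parser for the subtype only.
        parsers.put("text/x-config", "custom config parser");
        System.out.println(resolveParser("text/x-config", parsers));

        // Other subtypes are unaffected and still use the parent's parser.
        System.out.println(resolveParser("text/x-makefile", parsers));
    }
}
```

In this model, adding the subtypes changes nothing about which parser runs until someone registers a parser for a specific subtype, which is the behaviour the paragraph above relies on.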

I therefore propose that we pull some of these (file-magic-less) globs out into other {{text/}}
mimetypes with a parent of {{text/plain}}, grouped roughly by type.

