tika-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2554) Subtypes for common text formats currently included in text/plain
Date Thu, 25 Jan 2018 16:00:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339424#comment-16339424 ]

Hudson commented on TIKA-2554:

SUCCESS: Integrated in Jenkins build Tika-trunk #1425 (See [https://builds.apache.org/job/Tika-trunk/1425/])
TIKA-2554 Separate out Makefile from text/plain to a specific subtype (nick: [https://github.com/apache/tika/commit/db75e85fc9cf0d5f2c25f7eae2ff8deb59611b00])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
TIKA-2554 Separate out Config formats from text/plain to a specific (nick: [https://github.com/apache/tika/commit/1ba30ef32fee372566790650e9ab8a36bc9ab807])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml

> Subtypes for common text formats currently included in text/plain
> -----------------------------------------------------------------
>                 Key: TIKA-2554
>                 URL: https://issues.apache.org/jira/browse/TIKA-2554
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.17
>            Reporter: Nick Burch
>            Priority: Minor
> Currently, we have a very large number of file extension globs all feeding into the {{text/plain}}
mimetype. This includes not only variations on actual plain text, but also lots of other text-based
formats (e.g. config or makefile/autoconf files). This list dates back quite a while (TIKA-85
seems to have added most of them).
> While this simplifies things in Tika, it has the downside of making it very tricky for
people to add custom parsers for these text-based formats (e.g. [https://stackoverflow.com/questions/48411421/define-a-mime-type-for-txt-files-for-tika],
where they want to handle .cfg differently from other .txt files).
> Because of how {{AutoDetectParser}} works, as long as there's no more specific parser
defined, if we create some new {{text/}} subtypes which extend {{text/plain}} then there won't
be any change in parsing behaviour. The only change would be for detection, where a more specific
type would be returned.
> I therefore propose that we pull some of these (file-magic-less) globs out into other
{{text/}} mimetypes with a parent of {{text/plain}}, grouped roughly by type.
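
The fallback argument in the description (a new subtype keeps being parsed as plain text until someone registers a more specific parser) can be illustrated with a small toy model. This is only a sketch of the idea, not Tika's actual classes or API: the class SubtypeFallbackSketch, the method resolveParser, the subtype names text/x-config and text/x-makefile, and the parser names are all hypothetical and chosen for illustration.

```java
import java.util.Map;

/**
 * Toy model of supertype fallback when choosing a parser.
 * Hypothetical sketch only; it does not use or mirror Tika's real classes.
 */
final class SubtypeFallbackSketch {

    /**
     * Walks from the detected type up through its parents and returns the
     * first parser registered along the way, or null if none is found.
     * Assumes the parent map contains no cycles.
     */
    static String resolveParser(String type,
                                Map<String, String> parentOf,
                                Map<String, String> parserFor) {
        for (String t = type; t != null; t = parentOf.get(t)) {
            String parser = parserFor.get(t);
            if (parser != null) {
                return parser;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical subtypes, each declared with text/plain as its parent.
        Map<String, String> parentOf = Map.of(
                "text/x-config", "text/plain",
                "text/x-makefile", "text/plain");

        // Only a plain-text parser is registered: the subtype falls back to it.
        Map<String, String> defaults = Map.of("text/plain", "PlainTextParser");
        System.out.println(resolveParser("text/x-config", parentOf, defaults));
        // prints "PlainTextParser"

        // A user registers a parser for the more specific type only.
        Map<String, String> custom = Map.of(
                "text/plain", "PlainTextParser",
                "text/x-config", "CustomConfigParser");
        System.out.println(resolveParser("text/x-config", parentOf, custom));
        // prints "CustomConfigParser"
        System.out.println(resolveParser("text/x-makefile", parentOf, custom));
        // prints "PlainTextParser"
    }
}
```

In this model, splitting the globs out changes only which type detection reports; the parser chosen stays the same until a parser is registered for the subtype itself, which is the hook the Stack Overflow question was missing.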

This message was sent by Atlassian JIRA
