tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2554) Subtypes for common text formats currently included in text/plain
Date Thu, 25 Jan 2018 15:20:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339369#comment-16339369 ]

Nick Burch commented on TIKA-2554:

I've pulled Makefile and Config files out into their own {{text/x-}} types, and added a few
more extensions from the SVN eol-style file. If everyone's happy with the approach, we can
pull out some more groupings fairly easily.

> Subtypes for common text formats currently included in text/plain
> -----------------------------------------------------------------
>                 Key: TIKA-2554
>                 URL: https://issues.apache.org/jira/browse/TIKA-2554
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.17
>            Reporter: Nick Burch
>            Priority: Minor
> Currently, we have a very large number of file extension globs all feeding into the {{text/plain}}
mimetype. This includes not only variations on actual plain text, but also lots of other text-based
formats (e.g. config or makefile/autoconf files). This list dates back quite a while (TIKA-85
seems to have added most of them).
> While this simplifies things in Tika, it has the downside of making it very tricky for
people to add custom parsers for these text-based formats (e.g. [https://stackoverflow.com/questions/48411421/define-a-mime-type-for-txt-files-for-tika],
where they want to handle .cfg files differently from other .txt files).
> Because of how {{AutoDetectParser}} works, as long as there's no more specific parser
defined, if we create some new {{text/}} subtypes which extend {{text/plain}} then there won't
be any change in parsing behaviour. The only change would be for detection, where a more specific
type would be returned.
> I therefore propose that we pull some of these (file-magic-less) globs out into other
{{text/}} mimetypes with a parent of {{text/plain}}, grouped roughly by type.
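
The fallback behaviour described in the issue can be illustrated with a small standalone model. This is only a sketch and does not use Tika's classes or reproduce how {{AutoDetectParser}} is implemented. The type name {{text/x-makefile}}, the {{.mk}} and {{.cfg}} extension mappings, and the parser labels {{PlainTextParser}} and {{MyConfigParser}} are all assumptions made for illustration. It shows that a new subtype changes what detection returns, while parser lookup still walks up to {{text/plain}} until someone registers a more specific parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Simplified model of glob-based detection plus parser lookup via supertypes. */
public class SubtypeFallbackSketch {
    // Hypothetical glob table: file name suffix -> detected media type.
    static final Map<String, String> GLOBS = new LinkedHashMap<>();
    // Subtype -> parent type (the "extends text/plain" relationship).
    static final Map<String, String> PARENTS = new LinkedHashMap<>();
    // Media type -> label of the parser registered for it (labels are made up).
    static final Map<String, String> PARSERS = new LinkedHashMap<>();

    static {
        GLOBS.put(".txt", "text/plain");
        GLOBS.put(".cfg", "text/x-config");     // assumed type name
        GLOBS.put(".mk", "text/x-makefile");    // assumed type name and extension
        PARENTS.put("text/x-config", "text/plain");
        PARENTS.put("text/x-makefile", "text/plain");
        PARSERS.put("text/plain", "PlainTextParser");
    }

    /** Detection by file name only, since these formats have no file magic. */
    static String detect(String fileName) {
        for (Map.Entry<String, String> glob : GLOBS.entrySet()) {
            if (fileName.endsWith(glob.getKey())) {
                return glob.getValue();
            }
        }
        return "application/octet-stream";
    }

    /** Use the parser for the type itself, else the nearest ancestor that has one. */
    static Optional<String> parserFor(String type) {
        String current = type;
        while (current != null) {
            String parser = PARSERS.get(current);
            if (parser != null) {
                return Optional.of(parser);
            }
            current = PARENTS.get(current);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Detection is more specific, but parsing is unchanged.
        String type = detect("app.cfg");
        System.out.println(type + " -> " + parserFor(type).orElse("none"));
        // prints "text/x-config -> PlainTextParser"

        // A user can now target .cfg files without affecting other text files.
        PARSERS.put("text/x-config", "MyConfigParser");
        System.out.println(type + " -> " + parserFor(type).orElse("none"));
        // prints "text/x-config -> MyConfigParser"
        System.out.println("text/plain -> " + parserFor(detect("notes.txt")).orElse("none"));
        // prints "text/plain -> PlainTextParser"
    }
}
```

In this model, registering a parser against the subtype is the only step needed to handle .cfg differently from .txt, which is the use case from the Stack Overflow question linked in the issue.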
