tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-232) Scanning of archive files
Date Fri, 22 May 2009 21:59:45 GMT

    [ https://issues.apache.org/jira/browse/TIKA-232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712283#action_12712283 ]

Jukka Zitting commented on TIKA-232:
------------------------------------

If you're instantiating the package parsers directly, then you can achieve this simply by
overriding the parser that is used for the files inside a package:

    PackageParser parser = ...;
    parser.setParser(new EmptyParser());

You could also use the following hack to do this for a pre-configured composite parser like
the AutoDetectParser:

    CompositeParser composite = new AutoDetectParser();
    for (Parser parser : composite.getParsers().values()) {
        if (parser instanceof PackageParser) {
            ((PackageParser) parser).setParser(new EmptyParser());
        }
    }
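
To make the pattern easier to try out without a particular Tika version on the classpath, here is a minimal self-contained sketch. Every type in it (Parser, TextParser, NoOpParser, ArchiveParser) and the "name=content;name=content" archive format are stand-ins invented for this illustration; they only mimic the roles of the Tika classes mentioned above and are not the Tika API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DelegateOverrideSketch {

    /** Stand-in parser interface (hypothetical, not Tika's Parser). */
    interface Parser {
        void parse(String input, List<String> out);
    }

    /** Extracts the text of a single file. */
    static class TextParser implements Parser {
        @Override
        public void parse(String input, List<String> out) {
            out.add("text: " + input);
        }
    }

    /** Extracts nothing; plays the role EmptyParser has in the message. */
    static class NoOpParser implements Parser {
        @Override
        public void parse(String input, List<String> out) {
            // Intentionally empty: entry contents are ignored.
        }
    }

    /** Stand-in package parser: reports each entry name, then delegates its content. */
    static class ArchiveParser implements Parser {
        private Parser entryParser = new TextParser();

        void setParser(Parser entryParser) {
            this.entryParser = entryParser;
        }

        @Override
        public void parse(String input, List<String> out) {
            // Assumed toy format: "name=content;name=content".
            for (String entry : input.split(";")) {
                String[] parts = entry.split("=", 2);
                out.add("entry: " + parts[0]);
                entryParser.parse(parts.length > 1 ? parts[1] : "", out);
            }
        }
    }

    /** Applies the override loop to a small composite, then parses the archive. */
    public static List<String> listEntriesOnly(String archive) {
        Map<String, Parser> parsers = new LinkedHashMap<>();
        parsers.put("text/plain", new TextParser());
        parsers.put("application/zip", new ArchiveParser());

        // Same shape as the loop above: swap the nested parser of every archive parser.
        for (Parser parser : parsers.values()) {
            if (parser instanceof ArchiveParser) {
                ((ArchiveParser) parser).setParser(new NoOpParser());
            }
        }

        List<String> out = new ArrayList<>();
        parsers.get("application/zip").parse(archive, out);
        return out;
    }

    public static void main(String[] args) {
        // prints [entry: a.txt, entry: b.txt]
        System.out.println(listEntriesOnly("a.txt=hello;b.txt=world"));
    }
}
```

Without the override loop the same call would also emit a "text: ..." line after every entry, which is the behaviour the issue asks to be able to switch off.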

Perhaps someone has a good idea how to make this easier?

> Scanning of archive files
> -------------------------
>
>                 Key: TIKA-232
>                 URL: https://issues.apache.org/jira/browse/TIKA-232
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.3
>         Environment: All
>            Reporter: Karl Heinz Marbaise
>            Priority: Minor
>
> If I parse an archive, all the files inside the archive will be extracted with their text
> as well. It would be nice to have the choice to extract only the list of files (directory)
> of an archive instead of extracting the whole contents. This seems to be useful only for
> zip, tar, tar.gz, tar.bz2, and .jar. Maybe this could be realized through a different call
> or a run-time configuration.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

