tika-dev mailing list archives

From Paul Jakubik <p...@purediscovery.com>
Subject Re: Packages and attributes
Date Thu, 15 Jul 2010 23:43:28 GMT
On Thu, Jul 15, 2010 at 6:43 AM, Jukka Zitting <jukka.zitting@gmail.com> wrote:

> The way I recommend is to pass a custom Parser implementation through
> the ParseContext. This gives you detailed access to each component
> document.
>
>
I looked at the code a little further, and I don't see exactly how I can do
this.

I am using an AutoDetectParser, and in my ParseContext I've placed
another AutoDetectParser. At the top level I might be parsing a tar.gz,
and inside this tar.gz there are text, PDF, and zip files.

As far as I can tell, when I start to parse files embedded in one of the
containers (tar.gz or zip), it is actually PackageExtractor that gets the
parser from the ParseContext, and it is also PackageExtractor that
creates a new Metadata object that it doesn't share, thus keeping me
from being able to look at the metadata.

Does this mean that, to get access to the metadata for subdocuments,
I would need to do the following?
* Create replacements for PackageParser and PackageExtractor
  that do what I want with the metadata
* Use getParsers() and setParsers() on the AutoDetectParser to
  replace the parser for each of the following MediaTypes:

                MediaType.application("x-archive"),
                MediaType.application("x-bzip"),
                MediaType.application("x-bzip2"),
                MediaType.application("x-cpio"),
                MediaType.application("x-gtar"),
                MediaType.application("x-gzip"),
                MediaType.application("x-tar"),
                MediaType.application("zip")
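
To make the second step concrete, here is a rough sketch of the
"copy the map, swap the entries, set it back" idea. It does not use the
Tika classes: SketchParser, ParserRegistrySketch and replaceFor are
hypothetical names, and plain strings stand in for MediaType, so treat
it as an outline of the approach rather than working Tika code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parser; this is not Tika's Parser interface.
interface SketchParser {
    String name();
}

class ParserRegistrySketch {
    // Returns a copy of the registry in which every listed type maps to
    // the replacement parser. The original map is left untouched.
    static Map<String, SketchParser> replaceFor(
            Map<String, SketchParser> registry,
            List<String> types,
            SketchParser replacement) {
        Map<String, SketchParser> updated = new HashMap<>(registry);
        for (String type : types) {
            updated.put(type, replacement);
        }
        return updated;
    }

    public static void main(String[] args) {
        SketchParser stock = () -> "stock";
        SketchParser custom = () -> "custom";

        // Stand-in for the map that "get parsers" would hand back.
        Map<String, SketchParser> registry = new HashMap<>();
        registry.put("application/zip", stock);
        registry.put("application/pdf", stock);

        // Swap in the custom parser for the package types only.
        Map<String, SketchParser> updated = replaceFor(
                registry,
                Arrays.asList("application/zip", "application/x-tar"),
                custom);

        // The updated map is what would be passed to "set parsers".
        for (Map.Entry<String, SketchParser> e : updated.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().name());
        }
    }
}
```

Types that are not in the list (PDF here) keep whatever parser they
already had, which is the point of copying the existing map first.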

I wonder if it would be easier to update PackageExtractor to check whether
there is a metadata stack in the ParseContext and, if so, push the
new metadata object just before parsing a subdocument and pop the
metadata object just after the parse (maybe just after writing the
end of the <div> section).
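
A minimal sketch of the push/pop idea, to show what I mean. Nothing
here is existing Tika code: Context stands in for ParseContext,
MetadataStack is a hypothetical type that the caller would place in the
context, and a plain Map stands in for Metadata.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MetadataStackSketch {
    // Hypothetical stand-in for ParseContext: a lookup from class to object.
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();
        <T> void set(Class<T> key, T value) { entries.put(key, value); }
        <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
    }

    // Hypothetical stack the caller puts in the context to observe
    // the metadata of each subdocument.
    static final class MetadataStack {
        private final Deque<Map<String, String>> stack = new ArrayDeque<>();
        private final List<Map<String, String>> seen = new ArrayList<>();
        void push(Map<String, String> metadata) { stack.push(metadata); }
        Map<String, String> pop() {
            Map<String, String> metadata = stack.pop();
            seen.add(metadata); // keep it so the caller can inspect it later
            return metadata;
        }
        int depth() { return stack.size(); }
        List<Map<String, String>> seen() { return seen; }
    }

    // What the extractor would do around each subdocument.
    static void parseEntry(String entryName, Context context) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("resourceName", entryName);

        MetadataStack stack = context.get(MetadataStack.class);
        if (stack != null) {
            stack.push(metadata); // just before parsing the subdocument
        }
        try {
            // The real code would run the embedded parser here.
            metadata.put("parsed", "true");
        } finally {
            if (stack != null) {
                stack.pop(); // just after the parse
            }
        }
    }

    public static void main(String[] args) {
        Context context = new Context();
        MetadataStack stack = new MetadataStack();
        context.set(MetadataStack.class, stack);

        parseEntry("a.txt", context);
        parseEntry("b.pdf", context);

        for (Map<String, String> metadata : stack.seen()) {
            System.out.println(metadata.get("resourceName"));
        }
    }
}
```

If no stack is in the context, the extractor behaves as it does today,
so callers that don't care about subdocument metadata are unaffected.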

Paul
