tika-dev mailing list archives

From "Mattmann, Chris A (388J)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: Container Extractor?
Date Tue, 07 Sep 2010 13:55:56 GMT
Hey Guys,

I've been following this discussion, and one thing I'd like to add is that scientific data
formats exhibit most of the same properties as the container formats. For instance,
NetCDF does not support random access over a stream, and the existing Java APIs for those
files require the full file to be available on disk before any information can be extracted
from it. HDF is similar. So I'm going to follow this discussion more closely now, as I see
it coming closer to a concrete idea! ;) I've been watching the TikaInputStream work that
Jukka has been doing, and I think it's a good starting point for addressing some of these
issues.
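(To illustrate the constraint: formats like NetCDF and HDF need a seekable file on disk, so a stream-only source has to be spooled to a temporary file first. The sketch below shows that general pattern using only the JDK; it is not Tika's actual API, and the class and method names are made up for illustration.)

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: spool a non-seekable stream to a temporary file so
// that libraries requiring random access (e.g. NetCDF/HDF readers) can
// then open it by path. TikaInputStream provides a similar facility.
public class StreamSpooler {
    public static File spoolToTempFile(InputStream in) throws IOException {
        File tmp = File.createTempFile("spool-", ".bin");
        tmp.deleteOnExit();
        // Copy the entire stream to disk before any random-access reads happen.
        Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }
}
```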


On 9/7/10 3:39 AM, "Nick Burch" <nick.burch@alfresco.com> wrote:

On Tue, 7 Sep 2010, Jukka Zitting wrote:
> On Mon, Sep 6, 2010 at 1:19 PM, Nick Burch <nick.burch@alfresco.com> wrote:
>> Finally, pull vs push for the consumer.
>> [...]
>> I think the former would be a little bit more work for us, but is likely to
>> lead to cleaner and simpler code for consumers. What do people think?
> I'd start with a push mechanism as that supports streaming and is
> better in line with the current design of Tika.

OK, that seems sensible to me. We'll go for a push option where you
specify a callback helper that'll be triggered for each embedded file. It'd
then be up to you to decide whether you want the contents, based on
the filename and/or MIME type.
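(A rough sketch of what such a push callback could look like. All names here are hypothetical, invented for illustration, and not a proposed or actual Tika interface; the "container walk" is simulated with a plain map standing in for real container entries.)

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

// Hypothetical callback sketch: the container parser pushes each embedded
// file; the consumer first decides, from name/type alone, whether it
// wants the contents at all.
public class ContainerDemo {
    public interface EmbeddedFileHandler {
        boolean wantsFile(String name, String mediaType);
        void handleFile(String name, String mediaType, InputStream contents) throws IOException;
    }

    // Simulated container walk: fires the handler for each entry,
    // skipping the contents of entries the consumer declined.
    public static int walk(Map<String, byte[]> entries, EmbeddedFileHandler handler)
            throws IOException {
        int handled = 0;
        for (Map.Entry<String, byte[]> e : entries.entrySet()) {
            String type = e.getKey().endsWith(".txt") ? "text/plain" : "application/octet-stream";
            if (handler.wantsFile(e.getKey(), type)) {
                handler.handleFile(e.getKey(), type, new ByteArrayInputStream(e.getValue()));
                handled++;
            }
        }
        return handled;
    }
}
```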

In terms of a fully streaming approach, though, I'm not sure how easy it'll
be. Reviewing the different container formats, the extent to which they're
streamable vs. needing buffering is:
* Tar (+compressed) - can be streamed
* Ogg / Avi / etc - different parts of the file are interlaced. If we
    support streaming, the callbacks would need to handle being run
    in parallel, which might add too much complexity for users?
* OLE2 - can't be streamed, we're going to have to buffer the whole file,
    load it into POIFS, and only then start returning things
* Zip - we'll need (at least) two passes. On the first pass we'll look
    at what files it contains, and use that to figure out if it's
    .docx, Keynote, OpenOffice etc, or just a plain zip. If it's a plain
    zip, the second pass will return each file in turn. If it's a zip-based
    document format, filetype-specific code will identify the embedded
    media for that format, and return each in turn.
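(The zip first pass could look something like the sketch below, which just lists entry names and checks for marker entries. The marker choices are illustrative only; real detection would be more thorough, e.g. checking the ODF `mimetype` entry's contents.)

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Sketch of the first pass over a buffered zip: collect entry names,
// then use well-known marker entries to guess the flavour.
public class ZipSniffer {
    public static String sniff(byte[] zipBytes) throws IOException {
        Set<String> names = new HashSet<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        if (names.contains("word/document.xml")) return "docx";
        if (names.contains("mimetype")) return "opendocument";
        return "plain-zip";
    }
}
```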

I'd see this as meaning that you pass in a TikaInputStream to the service,
along with a callback handler. If streaming is supported for the container
format, it will stream through the file, firing the callback handler as it
goes. For most cases, the file will be buffered (to disk or memory as
appropriate), the appropriate bits identified, and then the callback
handler fired for each embedded file.

Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
Phone: +1 (818) 354-8810
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
