tika-dev mailing list archives

From "Mattmann, Chris A (388J)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: Container Extractor?
Date Wed, 01 Sep 2010 14:18:52 GMT
+1, Nick, this sounds great...


On 9/1/10 2:54 AM, "Nick Burch" <nick.burch@alfresco.com> wrote:

Hi All

I've been thinking about extracting files from container formats (e.g.
images in a .docx, PDFs in a zip file, etc.). Given the number of recent
queries about embedded files and Tika, I was wondering if people
thought this might be something worth adding as another part of Tika?

My idea is that you'd pass a container file to this "service". You'd also
say whether you wanted recursion, and which mime types interest you. The
result would be, say, an iterator of input streams, which would probably
also let you get the filenames and mime types where supported by the
container.

Example uses would be:
* .doc file, non-recursive, request image/png and image/jpeg
   gives you all the images in the word document
* .ppt file, recursive, request excel
   gives you excel files embedded in the powerpoint, and excel files embedded
   in the word documents embedded in the powerpoint
* .docx file, non-recursive, request image/png
   treated as an OOXML file, not a plain zip file, and all png images
   from the magic embedded directory are returned.
* .zip file, recursive, request pdf
   gives you all pdf files anywhere in the zip
* .ogg file, non-recursive, request audio
   gives you the 3 different audio streams in your video file
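
To make the shape of the idea concrete, here is a minimal sketch of the zip
case only, using nothing beyond the JDK's java.util.zip. Everything in it is
an assumption for illustration: the class name ContainerExtractorSketch, the
Embedded holder, the zipOf helper, and the use of file extensions as a
stand-in for real mime type detection are all hypothetical and not part of
Tika. It returns a list rather than a lazy iterator, and buffers each entry
in memory, purely to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical sketch; none of these names are part of Tika. */
public class ContainerExtractorSketch {

    /** One extracted file: its path inside the container plus its bytes. */
    public static final class Embedded {
        public final String name;
        public final byte[] data;

        Embedded(String name, byte[] data) {
            this.name = name;
            this.data = data;
        }

        /** A fresh stream over the extracted bytes, e.g. to hand to a parser. */
        public InputStream stream() {
            return new ByteArrayInputStream(data);
        }
    }

    /**
     * Lists entries of a zip container whose extension is in wantedExtensions.
     * With recursive=true, nested .zip entries are searched as well.
     * The caller keeps ownership of the container stream.
     */
    public static List<Embedded> extract(InputStream container, boolean recursive,
                                         Set<String> wantedExtensions) {
        List<Embedded> found = new ArrayList<>();
        try {
            ZipInputStream zip = new ZipInputStream(container);
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                String name = entry.getName();
                byte[] data = readEntry(zip);
                String ext = extensionOf(name);
                if (wantedExtensions.contains(ext)) {
                    found.add(new Embedded(name, data));
                }
                if (recursive && ext.equals("zip")) {
                    // Prefix nested hits with the path of the inner container.
                    for (Embedded inner : extract(new ByteArrayInputStream(data), true, wantedExtensions)) {
                        found.add(new Embedded(name + "!/" + inner.name, inner.data));
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    /** Reads the current zip entry fully into memory. */
    private static byte[] readEntry(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Lower-cased extension without the dot, or "" if there is none. */
    private static String extensionOf(String name) {
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Demo helper: builds an in-memory zip from parallel name/content arrays. */
    public static byte[] zipOf(String[] names, byte[][] contents) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (int i = 0; i < names.length; i++) {
                zip.putNextEntry(new ZipEntry(names[i]));
                zip.write(contents[i]);
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Calling extract(stream, true, Collections.singleton("pdf")) on a zip that
holds another zip would correspond to the ".zip file, recursive, request pdf"
case above. Formats such as .doc, .ppt or .ogg would each need their own
container-aware reader, which this sketch does not attempt.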

You could pass the resulting input streams into the regular Tika parser if
you wanted to process them, or even just save them into a directory
if all you wanted was an extractor.
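
The "just save them into a directory" case might look roughly like the sketch
below. It is an illustration under assumptions, not a proposed API: the class
SaveExtractedSketch and the method saveAll are hypothetical names, and the
extractor's result is assumed to be available as a map from embedded
filename to input stream. Names that would resolve outside the target
directory (e.g. "../x") are skipped, since embedded filenames come from
untrusted files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;

/** Hypothetical sketch; not part of Tika. */
public class SaveExtractedSketch {

    /**
     * Writes each extracted stream to targetDir under its embedded name.
     * Returns the number of files written; unsafe names are skipped.
     */
    public static int saveAll(Map<String, InputStream> extracted, Path targetDir) {
        try {
            Files.createDirectories(targetDir);
            Path root = targetDir.toAbsolutePath().normalize();
            int saved = 0;
            for (Map.Entry<String, InputStream> e : extracted.entrySet()) {
                // Close every stream, including the ones we decide to skip.
                try (InputStream in = e.getValue()) {
                    Path out = root.resolve(e.getKey()).normalize();
                    // Refuse names that escape the target directory or name it directly.
                    if (!out.startsWith(root) || out.equals(root)) {
                        continue;
                    }
                    Files.createDirectories(out.getParent());
                    Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
                    saved++;
                }
            }
            return saved;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

Handing the same streams to a parser instead would only change the body of
the loop; the filename check is specific to writing to disk.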

What do people think? Is this useful? Is this appropriate for Tika? If yes
to these two, does the rough method signature sound sane?


PS I'm willing to do most of the coding on this if it's deemed suitable
    for Tika, but not for a few weeks probably, until Alfresco 3.4 is done

Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
