tika-dev mailing list archives

From Nick Burch <nick.bu...@alfresco.com>
Subject Re: Container Extractor?
Date Mon, 06 Sep 2010 11:19:09 GMT
On Wed, 1 Sep 2010, Nick Burch wrote:
> I've been thinking about extracting files from container formats (eg 
> images in a .docx, pdfs in a zip file etc).

I've been pondering the various feedback over the weekend, and hopefully 
now have a more detailed idea.

Firstly, the new service needs to work for both people who have the 
container file locally, and those streaming it remotely. Some container 
parsers may work better with input streams, some with files, so making the 
input contract be a TikaInputStream would seem to be the right way around 
this?

Next, how to control which child elements are returned. The container will
usually know the embedded file name, but not always, and will often know
its path (eg /foo/bar.txt in a zip file). It may sometimes know the mime
type. This seems to me too difficult to represent easily as a wish-list
filter. So, I now think that probably the only workable approach is to
offer all the details of every file to the consumer, and let them decide
if they're interested or not. Ideally, the container parser should do as
little work as possible until the consumer decides they want a file and
asks for its contents. (A filter wrapper can always be put around it as
required.)
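
As a rough illustration of the filter-wrapper idea, the sketch below wraps a
handler so that only matching entries are passed on. Everything here is an
assumption made for the example: the InputStreamSource and EmbededHandler
interfaces are modelled on the names used in this mail, FilteringEmbededHandler
and FilterDemo are made-up names, and none of it is existing Tika API. It also
uses lambdas from newer Java versions for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical types modelled on the proposal; not existing Tika API.
interface InputStreamSource {
    InputStream getInputStream() throws IOException;
}

interface EmbededHandler {
    void handle(String filename, String mimetype,
                InputStreamSource futureInputStream) throws IOException;
}

// Forwards only the entries the predicate accepts. The stream source is
// never touched for rejected entries, so their contents are never read.
final class FilteringEmbededHandler implements EmbededHandler {
    private final BiPredicate<String, String> wanted; // (filename, mimetype); either may be null
    private final EmbededHandler delegate;

    FilteringEmbededHandler(BiPredicate<String, String> wanted, EmbededHandler delegate) {
        this.wanted = wanted;
        this.delegate = delegate;
    }

    @Override
    public void handle(String filename, String mimetype,
                       InputStreamSource futureInputStream) throws IOException {
        if (wanted.test(filename, mimetype)) {
            delegate.handle(filename, mimetype, futureInputStream);
        }
    }
}

final class FilterDemo {
    // Feeds three made-up entries through a PDF-only filter and returns
    // the filenames that reached the wrapped handler.
    static List<String> run() {
        List<String> seen = new ArrayList<>();
        EmbededHandler pdfOnly = new FilteringEmbededHandler(
            (name, mime) -> "application/pdf".equals(mime)
                    || (name != null && name.endsWith(".pdf")),
            (name, mime, src) -> seen.add(name));
        InputStreamSource empty = () -> new ByteArrayInputStream(new byte[0]);
        try {
            pdfOnly.handle("/docs/a.pdf", null, empty);         // kept: name matches
            pdfOnly.handle("/README.txt", "text/plain", empty); // dropped
            pdfOnly.handle(null, "application/pdf", empty);     // kept: mime type matches
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seen;
    }
}
```

The predicate has to cope with a null filename or mime type, since the
container will not always know them.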

Nested embedded files - do we have a boolean flag for descend / don't
descend, or do we pass that choice back to the consumer on a
per-embedded-file basis, similar to above? I worry that the latter would
make things too complicated and heavy-weight, so I'm leaning towards the
simple boolean flag.
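
To make the boolean-flag option concrete, here is a small sketch of how a
descend / don't descend flag might behave. It works on an in-memory tree
instead of real container files; the Entry class and RecurseFlagSketch are
hypothetical names invented for the example and are not Tika API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory stand-in for a parsed container; not Tika API.
final class Entry {
    final String name;
    final List<Entry> children; // empty for a plain (non-container) file

    Entry(String name, Entry... children) {
        this.name = name;
        this.children = Arrays.asList(children);
    }
}

final class RecurseFlagSketch {
    // Lists the names of the files embedded in 'container'.
    // recurse == false: only the direct children are reported.
    // recurse == true:  nested containers are descended into, depth first.
    static List<String> extract(Entry container, boolean recurse) {
        List<String> out = new ArrayList<>();
        collect(container, recurse, out);
        return out;
    }

    private static void collect(Entry container, boolean recurse, List<String> out) {
        for (Entry child : container.children) {
            out.add(child.name);
            if (recurse) {
                collect(child, true, out);
            }
        }
    }
}
```

For a zip holding a.docx (which itself embeds image1.png) and b.pdf, passing
false reports only a.docx and b.pdf, while passing true also reports
image1.png.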

Finally, pull vs push for the consumer. The two forms would probably look 
something like:
====
// Pull form: the consumer iterates over the embedded entries.
// The for-each loop needs an Iterable, not an Iterator.
Iterable<Embeded> embeded = containerExtractor.extract(inp, false);
for(Embeded details : embeded) {
   if("application/pdf".equals(details.getMimeType()) ||
      "pdf".equals(details.getSuffix())) {
        handlePDF(details.getInputStream());
   }
   if("/README.txt".equals(details.getFilename())) {
        handleREADME(details.getInputStream());
   }
}
====
containerExtractor.extract(inp, false, new EmbededHandler() {
    public void handle(String filename, String mimetype, InputStreamSource
                           futureInputStream) {
        if("application/pdf".equals(mimetype) ||
               (filename != null && filename.endsWith("pdf"))) {
            handlePDF(futureInputStream.getInputStream());
        }
    }
});
====
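
For a feel of what the pull form could ask of a container parser, the sketch
below builds a one-shot Iterable on top of the JDK's java.util.zip classes.
The Embeded interface is a cut-down, hypothetical version of the one used in
the pull example above, and ZipPullSketch is a made-up name; this is an
illustration under those assumptions, not a proposed Tika implementation.
It needs Java 9 or later for readAllBytes().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical pull-style view of one embedded file; not Tika API.
interface Embeded {
    String getFilename();
    InputStream getInputStream(); // only valid until the iterator advances; do not close
}

final class ZipPullSketch {
    // One-shot Iterable over the entries of a zip stream. Entries the
    // consumer never reads are skipped when the iterator moves on.
    static Iterable<Embeded> extract(InputStream in) {
        final ZipInputStream zip = new ZipInputStream(in);
        return () -> new Iterator<Embeded>() {
            private ZipEntry next;
            private boolean fetched;

            @Override
            public boolean hasNext() {
                if (!fetched) {
                    try {
                        next = zip.getNextEntry(); // also skips unread bytes of the previous entry
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                    fetched = true;
                }
                return next != null;
            }

            @Override
            public Embeded next() {
                if (!hasNext()) throw new NoSuchElementException();
                fetched = false;
                final String name = next.getName();
                return new Embeded() {
                    @Override public String getFilename() { return name; }
                    @Override public InputStream getInputStream() { return zip; }
                };
            }
        };
    }

    // Builds a two-entry zip in memory so the sketch can be run stand-alone.
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry("README.txt"));
            out.write("hello".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("docs/a.pdf"));
            out.write("%PDF".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Reads the contents of the .pdf entry only; other entries are just named.
    static List<String> demo() {
        List<String> seen = new ArrayList<>();
        try {
            for (Embeded e : extract(new ByteArrayInputStream(sampleZip()))) {
                if (e.getFilename().endsWith(".pdf")) {
                    byte[] body = e.getInputStream().readAllBytes();
                    seen.add(e.getFilename() + ":" + new String(body, StandardCharsets.UTF_8));
                } else {
                    seen.add(e.getFilename());
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return seen;
    }
}
```

The awkward parts show up quickly: each entry's stream is only usable until
the iterator advances, and the iterator has to turn checked IO errors into
unchecked ones, which the push form would not need to do.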

I think the former would be a little bit more work for us, but is likely 
to lead to cleaner and simpler code for consumers. What do people think?

Nick
