tika-dev mailing list archives

From Nick Burch <nick.bu...@alfresco.com>
Subject Re: Container Extractor?
Date Thu, 02 Sep 2010 10:05:12 GMT
On Wed, 1 Sep 2010, Andrzej Bialecki wrote:
>> I was thinking recursive could mean different things. For zip files, tar
>> files etc, it would probably just mean root directory vs descend into
>> all directories.
> There are no directories in these formats - it's just a flat namespace 
> that just happens to use the filesystem conventions. Java APIs for these 
> containers also provide only simple iterators. So I'm not sure if 
> there's any benefit to this distinction here... maybe provide a 
> FilenameFilter to control what path names to process?

OK, looks like a directory descent on/off isn't a great fit.

I guess we'll want to provide two ways to filter, one by filename (which 
is normally available), and one by mime type (which is sometimes 
available). Or I guess a callback of "do you want this one?" where we pass 
in all the information we have to hand. Any thoughts?
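
To make the callback idea concrete, here is a rough sketch of what a "do you want this one?" hook might look like. None of these names (EntryInfo, EntryFilter, byNameSuffix, byMediaType) exist in Tika; they are made up for illustration, and the media type is allowed to be null because we only sometimes know it.

```java
import java.util.ArrayList;
import java.util.List;

public class EntryFilterSketch {
    /** Hypothetical holder for whatever we know about one embedded entry. */
    public static final class EntryInfo {
        final String name;      // normally available
        final String mediaType; // null when the container doesn't tell us

        public EntryInfo(String name, String mediaType) {
            this.name = name;
            this.mediaType = mediaType;
        }
    }

    /** Hypothetical "do you want this one?" callback. */
    public interface EntryFilter {
        boolean accept(EntryInfo entry);
    }

    /** Filter on the entry name, e.g. ".doc". */
    public static EntryFilter byNameSuffix(String suffix) {
        return e -> e.name != null && e.name.endsWith(suffix);
    }

    /** Filter on the media type; entries with an unknown type are rejected. */
    public static EntryFilter byMediaType(String type) {
        return e -> type.equals(e.mediaType);
    }

    /** Returns the names of the entries the filter accepts, in order. */
    public static List<String> select(List<EntryInfo> entries, EntryFilter filter) {
        List<String> accepted = new ArrayList<>();
        for (EntryInfo e : entries) {
            if (filter.accept(e)) {
                accepted.add(e.name);
            }
        }
        return accepted;
    }
}
```

With a single callback like this, the filename and mime type filters become two stock implementations rather than two separate mechanisms.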

> On the other hand I see a benefit in having an option to automatically 
> descend into embedded archives.

So we'd have some sort of filtering, and the descend yes/no option? For a 
zip, the former exposes all files from all "directories", and the latter 
will cause it to descend into both embedded zips, and other embedded 
containers like .doc? For a .docx, the former exposes all embedded files 
(but none of the ooxml file format stuff), and the latter controls if 
other embedded office documents are processed?
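
For the zip case, the descend yes/no option might behave roughly like the sketch below. It uses only java.util.zip; the class name, the "!/" path separator, and spotting nested archives by a ".zip" suffix are all assumptions for illustration (a real version would presumably use detection rather than the name). Note that the entries come back as one flat list of names that merely contain slashes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipDescentSketch {
    /**
     * Lists every file entry in the zip. When descend is true, entries
     * whose name ends in ".zip" are opened and listed as well, with
     * their names prefixed by the path of the enclosing archive.
     */
    public static List<String> list(InputStream in, boolean descend, String prefix)
            throws IOException {
        List<String> names = new ArrayList<>();
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue; // directory markers carry no content
            }
            String path = prefix + entry.getName();
            names.add(path);
            if (descend && entry.getName().toLowerCase().endsWith(".zip")) {
                // Buffer the nested archive so the outer stream stays usable.
                byte[] nested = readAll(zip);
                names.addAll(list(new ByteArrayInputStream(nested), true, path + "!/"));
            }
        }
        return names;
    }

    /** Reads the rest of the current entry into memory. */
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

Buffering each nested archive in memory keeps the sketch short, but it would not be a sensible choice for large embedded files.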

>> For OLE2, it would mean checking embedded documents of
>> embedded documents (normally but not always by means of descending into
>> child directories). Maybe there's a clearer name for this sort of thing?
> OLE2 is nothing special, it's the same with other archive types, you can 
> always have embedded archives within archives.

The OLE2 files aren't always so nice. Some store embedded files as 
directory entries, some stash them away in records...

