tika-dev mailing list archives

From Nick Burch <nick.bu...@alfresco.com>
Subject Re: Container Extractor?
Date Thu, 02 Sep 2010 10:05:12 GMT
On Wed, 1 Sep 2010, Andrzej Bialecki wrote:
>> I was thinking recursive could mean different things. For zip files, tar
>> files etc, it would probably just mean root directory vs descend into
>> all directories.
>
> There are no directories in these formats - it's just a flat namespace 
> that just happens to use the filesystem conventions. Java APIs for these 
> containers also provide only simple iterators. So I'm not sure if 
> there's any benefit to this distinction here... maybe provide a 
> FilenameFilter to control what path names to process?

OK, looks like a directory descent on/off isn't a great fit.

I guess we'll want to provide two ways to filter, one by filename (which 
is normally available), and one by mime type (which is sometimes 
available). Or I guess a callback of "do you want this one?" where we pass 
in all the information we have to hand. Any thoughts?
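
To make the callback idea concrete, here is a rough sketch of what "do you want this one?" might look like. None of these names exist in Tika; EntryFilterSketch, EntryInfo, byNameSuffix and byMimeType are all made up for illustration, and I'm assuming either piece of information can be missing (null) for a given entry.

```java
import java.util.Locale;
import java.util.function.Predicate;

public class EntryFilterSketch {
    /** Hypothetical holder for whatever we know about an embedded entry. */
    static final class EntryInfo {
        final String name;      // normally available, but may be null
        final String mimeType;  // only sometimes available, may be null

        EntryInfo(String name, String mimeType) {
            this.name = name;
            this.mimeType = mimeType;
        }
    }

    /** Filter by filename suffix; entries without a name are rejected. */
    static Predicate<EntryInfo> byNameSuffix(String suffix) {
        String wanted = suffix.toLowerCase(Locale.ROOT);
        return e -> e.name != null
                && e.name.toLowerCase(Locale.ROOT).endsWith(wanted);
    }

    /** Filter by mime type; entries without a mime type are rejected. */
    static Predicate<EntryInfo> byMimeType(String mimeType) {
        return e -> mimeType.equals(e.mimeType);
    }

    public static void main(String[] args) {
        // The general callback is just a predicate over everything we know,
        // so the two simple filters can be combined.
        Predicate<EntryInfo> wanted =
                byNameSuffix(".doc").or(byMimeType("application/msword"));

        System.out.println(wanted.test(new EntryInfo("docs/report.doc", null)));
        System.out.println(wanted.test(new EntryInfo(null, "application/msword")));
        System.out.println(wanted.test(new EntryInfo("image.png", "image/png")));
    }
}
```

With a single callback type, the filename and mime type filters become two stock implementations rather than two separate mechanisms.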

> On the other hand I see a benefit in having an option to automatically 
> descend into embedded archives.

So we'd have some sort of filtering, plus a descend yes/no option? For a 
zip, the former exposes all files from all "directories", and the latter 
makes it descend into both embedded zips and other embedded containers 
like .doc? For a .docx, the former exposes all embedded files (but none of 
the ooxml file format internals), and the latter controls whether other 
embedded office documents are processed?
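
For the zip case, the descend option could behave roughly like the sketch below. It uses only java.util.zip, not any Tika API; DescendSketch, list and zip are hypothetical names, and spotting an embedded zip by its ".zip" suffix is a simplifying assumption (real detection would want the mime type). Other containers such as .doc are not handled here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DescendSketch {
    /** Lists every file entry; with descend=true, embedded zips are opened too. */
    static List<String> list(byte[] zipData, boolean descend) {
        List<String> names = new ArrayList<>();
        try {
            collect(zipData, "", descend, names);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    private static void collect(byte[] zipData, String prefix, boolean descend,
                                List<String> out) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipData))) {
            ZipEntry entry;
            // The API is a simple iterator over a flat namespace.
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                String name = prefix + entry.getName();
                out.add(name);
                // Assumption: an embedded zip is recognised by its suffix only.
                if (descend && name.toLowerCase(Locale.ROOT).endsWith(".zip")) {
                    // Buffer the embedded archive (Java 9+), then recurse into it.
                    collect(zin.readAllBytes(), name + "!/", descend, out);
                }
            }
        }
    }

    /** Test helper: builds an in-memory zip from parallel name/content arrays. */
    static byte[] zip(String[] names, byte[][] contents) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            for (int i = 0; i < names.length; i++) {
                zout.putNextEntry(new ZipEntry(names[i]));
                zout.write(contents[i]);
                zout.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] text = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] inner = zip(new String[] {"b.txt"}, new byte[][] {text});
        byte[] outer = zip(new String[] {"dir/a.txt", "inner.zip"},
                           new byte[][] {text, inner});

        System.out.println(list(outer, false)); // [dir/a.txt, inner.zip]
        System.out.println(list(outer, true));  // [dir/a.txt, inner.zip, inner.zip!/b.txt]
    }
}
```

Note that "dir/a.txt" shows up either way, since the "directory" is only part of the entry name; the flag only changes whether the embedded archive is opened.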

>> For OLE2, it would mean checking embedded documents of
>> embedded documents (normally but not always by means of descending into
>> child directories). Maybe there's a clearer name for this sort of thing?
>
> OLE2 is nothing special, it's the same with other archive types, you can 
> always have embedded archives within archives.

The OLE2 files aren't always so nice. Some store embedded files as 
directory entries, some stash them away in records...

Nick
