tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: High Cohesion, Low Coupling
Date Sun, 17 Aug 2008 09:09:28 GMT

On Sat, Aug 16, 2008 at 10:52 PM, Keith R. Bennett <kbennett@bbsinc.biz> wrote:
> Do we intend to parse zip and tar files that contain multiple files?  I'll
> apologize in advance if we've already discussed this and I've forgotten.

See TIKA-149 where Dave Meikle has been helping us cross that bridge. :-)

> If so, I'm a little concerned that the code base might be made more
> difficult to maintain and extend, if we consider parsing these equivalent to
> parsing documents.  I think that unpacking these composite files is a task
> that is orthogonal to extracting text and metadata -- in fact, IMHO it would
> be better to use a word other than "parse" to refer to this action.

I disagree. The application/zip format is just another file format, and
in a way it's no different from something like a Word or PDF
document with attachments in it.

I think composite files are well within the scope of Tika, while
things like crawling a file system or parsing an HTTP response are
clearly outside the scope.
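
To make the "just another file format" point concrete, here is a minimal
sketch using only java.util.zip. It is not Tika's API: the class
CompositeSketch and the methods extractAll, looksLikeZip and zipOf are
hypothetical names chosen for illustration, and anything that is not a zip
is simply assumed to be UTF-8 text. The idea is that one recursive extract
step handles plain documents and containers alike, much as a parser would
descend into attachments.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration only; none of these names come from Tika.
class CompositeSketch {

    // Returns one "path: text" line per leaf document found in the input.
    static List<String> extractAll(String name, byte[] data) {
        List<String> out = new ArrayList<>();
        extract(name, data, out);
        return out;
    }

    // A zip is treated like any other format: its "content" is whatever
    // its entries yield, so nested archives are handled by recursion.
    private static void extract(String path, byte[] data, List<String> out) {
        if (looksLikeZip(data)) {
            try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
                ZipEntry entry;
                while ((entry = zip.getNextEntry()) != null) {
                    if (entry.isDirectory()) {
                        continue;
                    }
                    // readAllBytes() stops at the end of the current entry.
                    extract(path + "/" + entry.getName(), zip.readAllBytes(), out);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        } else {
            // Assumption for the sketch: every non-zip input is UTF-8 text.
            out.add(path + ": " + new String(data, StandardCharsets.UTF_8));
        }
    }

    // Checks for the local file header signature "PK\003\004".
    // An empty archive starts with a different signature and is not matched.
    static boolean looksLikeZip(byte[] data) {
        return data.length >= 4
                && data[0] == 'P' && data[1] == 'K'
                && data[2] == 3 && data[3] == 4;
    }

    // Helper that builds an in-memory zip so the sketch needs no files.
    static byte[] zipOf(String[] names, byte[][] contents) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buf)) {
            for (int i = 0; i < names.length; i++) {
                zip.putNextEntry(new ZipEntry(names[i]));
                zip.write(contents[i]);
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        byte[] inner = zipOf(new String[] {"b.txt"},
                new byte[][] {"world".getBytes(StandardCharsets.UTF_8)});
        byte[] outer = zipOf(new String[] {"a.txt", "inner.zip"},
                new byte[][] {"hello".getBytes(StandardCharsets.UTF_8), inner});
        extractAll("outer.zip", outer).forEach(System.out::println);
    }
}
```

The caller never needs to know whether it handed over a single document or
a container, which is the sense in which unpacking stays inside the parsing
step rather than becoming a separate task.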


Jukka Zitting
