tika-dev mailing list archives

From Chris Mattmann <chris.mattm...@jpl.nasa.gov>
Subject Re: [jira] Commented: (TIKA-6) Port Nutch (or better) MimeType detection system into Tika
Date Thu, 20 Sep 2007 17:12:23 GMT

> The problem is not about the code that reads and interprets the
> database, but about the database (freedesktop.org.xml) and the related
> database description (freedesktop.org.dtd).
> If we wanted we could recreate both the database description (by
> reading the spec and writing our own DTD file) and even the database
> (by collecting vast amounts of content type information) under the
> Apache license, but AFAIK the current versions included in the patch
> are largely based on the GPL-licensed versions from freedesktop.org.
> So my suggestion would be to drop the xml and dtd files from the patch
> and replace them with configuration options for pointing the (Apache
> licensed) code to externally acquired database files.

+1 to this. Then, how about creating a separate issue to develop Tika's own
mime type DTD (which, as I read the entire discussion, may be "based on" the
freedesktop one but should be our own) and a baseline mime type database?
What are your feelings about using the Nutch one as a starting point for the
Tika mime database? Again, I agree with you that this is a separate issue
w.r.t. TIKA-6, but I thought I'd just ask your opinion now.

Thanks for the insight. I will update the TIKA-6 patch to address the other
issues you raised, and to remove the freedesktop.org.{dtd|xml} files.


Chris Mattmann, Ph.D.
Cognizant Development Engineer
Early Detection Research Network Project

Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
