tika-dev mailing list archives

From Chris Mattmann <chris.mattm...@jpl.nasa.gov>
Subject Re: Metadata use by Apache Java projects
Date Mon, 19 Nov 2007 17:27:56 GMT
Hi Folks,
>> Sanselan and Tika have both chosen a very simple approach but is it
>> versatile enough for the future? While the simple Map<String, String[]> in
>> Tika allows for multiple authors, for example, it doesn't support
>> language alternatives for things such as dc:title or dc:description.
> IMHO it would be good to have a more flexible metadata model in Tika.
> Better yet if it's a standard used across multiple projects. Best if
> we don't need to implement it in Tika. :-)

I'm not quite sure I understand in what way Tika's metadata model isn't
flexible enough. Of course, I'm a bit biased, but I'm really trying to
understand here and haven't been able to. I think it's important to realize
that a balance must be struck between bloating a metadata library (by bolting
on RDF support, inference, synonym support, etc.) and making sure that the
smallest subset of it is actually useful.

Also, I'd be against moving Metadata support out of Tika because that was
one of the project's original goals (Metadata support), and I think it's
advantageous for Tika to be a provider for a Metadata capability (of course,
one related to document/content extraction).

I'm also wondering what it means that Tika doesn't support "language
alternatives". Do you mean synonyms? Also, you mention it's relatively easy
in other libraries to map between different file format metadata. I think
that this is fairly easy to do in Tika too, seeing as its primary
purpose is to support metadata extraction from different file formats.
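
To make the question concrete, here is a minimal sketch of one possible
reading of "language alternatives": the same property (say dc:title) carried
in several languages, as XMP does with its language-alternative arrays. That
reading is an assumption, and the class and method names below are
hypothetical; none of this is Tika's or Sanselan's actual API. The flat
Map<String, String[]> holds several authors without trouble, but two titles in
different languages end up as indistinguishable values, whereas a map keyed by
language tag keeps them apart.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from Tika or Sanselan.
class LangAltSketch {

    // Flat multi-valued model, as described in the quoted text.
    static Map<String, String[]> flat() {
        Map<String, String[]> md = new LinkedHashMap<>();
        // Multiple authors fit naturally.
        md.put("dc:creator", new String[] { "A. Author", "B. Author" });
        // Two titles in different languages: nothing records which is which.
        md.put("dc:title", new String[] { "Annual Report", "Jahresbericht" });
        return md;
    }

    // One possible alternative: each value keyed by a language tag.
    static Map<String, Map<String, String>> withLanguages() {
        Map<String, String> title = new LinkedHashMap<>();
        title.put("x-default", "Annual Report");
        title.put("en", "Annual Report");
        title.put("de", "Jahresbericht");
        Map<String, Map<String, String>> md = new LinkedHashMap<>();
        md.put("dc:title", title);
        return md;
    }

    // Pick the value for a language, falling back to the default entry.
    static String lookup(Map<String, Map<String, String>> md,
                         String key, String lang) {
        Map<String, String> alts = md.get(key);
        if (alts == null) {
            return null;
        }
        String value = alts.get(lang);
        return value != null ? value : alts.get("x-default");
    }
}
```

Whether the extra nesting is worth its cost is exactly the balance in question:
the flat map stays trivial to use, while the nested one answers "which title
is the German one?" at the price of a more involved API.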

>> My questions:
>> - Any interest in converging on a unified model/approach?
> Certainly.


>> - If yes, where shall we develop this? As part of Tika (although it's
>> still in incubation)? As a separate project (maybe as Apache Commons
>> subproject)? If more than XML Graphics uses this, XML Graphics is
>> probably not the right home.
>> - Is Adobe's XMP toolkit interesting for adoption (!=incubation)? Is
>> the JempBox or XML Graphics Commons approach more interesting?
> If there already exists acceptably licensed good code outside the ASF,
> then I would prefer using that instead of reinventing the wheel within
> the foundation.

I'm not sure we're "re-inventing the wheel" here, Jukka. Tika's Metadata
framework began in Nutch, and at the time, based on a short survey that
Jerome Charron and I undertook, there was no easy-to-use Metadata library or
framework that met the needs of the types of things done in Nutch/Tika:
extraction of metadata from documents in large corpora, supporting many
values for keys, mapping between keys, etc. So, in my mind, we're definitely
not re-inventing any wheel, and the framework was born more out of need/ease
of use than anything else.
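
As a rough illustration of the two needs named above (many values per key,
and mapping between keys), here is a minimal sketch. It is not the Nutch or
Tika code; the class name, the alias table, and the key names are all
hypothetical and chosen only to show the idea of folding format-specific keys
onto a common one while keeping every value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the key names and alias table are invented.
class KeyMappingSketch {

    // Format-specific key -> common key.
    static final Map<String, String> ALIASES = new LinkedHashMap<>();
    static {
        ALIASES.put("Author", "dc:creator");
        ALIASES.put("meta:initial-creator", "dc:creator");
        ALIASES.put("Title", "dc:title");
    }

    // Rewrite keys to their common names, merging values that land on
    // the same key. Keys without an alias are passed through unchanged.
    static Map<String, List<String>> normalize(Map<String, List<String>> raw) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : raw.entrySet()) {
            String key = ALIASES.getOrDefault(e.getKey(), e.getKey());
            out.computeIfAbsent(key, k -> new ArrayList<>())
               .addAll(e.getValue());
        }
        return out;
    }
}
```

With this shape, values extracted under "Author" by one parser and under
"meta:initial-creator" by another would both be reachable under "dc:creator".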

In any case, the use of a common framework is a good topic to discuss and I'm
open to it, so long as people like me can better understand the gaps in the
current Tika Metadata framework and the benefits that addressing those gaps
would bring to all the projects that need it.



> BR,
> Jukka Zitting

Chris Mattmann, Ph.D.
Cognizant Development Engineer
Early Detection Research Network Project
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
Phone:  818-354-8810

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
