tika-dev mailing list archives

From Chris Mattmann <chris.mattm...@jpl.nasa.gov>
Subject Re: Metadata use by Apache Java projects
Date Tue, 20 Nov 2007 17:06:25 GMT
Hi Jeremias,

>> I'm not quite sure I understand how Tika's metadata model isn't flexible
>> enough? Of course, I'm a bit biased, but I'm really trying to understand here
>> and haven't been able to. I think it's important to realize that a balance
>> must be struck between over-bloating a metadata library (and attaching on
>> RDF support, inference, synonym support, etc.) and making sure that the
>> smallest subset of it is actually useful.
> I'm sorry. I didn't intend to stand on anyone's toes.
> At any rate, I'm not talking about full RDF support. I'm talking about
> XMP, which uses only a subset of RDF.

Great, and I wouldn't worry about stepping on anyone's toes. You certainly
didn't step on mine. My point was, at some point, we're just building
libraries on top of libraries on top of...well you get the picture. What I'm
interested in is building the smallest metadata library that's actually
useful and can be built upon to add higher level capabilities, just as Solr
builds on top of Lucene to provide faceted search, etc. Lucene itself
doesn't provide a means for understanding facets/etc., but provides a
library for text/indexing: Solr adds that understanding. Similarly here, I
think it would be great for Tika to provide a library to handle Metadata
representation/access, and then for others to build on top of it to provide
higher level library support (RDF access/etc.).

>> Also, I'd be against moving Metadata support out of Tika because that was
>> one of the project's original goals (Metadata support), and I think it's
>> advantageous for Tika to be a provider for a Metadata capability (of course,
>> one related to document/content extraction).
> Metadata capability in the context of content extraction, certainly yes.
> Nobody disputes that. But other projects have different needs (like
> embedding metadata). So in all this there are certain common needs and
> I'm trying to see if we can find a common ground in the form of a
> uniform way of manipulating and storing metadata in memory while at the
> same time working off a freely available standard.

Yep I get that. I'm all for that. Could you explain what you mean by
"embedding" metadata? Within a document?

>> I'm wondering too what it means that Tika doesn't support "language
>> alternatives"? Do you mean synonyms?
>       <dc:title>
>         <rdf:Alt>
>           <rdf:li xml:lang="x-default">Manual</rdf:li>
>           <rdf:li xml:lang="de">Bedienungsanleitung</rdf:li>
>           <rdf:li xml:lang="fr">Mode d'emploi</rdf:li>
>         </rdf:Alt>
>       </dc:title>

> You can see that the title is available in three languages. The example
> also shows the case with multiple authors.
> To access the title using Adobe's XMP toolkit you'd do the following:
>
> XMPMeta meta = XMPMetaFactory.parse(in);
> String s;
> //Get default title
> s = meta.getLocalizedText(XMPConst.NS_DC, "title", null, XMPConst.X_DEFAULT);
> //Get title in user language if available
> String userLang = System.getProperty("user.language");
> s = meta.getLocalizedText(XMPConst.NS_DC, "title", null, userLang);
>
> Easy, isn't it? :-) That's the generic access to properties as Adobe's
> XMP toolkit provides it. But it can also be useful to have concrete
> adapters for easier use and higher type-safety. Here's what I do in XML
> Graphics Commons at the moment:
>
> Metadata meta = XMPParser.parseXMP(url);
> DublinCoreAdapter dc = DublinCoreSchema.getAdapter(meta);
> String s;
> s = dc.getTitle();
> String userLang = System.getProperty("user.language");
> s = dc.getTitle(userLang);
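
To make the "language alternatives" idea concrete, here is a minimal Java
sketch of the lookup rule implied by the rdf:Alt example: return the text for
the requested language if one exists, otherwise fall back to the "x-default"
entry. The LangAlt class and its methods are hypothetical names made up for
illustration; they are not part of Adobe's XMP toolkit, XML Graphics Commons,
or Tika.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class: one property value with per-language variants.
class LangAlt {
    static final String X_DEFAULT = "x-default";

    // Insertion order is kept so the alternatives stay in document order.
    private final Map<String, String> byLang = new LinkedHashMap<>();

    LangAlt put(String lang, String text) {
        byLang.put(lang, text);
        return this;
    }

    // Returns the text for the requested language, else the x-default text,
    // else null when neither is present.
    String get(String lang) {
        String text = (lang == null) ? null : byLang.get(lang);
        return (text != null) ? text : byLang.get(X_DEFAULT);
    }
}

class LangAltDemo {
    public static void main(String[] args) {
        LangAlt title = new LangAlt()
                .put(LangAlt.X_DEFAULT, "Manual")
                .put("de", "Bedienungsanleitung")
                .put("fr", "Mode d'emploi");

        System.out.println(title.get("de"));
        System.out.println(title.get(System.getProperty("user.language")));
    }
}
```

The fallback to "x-default" is the only policy shown here; a fuller version
might also match on language prefixes such as "de" for "de-AT", which this
sketch deliberately leaves out.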

Great example Jeremias. I think that the same type of thing could be built
into Tika, and Tika currently supports some of the functionality that you
mention above. Instead of meta.getLocalizedText, you could make a call to
Tika like:

/* pseudo code of course */
Metadata meta = new Metadata();
TikaParser p = ParserFactory.createParser();
ContentHandler handler;
p.parse(stream, handler, meta);

String s;

s = meta.getMetadata(DublinCore.TITLE);

/* or if you want back all the titles parsed (if more than one) */
List<String> titles = meta.getAllMetadata(DublinCore.TITLE);

So, then you could build a DublinCoreAdapter on top of Tika's Metadata class.
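
As a rough illustration of that layering, the sketch below puts a typed
adapter on top of a generic multi-valued key/value container. Everything in it
is an assumption for the sake of the example: SimpleMetadata is a hypothetical
stand-in and not Tika's actual Metadata class, and the adapter's method names
and the "title"/"creator" keys are invented here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a generic multi-valued metadata container.
// This is NOT Tika's Metadata class; it only mimics the general idea.
class SimpleMetadata {
    private final Map<String, List<String>> values = new HashMap<>();

    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // First value stored under the name, or null if there is none.
    String get(String name) {
        List<String> all = values.get(name);
        return (all == null || all.isEmpty()) ? null : all.get(0);
    }

    // All values stored under the name, as a read-only list (never null).
    List<String> getAll(String name) {
        List<String> all = values.get(name);
        return (all == null)
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(all);
    }
}

// Hypothetical typed view: callers use named accessors instead of raw keys.
class DublinCoreAdapter {
    static final String TITLE = "title";
    static final String CREATOR = "creator";

    private final SimpleMetadata meta;

    DublinCoreAdapter(SimpleMetadata meta) {
        this.meta = meta;
    }

    String getTitle() { return meta.get(TITLE); }

    List<String> getTitles() { return meta.getAll(TITLE); }

    List<String> getCreators() { return meta.getAll(CREATOR); }
}

class DublinCoreAdapterDemo {
    public static void main(String[] args) {
        SimpleMetadata meta = new SimpleMetadata();
        meta.add(DublinCoreAdapter.TITLE, "Manual");
        meta.add(DublinCoreAdapter.CREATOR, "First Author");
        meta.add(DublinCoreAdapter.CREATOR, "Second Author");

        DublinCoreAdapter dc = new DublinCoreAdapter(meta);
        System.out.println(dc.getTitle());
        System.out.println(dc.getCreators());
    }
}
```

The point of the split is that the container stays small and schema-agnostic,
while schema knowledge (Dublin Core here) lives entirely in the adapter.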

>> Also, you mention it's relatively easy
>> in other libraries to map between different file format metadata. I think
>> that this is fairly easy to do in Tika too, seeing as its primary
>> purpose is to support metadata extraction from different file formats.
> No argument there. I don't claim I know all the requirements and use
> cases of Tika. But I would imagine it's important to preserve as much
> metadata as possible. XMP is certainly one of the best containers I've
> seen to achieve that goal.

Yep exactly. That's one of the key requirements of Tika's Metadata
framework. So yeah, long story short, it would be great to collaborate: I
just want to make sure that there is proper understanding of all the pieces
going forward so we know where there are gaps, and where there are not.


Chris Mattmann, Ph.D.
Cognizant Development Engineer
Early Detection Research Network Project
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
