tika-dev mailing list archives

From Jeremias Maerki <...@jeremias-maerki.ch>
Subject Re: Metadata use by Apache Java projects
Date Wed, 21 Nov 2007 07:52:06 GMT
Hi Chris

On 20.11.2007 18:06:25 Chris Mattmann wrote:
> Hi Jeremias,
> 
> >> I'm not quite sure I understand how Tika's metadata model isn't flexible
> >> enough? Of course, I'm a bit biased, but I'm really trying to understand here
> >> and haven't been able to. I think it's important to realize that a balance
> >> must be struck between over-bloating a metadata library (and attaching on
> >> RDF support, inference, synonym support, etc.) and making sure that the
> >> smallest subset of it is actually useful.
> > 
> > I'm sorry. I didn't intend to stand on anyone's toes.
> > 
> > At any rate, I'm not talking about full RDF support. I'm talking about
> > XMP, which uses only a subset of RDF.
> 
> Great, and I wouldn't worry about stepping on anyone's toes. You certainly
> didn't step on mine. My point was, at some point, we're just building
> libraries on top of libraries on top of...well you get the picture. What I'm
> interested in is building the smallest metadata library that's actually
> useful and can be built upon to add higher level capabilities, just as Solr
> builds on top of Lucene to provide faceted search, etc. Lucene itself
> doesn't provide a means for understanding facets/etc., but provides a
> library for text/indexing: Solr adds that understanding. Similarly here, I
> think it would be great for Tika to provide a library to handle Metadata
> representation/access, and then for others, to build on top of it to provide
> higher level library support (RDF access/etc.).

I think Adobe's XMP toolkit accomplishes exactly that, at least for the
generic part. Every project will certainly have some extra needs: XML
Graphics, for example, needs metadata merging and concrete adapters (like
in my previous example) for easier programming. Other projects might need
other tools, or the same ones. If we find common parts we can put those
in a little metadata library (Commons?!).

You keep saying that Tika should be providing a library to handle
Metadata representation/access. But is Tika really the right container?
Tika's goal is clearly metadata extraction while the requirements for
such a library go a little beyond that focus. I think I'd have a hard
time selling Tika with all its dependencies to the XML Graphics
project for just metadata handling (but not extraction). However, if
that library were a separate product of the Tika project, fine. Then
we only have the problem of Tika being in the Incubator at the moment.
Can we use incubator releases in non-incubator projects? I don't really
know.

> > 
> >> Also, I'd be against moving Metadata support out of Tika because that was
> >> one of the project's original goals (Metadata support), and I think it's
> >> advantageous for Tika to be a provider for a Metadata capability (of course,
> >> one related to document/content extraction).
> > 
> > Metadata capability in the context of content extraction, certainly yes.
> > Nobody disputes that. But other projects have different needs (like
> > embedding metadata). So in all this there are certain common needs and
> > I'm trying to see if we can find a common ground in the form of a
> > uniform way of manipulating and storing metadata in memory while at the
> > same time working off a freely available standard.
> 
> Yep I get that. I'm all for that. Could you explain what you mean by
> "embedding" metadata? Within a document?

Again, an example is probably best: Document production in FOP.
Imagine a workflow where some application generates XML files which are
formatted to PDF by FOP. Besides the actual document content, the XSLT
stylesheet builds up an XMP packet from the XML data; that packet is
embedded in the fo:declarations element of the resulting XSL-FO document.
The PDFs are generated with the PDF/A-1b profile for long-term storage.
The PDFs go into a searchable archive, so metadata, especially
application-specific metadata (for example, patent bibliographic data
like a subset of ST.36 from WIPO for patent documents), needs to be
provided. During formatting, FOP needs to add its own metadata
(production time of the document, PDF producer, required PDF/A
indicators). That's where I do the merging: the XMP packet from XSL-FO
gets merged with a packet generated by FOP. The end result is an XMP
document that will be embedded in the PDF file.
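The merging step can be sketched with plain string maps. This is an
illustration only: real XMP packets hold typed, namespaced properties,
and Adobe's toolkit works on XMPMeta objects rather than maps. The
property names below are just examples of the kind of data involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: XMP packet merging reduced to string maps.
public class XmpMergeSketch {

    /** Entries from the FOP-generated packet win over the document packet. */
    public static Map<String, String> merge(Map<String, String> docPacket,
                                            Map<String, String> fopPacket) {
        Map<String, String> merged = new LinkedHashMap<>(docPacket);
        merged.putAll(fopPacket); // FOP's producer/date entries take precedence
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("dc:title", "Patent Specification");   // from fo:declarations
        Map<String, String> fop = new LinkedHashMap<>();
        fop.put("pdf:Producer", "Apache FOP");         // added while formatting
        fop.put("xmp:CreateDate", "2007-11-21T07:52:06Z");
        System.out.println(merge(doc, fop));
    }
}
```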

> > 
> >> I'm wondering too what it means that Tika doesn't support "language
> >> alternatives"? Do you mean synonyms?
> > 
> [..snip..]
> >       <dc:title>
> >         <rdf:Alt>
> >           <rdf:li xml:lang="x-default">Manual</rdf:li>
> >           <rdf:li xml:lang="de">Bedienungsanleitung</rdf:li>
> >           <rdf:li xml:lang="fr">Mode d'emploi</rdf:li>
> >         </rdf:Alt>
> >       </dc:title>
> [..snip..]
> 
> > 
> > You can see that the title is available in three languages. The example
> > also shows the case with multiple authors.
> > 
> > To access the title using Adobe's XMP toolkit you'd do the following:
> > 
> > XMPMeta meta = XMPMetaFactory.parse(in);
> > String s;
> > 
> > //Get default title
> > s = meta.getLocalizedText(XMPConst.NS_DC, "title", null, XMPConst.X_DEFAULT);
> > 
> > //Get title in user language if available
> > String userLang = System.getProperty("user.language");
> > s = meta.getLocalizedText(XMPConst.NS_DC, "title", null, userLang);
> > 
> > Easy, isn't it? :-) That's the generic access to properties as Adobe's
> > XMP toolkit provides it. But it can also be useful to have concrete
> > adapters for easier use and higher type-safety. Here's what I do in XML
> > Graphics Commons at the moment:
> > 
> > Metadata meta = XMPParser.parseXMP(url);
> > DublinCoreAdapter dc = DublinCoreSchema.getAdapter(meta);
> > String s;
> > s = dc.getTitle();
> > String userLang = System.getProperty("user.language");
> > s = dc.getTitle(userLang);
> 
> Great example Jeremias. I think that the same type of thing could be built
> into Tika, and Tika currently supports some of the functionality that you
> mention above. Instead of meta.getLocalizedText, you could make a call to
> Tika like:
> 
> /* pseudo code of course */
> Metadata meta = new Metadata();
> TikaParser p = ParserFactory.createParser();
> ContentHandler handler;
> p.parse(stream, handler, meta);
> 
> String s;
> 
> s = meta.getMetadata(DublinCore.TITLE);
> 
> /* or if you want back all the titles parsed (if more than one) */
> List<String> titles = meta.getAllMetadata(DublinCore.TITLE);

Ah, so you do get multiple titles, but you probably still lose the
information about which title is in which language, right?
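What the rdf:Alt structure preserves can be sketched in a few lines;
the class and method names below are made up for illustration, and
neither Tika nor the XMP toolkit defines them this way:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of an XMP-style language alternative (rdf:Alt with xml:lang
// qualifiers): each value keeps its language tag, with "x-default"
// serving as the fallback. Names are hypothetical.
public class LangAlt {
    private final Map<String, String> values = new LinkedHashMap<>();

    /** Store a value under its xml:lang key ("x-default" for the fallback). */
    public void add(String lang, String value) {
        values.put(lang, value);
    }

    /** Look up by language, falling back to x-default when absent. */
    public String get(String lang) {
        String v = values.get(lang);
        return (v != null) ? v : values.get("x-default");
    }

    public static void main(String[] args) {
        LangAlt title = new LangAlt();
        title.add("x-default", "Manual");
        title.add("de", "Bedienungsanleitung");
        title.add("fr", "Mode d'emploi");
        System.out.println(title.get("de"));
        System.out.println(title.get("it")); // falls back to x-default
    }
}
```

A flat list of titles drops the xml:lang key, which is exactly the
information the fallback lookup needs.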

> So, then you could build a DublinCoreAdapter on top of Tika's Metadata class
> too.
> 
> >> Also, you mention it's relatively easy
> >> in other libraries to map between different file format metadata. I think
> >> that this is fairly easy to do in Tika too, seeing as though its primary
> >> purpose is support metadata extraction from different file formats.
> > 
> > No argument there. I don't claim I know all the requirements and use
> > cases of Tika. But I would imagine it's important to preserve as much
> > metadata as possible. XMP is certainly one of the best containers I've
> > seen to achieve that goal.
> 
> Yep exactly. That's one of the key requirements of Tika's Metadata
> framework. So yeah, long story short, it would be great to collaborate: I
> just want to make sure that there is proper understanding of all the pieces
> going forward so we know where there are gaps, and where there are not.

Me happy!

Jeremias Maerki

