tika-dev mailing list archives

From "Antoni Mylka" <antoni.my...@gmail.com>
Subject Re: Metadata use by Apache Java projects
Date Tue, 20 Nov 2007 14:25:44 GMT
Hi Jeremias, tika-dev

My name is Antoni Mylka. I am involved in Aperture
(aperture.sourceforge.net), which addresses problems similar to Tika's;
we got your mail via the tika-dev mailing list. I also work for the
Nepomuk Social Semantic Desktop project, where I maintain the Nepomuk
Information Element Ontology (NIE). More below.

Your mail addresses four more-or-less orthogonal issues.

1. The standardization of schemas: how the metadata should be
represented, i.e. the URIs of classes and properties.

2. The standardization of the representational language. This means the
conventions about how to use RDF (e.g. Bags, Seqs, Alts etc.) and the
formal semantics.

3. The standardization of the API that will work with the RDF triples
and handle operations such as adding, deleting and querying triples
(and maybe inference).

4. The standardization of the RDF storage mechanisms.

XMP provides its own answers to all of these questions, but they aren't
the only ones. I know of at least two other such standardization
initiatives:

1. The Freedesktop.org XESAM project. A gathering of the major
open-source desktop search engines.
http://xesam.org/main

2. The Nepomuk Social Semantic Desktop project. An EU-funded research
project with a Semantic Web background.
http://nepomuk.semanticdesktop.org

Many of the issues you are bound to run into have already been
recognized and some answers have been given. Naturally, the requirements
may have been different and the solutions aren't optimal, but it may
be interesting for you to skim through the output of those projects. To
sum it up:

1.
Freedesktop.org schema:
<http://xesam.org/main/XesamOntology90>

Nepomuk schema: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie/>
Let the pointers take you from there.
There is also an archive of the discussions around the drafts of NIE
(there have been 8 so far).
<http://dev.nepomuk.semanticdesktop.org/query?status=new&status=assigned&status=reopened&status=closed&component=ontology-nie&order=priority>

2.
Freedesktop doesn't use any specific representational language, but it
supports property inheritance. They implement it themselves, without
any general-purpose RDF inference.
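
To illustrate what hand-rolled property inheritance can look like
without a general inference engine, here is a minimal Java sketch. It is
not Xesam code: the ex: property names, the parent map and the class
name are all made up for the example, and a real implementation would
read the hierarchy from the ontology.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: property inheritance by walking a parent map. */
final class PropertyInheritanceSketch {

    /** One property/value pair attached to a resource (illustrative type). */
    record Entry(String property, String value) {}

    /** True if 'property' is 'ancestor' or inherits from it via parentOf. */
    static boolean inheritsFrom(String property, String ancestor,
                                Map<String, String> parentOf) {
        String current = property;
        int steps = 0;
        // The step limit guards against accidental cycles in the parent map.
        while (current != null && steps++ <= parentOf.size()) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = parentOf.get(current);
        }
        return false;
    }

    /** Values of 'requested' and of every property that inherits from it. */
    static List<String> valuesOf(String requested, List<Entry> entries,
                                 Map<String, String> parentOf) {
        List<String> out = new ArrayList<>();
        for (Entry e : entries) {
            if (inheritsFrom(e.property(), requested, parentOf)) {
                out.add(e.value());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed hierarchy: ex:author < ex:contributor < ex:relatedAgent
        Map<String, String> parentOf = Map.of(
                "ex:author", "ex:contributor",
                "ex:contributor", "ex:relatedAgent");
        List<Entry> entries = List.of(
                new Entry("ex:author", "Antoni"),
                new Entry("ex:title", "Notes"));
        // Asking for the parent property also finds the ex:author value.
        System.out.println(valuesOf("ex:contributor", entries, parentOf)); // [Antoni]
    }
}
```

The point is only that a lookup like this covers inheritance between
properties and nothing else; class hierarchies, ranges and so on would
each need their own special-case code.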

Nepomuk uses the Nepomuk Representational Language (NRL). We consider it
better for our purposes because its semantics are more intuitive (the
so-called closed-world assumption). In normal RDF, if you say that the
value of the nie:kisses property is a Human and then write "Antoni
nie:kisses Frog", you can infer that the frog is a human. In NRL you
can't.
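
The difference can be shown with a toy Java sketch. It is only an
illustration of the two readings of a range declaration, not an
implementation of RDFS or NRL; the class and method names are
hypothetical, and how NRL treats the case in detail is defined by its
specification.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy illustration: a range declaration read in two different ways. */
final class RangeSemanticsSketch {

    /** A single subject-predicate-object statement (illustrative type). */
    record Triple(String subject, String predicate, String object) {}

    /**
     * Open-world reading (RDFS-style): the range of the predicate is
     * added to the known types of the object as an inferred type.
     */
    static Set<String> inferTypesOpenWorld(Triple t,
                                           Map<String, String> ranges,
                                           Map<String, Set<String>> types) {
        Set<String> result = new HashSet<>(types.getOrDefault(t.object(), Set.of()));
        String range = ranges.get(t.predicate());
        if (range != null) {
            result.add(range);
        }
        return result;
    }

    /**
     * Closed-world reading: the range is a constraint. The statement is
     * accepted only if the object is already known to have that type.
     */
    static boolean isValidClosedWorld(Triple t,
                                      Map<String, String> ranges,
                                      Map<String, Set<String>> types) {
        String range = ranges.get(t.predicate());
        if (range == null) {
            return true; // no range declared, nothing to check
        }
        return types.getOrDefault(t.object(), Set.of()).contains(range);
    }

    public static void main(String[] args) {
        Map<String, String> ranges = Map.of("nie:kisses", "Human");
        Map<String, Set<String>> types = Map.of(
                "Antoni", Set.of("Human"),
                "Frog", Set.of("Amphibian"));
        Triple t = new Triple("Antoni", "nie:kisses", "Frog");

        // Open world: the frog is inferred to be a Human.
        System.out.println(inferTypesOpenWorld(t, ranges, types).contains("Human")); // true
        // Closed world: the statement is flagged instead, nothing is inferred.
        System.out.println(isValidClosedWorld(t, ranges, types)); // false
    }
}
```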

3.
No one has tried to standardize the API. There are many libraries that
work with both in-memory and persistent RDF repositories.

A few pointers:

There are many APIs out there:
* jena.sourceforge.net - a large RDF API by HP
* www.openrdf.org - an RDF API optimized for client/server setups
* http://wiki.ontoworld.org/wiki/RDF2Go - an abstraction API over the above

There are also many tools that generate "schema-specific adapters". The
well-known ones in Java are:
* http://wiki.ontoworld.org/wiki/RDFReactor
* elmo
** http://www.openrdf.net/doc/elmo/1.0/user-guide/index.html
** http://sourceforge.net/project/showfiles.php?group_id=46509&package_id=157314
* https://sommer.dev.java.net/

Of the above, elmo is quite stable and advanced.

There are murmurs of standardization of RDF APIs.
Max Völkel (FZI, maintainer of RDF2Go), Henry Story (www.bblfish.net),
and Leo Sauermann (DFKI, http://leobard.twoday.net) have repeatedly
thought about starting a JSR discussion on an RDF API, but that never
happened. The W3C may be interested in doing something like this (I
think they did it for DOM and for XML). The people to contact would be
the deployment group:
http://www.w3.org/2006/07/SWD/

So, to sum it up:
There are many things out there handling RDF in Java, but nothing
dominates yet. In my surroundings (my company,
aperture.sourceforge.net) we prefer to use RDF2Go as "the API". It's not
perfect, but it seems to work quite well.
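
To make the scope of point 3 concrete, here is a rough Java sketch of
the three basic operations (adding, deleting and querying triples). It
deliberately does not use the API of Jena, Sesame or RDF2Go; the class
and method names are hypothetical, and a real library adds typed nodes,
contexts, persistence and much more.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory triple store, for illustration only. */
final class TinyTripleStore {

    /** A subject-predicate-object statement with plain string terms. */
    record Triple(String subject, String predicate, String object) {}

    private final Set<Triple> triples = new LinkedHashSet<>();

    /** Adds a triple; returns false if it was already present. */
    boolean add(String s, String p, String o) {
        return triples.add(new Triple(s, p, o));
    }

    /** Removes a triple; returns false if it was not present. */
    boolean remove(String s, String p, String o) {
        return triples.remove(new Triple(s, p, o));
    }

    /** Pattern query: a null argument acts as a wildcard. */
    List<Triple> find(String s, String p, String o) {
        List<Triple> out = new ArrayList<>();
        for (Triple t : triples) {
            if ((s == null || s.equals(t.subject()))
                    && (p == null || p.equals(t.predicate()))
                    && (o == null || o.equals(t.object()))) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TinyTripleStore store = new TinyTripleStore();
        store.add("doc:1", "dc:creator", "Antoni");
        store.add("doc:1", "dc:creator", "Jeremias");
        store.add("doc:1", "dc:title", "Metadata notes");

        System.out.println(store.find("doc:1", "dc:creator", null).size()); // 2
        store.remove("doc:1", "dc:creator", "Jeremias");
        System.out.println(store.find(null, "dc:creator", null).size()); // 1
    }
}
```

Note that two dc:creator values sit side by side here without any
special handling, which is one of the things a triple model gives you
for free.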

4.
XMP prescribes that the metadata be contained within the files
themselves. There are many scenarios where this is a limitation. Each
application will have to maintain its indexes by itself and possibly use
a different API to work with XMP storage (in the files) and the common
storage (e.g. an index). There are ongoing efforts to combine the
flexibility of RDF with the search capabilities of Lucene. Two of the
more prominent ones are:

Sesame Lucene Sail
<https://src.aduna-software.org/svn/org.openrdf/projects/sesame2-contrib/openrdf-sail-contrib/openrdf-lucenesail/>
AFAIK there is no project page yet, but this idea has been worked on for
at least two years now, e.g. in the gnowsis project (www.gnowsis.org).

Boca TextIndexing feature
Part of the IBM SLRP
<http://ibm-slrp.sourceforge.net/wiki/index.php?title=BocaTextIndexing>

In our opinion, such an initiative deserves at least a separate mailing
list. We have been working on metadata standardization for some time now
and would be happy to help. Chris Mattmann has written that it's
necessary to strike a balance between functionality and bloat.
From my own experience I can say that it is VERY difficult :).

Antoni Mylka
antoni.mylka@gmail.com

On Nov 19, 2007 10:26 AM, Jeremias Maerki <dev@jeremias-maerki.ch> wrote:
> (I realize this is heavy cross-posting but it's probably the best way to
> reach all the players I want to address.)
>
> As you may know, I've started developing an XMP metadata package inside
> XML Graphics Commons in order to support XMP metadata (and ultimately
> PDF/A) in Apache FOP. Therefore, I have quite an interest in metadata.
>
> What is XMP? XMP, for those who don't know about it, is based on a
> subset of RDF to provide a flexible and extensible way of
> storing/representing document metadata.
>
> Yesterday, I was surprised to discover that Adobe has published an XMP
> Toolkit with Java support under the BSD license. In contrast to my
> effort, Adobe's toolkit is quite complete if maybe a bit more
> complicated to use. That got me thinking:
>
> Every project I'm sending this message to is using document metadata in
> some form:
> - Apache XML Graphics: embeds document metadata in the generated files
> (just FOP at the moment, but Batik is a similar candidate)
> - Tika (in incubation): has as one of its main purposes the extraction
> of metadata
> - Sanselan (in incubation): extracts and embeds metadata from/in bitmap
> images
> - PDFBox (incubation in discussion): extracts and embeds XMP metadata
> from/in PDF files (see also JempBox)
>
> Every one of these projects has its own means to represent metadata in
> memory. Wouldn't it make sense to have a common approach? I've worked
> with XMP for some time now and I can say it's ideal to work with. It
> also defines guidelines to embed XMP metadata in various file formats.
> It's also relatively easy to map metadata between different file formats
> (Dublin Core, EXIF, PDF Info etc.).
>
> Sanselan and Tika have both chosen a very simple approach but is it
> versatile enough for the future? While the simple Map<String, String[]> in
> Tika allows for multiple authors, for example, it doesn't support
> language alternatives for things such as dc:title or dc:description.
>
> I'm seriously thinking about abandoning most of my XMP package work in
> XML Graphics Commons in favor of Adobe's XMP Toolkit. What it doesn't
> support, though:
> - Metadata merging functionality (which I need for synchronizing the PDF
> Info object and the XMP packet for PDF/A)
> - Schema-specific adapters (for Dublin Core and many other XMP Schemas) for
> easier programming (which both Ben and I have written for JempBox and
> XML Graphics Commons). Adobe's toolkit only allows generic access.
>
> Some links:
> Adobe XMP website: http://www.adobe.com/products/xmp/
> Adobe XMP Toolkit: http://www.adobe.com/devnet/xmp/
> JempBox: http://sourceforge.net/projects/jempbox
> Apache XML Graphics Commons:
>   http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/xmp/
>
> My questions:
> - Any interest in converging on a unified model/approach?
> - If yes, where shall we develop this? As part of Tika (although it's
> still in incubation)? As a separate project (maybe as Apache Commons
> subproject)? If more than XML Graphics uses this, XML Graphics is
> probably not the right home.
> - Is Adobe's XMP toolkit interesting for adoption (!=incubation)? Is
> the JempBox or XML Graphics Commons approach more interesting?
> - Where's the best place to discuss this? We can't keep posting to
> several mailing lists.
>
> At any rate, I would volunteer to spearhead this effort, especially
> since I have immediate need to have complete XMP functionality. I've
> almost finished mapping all XMP structures in XG Commons but I haven't
> committed my latest changes (for structured properties) and I may still
> not cover all details of XMP.
>
> Thanks for reading this far,
> Jeremias Maerki
>
>



-- 
Antoni Myłka
antoni.mylka@gmail.com