tika-dev mailing list archives

From Jeremias Maerki <...@jeremias-maerki.ch>
Subject Re: Metadata use by Apache Java projects
Date Wed, 21 Nov 2007 07:28:09 GMT
Hi Antoni

Thanks for the interesting information. Frankly, you've scared me there
just a bit. It's interesting to see that such encompassing
efforts are underway in some places. To me, full RDF still has a scare
factor. At least the subset XMP provides is "manageable" for mere
mortals. :-) At least, that's my impression. Maybe I still just know too
little about RDF. IMO, XMP finds a good compromise between
expressiveness and simplicity. The positive points for Adobe's XMP
toolkit: it is in Java, available now and under a license we can easily
use in Apache projects.

In your point 4, you mention some restrictions you see for XMP. But XMP
is a subset of RDF, so does XMP really restrict you from an RDF point of
view? I didn't really understand that point.

We'll see how this works out.

Jeremias Maerki

On 20.11.2007 15:25:44 Antoni Mylka wrote:
> Hi Jeremias, tika-dev
> My name is Antoni Mylka. I am involved in aperture.sourceforge.net,
> which addresses problems similar to Tika's; we got your mail on the
> tika-dev mailing list. I also work for the Nepomuk Social Semantic
> Desktop project, where I maintain the Nepomuk Information Element
> Ontology. More below.
> Your mail addresses four more-or-less orthogonal issues.
> 1. The standardization of schemas, how the metadata should be
> represented i.e. URIs of classes and properties.
> 2. The standardization of the representational language. This means the
> conventions about how to use RDF (e.g. Bags, Seqs, Alts etc.) and the
> formal semantics.
> 3. The standardization of the API that will work with the RDF triples
> and handle operations such as adding, deleting and querying triples.
> (And maybe the inference).
> 4. The standardization of the RDF storage mechanisms.
> XMP provides its answers to all these questions, but they aren't the only
> ones. I know of at least two such standardization initiatives:
> 1. The Freedesktop.org XESAM project, a gathering of the major
> open-source desktop search engines
> http://xesam.org/main
> 2. The Nepomuk Social Semantic Desktop Project, an EU-funded research
> project with a Semantic Web background
> http://nepomuk.semanticdesktop.org
> Many of the issues you are bound to run into have already been
> recognized and some answers have been given. Naturally, the requirements
> may have been different and the solutions aren't optimal, but it may
> be interesting for you to skim through the output of those projects. To
> sum it up:
> 1.
> Freedesktop.org schema:
> <http://xesam.org/main/XesamOntology90>
> Nepomuk schema: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie/>
> Let the pointers take you from there.
> There is also an archive of discussions around the drafts of NIE (there
> have been 8 so far).
> <http://dev.nepomuk.semanticdesktop.org/query?status=new&status=assigned&status=reopened&status=closed&component=ontology-nie&order=priority>
> 2.
> Freedesktop doesn't use any specific representational language, but it
> supports property inheritance, implemented by the project itself without
> any general-purpose RDF inference.
> Nepomuk uses the Nepomuk Representational Language (NRL). It has been
> considered better for our purposes, since it employs more intuitive
> semantics (the so-called closed-world assumption: in normal RDF, if you say
> that the value of the nie:kisses property is a Human, and you write Antoni
> nie:kisses Frog, you can infer that the frog is a human; in NRL you can't).
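
[Editorial illustration] The frog example above can be sketched in a few lines of Java. This is a minimal sketch and not how any RDF or NRL engine is actually implemented: the class `RangeSemanticsSketch`, the `Triple` record, and both method names are hypothetical, and the closed-world check is only one plausible reading of "you can't infer it" (the range is treated as a constraint to verify). It assumes Java 16 or later for records.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class RangeSemanticsSketch {
    /** A minimal subject-predicate-object statement (hypothetical helper type). */
    public record Triple(String subject, String predicate, String object) {}

    static final String TYPE = "rdf:type";
    static final String RANGE = "rdfs:range";

    /** Open-world reading: a range declaration lets us infer a type for every object. */
    public static Set<Triple> inferTypesFromRange(Set<Triple> graph) {
        Set<Triple> inferred = new LinkedHashSet<>();
        for (Triple schema : graph) {
            if (!schema.predicate().equals(RANGE)) continue;
            for (Triple t : graph) {
                if (t.predicate().equals(schema.subject())) {
                    inferred.add(new Triple(t.object(), TYPE, schema.object()));
                }
            }
        }
        return inferred;
    }

    /** Closed-world reading (assumed): objects lacking the stated type are violations. */
    public static Set<Triple> findRangeViolations(Set<Triple> graph) {
        Set<Triple> violations = new LinkedHashSet<>();
        for (Triple schema : graph) {
            if (!schema.predicate().equals(RANGE)) continue;
            for (Triple t : graph) {
                boolean usesProperty = t.predicate().equals(schema.subject());
                boolean typed = graph.contains(new Triple(t.object(), TYPE, schema.object()));
                if (usesProperty && !typed) {
                    violations.add(t);
                }
            }
        }
        return violations;
    }

    /** The two statements from the example: a range declaration and one use of the property. */
    public static Set<Triple> exampleGraph() {
        Set<Triple> graph = new LinkedHashSet<>();
        graph.add(new Triple("nie:kisses", RANGE, "Human"));
        graph.add(new Triple("Antoni", "nie:kisses", "Frog"));
        return graph;
    }

    public static void main(String[] args) {
        Set<Triple> graph = exampleGraph();
        System.out.println("Open world infers:    " + inferTypesFromRange(graph));
        System.out.println("Closed world rejects: " + findRangeViolations(graph));
    }
}
```

Under the open-world reading the graph gains a statement typing the frog as a Human; under the closed-world reading the same data is flagged as inconsistent with the schema instead.
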
> 3.
> No one has tried to standardize the API. There are many libraries that work
> with both in-memory and persistent RDF repositories.
> A few pointers to the APIs out there:
> * jena.sourceforge.net - big api for rdf by HP
> * www.openrdf.org - rdf api optimized for client/server setups
> * http://wiki.ontoworld.org/wiki/RDF2Go - Abstraction api of above
> There are many APIs generating "Schema Specific Adapters". The
> well-known ones in Java are:
> * http://wiki.ontoworld.org/wiki/RDFReactor
> * elmo
> ** http://www.openrdf.net/doc/elmo/1.0/user-guide/index.html
> ** http://sourceforge.net/project/showfiles.php?group_id=46509&package_id=157314
> * https://sommer.dev.java.net/
> Of the above, elmo is quite stable and advanced.
> There are murmurs of standardization of RDF APIs.
> Max Völkel (FZI, maintainer of RDF2Go), Henry Story (www.bblfish.net),
> and Leo Sauermann (DFKI, http://leobard.twoday.net) have repeatedly thought
> about starting a JSR discussion on an RDF API, but that never happened.
> The W3C may be interested in doing something like this (they did it for DOM,
> I think, and for XML, didn't they?). The contact people would be the deployment group:
> http://www.w3.org/2006/07/SWD/
> So, to sum it up:
> There are many things out there handling RDF in Java, but nothing
> dominates yet. In my surroundings (my company,
> aperture.sourceforge.net) we prefer to use RDF2Go as "the API". It's not
> perfect, but it seems to work quite well.
> 4.
> XMP prescribes that the metadata be contained within the files
> themselves. There are many scenarios where this is a limitation. Each
> application will have to maintain its indexes by itself and possibly use
> a different API to work with XMP storage (in the files) and the common
> storage (e.g. an index). There are ongoing efforts to combine the
> flexibility of RDF with the search capabilities of Lucene. Two of the
> more prominent ones are:
> Sesame Lucene Sail
> <https://src.aduna-software.org/svn/org.openrdf/projects/sesame2-contrib/openrdf-sail-contrib/openrdf-lucenesail/>
> AFAIK there is no project page yet, but this idea has been worked on for
> at least two years now, e.g. in the gnowsis project
> www.gnowsis.org
> Boca TextIndexing feature
> Part of the IBM SLRP
> <http://ibm-slrp.sourceforge.net/wiki/index.php?title=BocaTextIndexing>
> In our opinion, such an initiative deserves at least a separate mailing
> list. We have already been working on metadata standardization for some
> time now and would be happy to help. Chris Mattman has written that it's
> necessary to strike a balance between functionality and over-bloating.
> From my own experience I can say that it is VERY difficult :).
> Antoni Mylka
> antoni.mylka@gmail.com
> On Nov 19, 2007 10:26 AM, Jeremias Maerki <dev@jeremias-maerki.ch> wrote:
> > (I realize this is heavy cross-posting but it's probably the best way to
> > reach all the players I want to address.)
> >
> > As you may know, I've started developing an XMP metadata package inside
> > XML Graphics Commons in order to support XMP metadata (and ultimately
> > PDF/A) in Apache FOP. Therefore, I have quite an interest in metadata.
> >
> > What is XMP? XMP, for those who don't know about it, is based on a
> > subset of RDF to provide a flexible and extensible way of
> > storing/representing document metadata.
> >
> > Yesterday, I was surprised to discover that Adobe has published an XMP
> > Toolkit with Java support under the BSD license. In contrast to my
> > effort, Adobe's toolkit is quite complete if maybe a bit more
> > complicated to use. That got me thinking:
> >
> > Every project I'm sending this message to is using document metadata in
> > some form:
> > - Apache XML Graphics: embeds document metadata in the generated files
> > (just FOP at the moment, but Batik is a similar candidate)
> > - Tika (in incubation): has as one of its main purposes the extraction
> > of metadata
> > - Sanselan (in incubation): extracts and embeds metadata from/in bitmap
> > images
> > - PDFBox (incubation in discussion): extracts and embeds XMP metadata
> > from/in PDF files (see also JempBox)
> >
> > Every one of these projects has its own means to represent metadata in
> > memory. Wouldn't it make sense to have a common approach? I've worked
> > with XMP for some time now and I can say it's ideal to work with. It
> > also defines guidelines to embed XMP metadata in various file formats.
> > It's also relatively easy to map metadata between different file formats
> > (Dublin Core, EXIF, PDF Info etc.).
> >
> > Sanselan and Tika have both chosen a very simple approach but is it
> > versatile enough for the future? While the simple Map<String, String[]> in
> > Tika allows for multiple authors, for example, it doesn't support
> > language alternatives for things such as dc:title or dc:description.
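
[Editorial illustration] The limitation described above can be made concrete with a small Java sketch. The `LangAlt` class below is a hypothetical stand-in, not the API of Tika, Adobe's toolkit, JempBox, or XML Graphics Commons; the only detail taken from XMP itself is the "x-default" language tag used as the fallback entry. The sample authors and titles are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LangAltSketch {
    /** Hypothetical stand-in for a language alternative: one value per language tag. */
    public static final class LangAlt {
        private final Map<String, String> byLanguage = new LinkedHashMap<>();

        public LangAlt put(String languageTag, String value) {
            byLanguage.put(languageTag, value);
            return this;
        }

        /** Returns the value for the requested language, else "x-default", else null. */
        public String get(String languageTag) {
            String value = byLanguage.get(languageTag);
            return value != null ? value : byLanguage.get("x-default");
        }
    }

    public static void main(String[] args) {
        // Flat model: several authors fit into the array, but two title strings
        // in the same array carry no indication of which language each one is in.
        Map<String, String[]> flat = new LinkedHashMap<>();
        flat.put("dc:creator", new String[] {"Author One", "Author Two"});
        flat.put("dc:title", new String[] {"User Guide", "Benutzerhandbuch"});
        System.out.println(flat.get("dc:title").length + " titles, languages unknown");

        // Language alternative: each value is keyed by its language tag.
        LangAlt title = new LangAlt()
                .put("x-default", "User Guide")
                .put("en", "User Guide")
                .put("de", "Benutzerhandbuch");
        System.out.println(title.get("de")); // German value
        System.out.println(title.get("fr")); // no French entry, falls back to x-default
    }
}
```

The flat map would need an ad hoc convention (for example, encoding the language into the key or the value) to express the same thing, which is the kind of structure a richer metadata model provides directly.
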
> >
> > I'm seriously thinking about abandoning most of my XMP package work in
> > XML Graphics Commons in favor of Adobe's XMP Toolkit. What it doesn't
> > support, though:
> > - Metadata merging functionality (which I need for synchronizing the PDF
> > Info object and the XMP packet for PDF/A)
> > - Schema-specific adapters (for Dublin Core and many other XMP Schemas) for
> > easier programming (which both Ben and I have written for JempBox and
> > XML Graphics Commons). Adobe's toolkit only allows generic access.
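
[Editorial illustration] As a rough idea of what "metadata merging" between a PDF Info object and an XMP packet could involve, here is a minimal Java sketch. It is not taken from any of the libraries mentioned: the class, the `merge` method, and the "existing XMP values win" policy are assumptions made for illustration, the key mapping covers only three entries, and both sides are flattened to plain strings even though real XMP properties such as dc:title and dc:creator are structured.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataMergeSketch {
    /** Illustrative, incomplete mapping from PDF Info keys to XMP property names. */
    static final Map<String, String> INFO_TO_XMP = new LinkedHashMap<>();
    static {
        INFO_TO_XMP.put("Title", "dc:title");
        INFO_TO_XMP.put("Author", "dc:creator");
        INFO_TO_XMP.put("Producer", "pdf:Producer");
    }

    /**
     * Returns a copy of xmp with mapped Info entries filled in.
     * Assumed policy: a value already present in xmp is kept; unmapped Info keys are ignored.
     */
    public static Map<String, String> merge(Map<String, String> info, Map<String, String> xmp) {
        Map<String, String> merged = new LinkedHashMap<>(xmp);
        for (Map.Entry<String, String> entry : info.entrySet()) {
            String xmpName = INFO_TO_XMP.get(entry.getKey());
            if (xmpName != null) {
                merged.putIfAbsent(xmpName, entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> info = new LinkedHashMap<>();
        info.put("Title", "Title from Info");
        info.put("Author", "Jane Doe");

        Map<String, String> xmp = new LinkedHashMap<>();
        xmp.put("dc:title", "Title from XMP");

        // dc:title keeps the XMP value; dc:creator is filled in from the Info entry.
        System.out.println(merge(info, xmp));
    }
}
```

A real synchronization would also have to decide what happens when the two sides disagree, which is exactly the kind of policy a shared metadata package would need to define.
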
> >
> > Some links:
> > Adobe XMP website: http://www.adobe.com/products/xmp/
> > Adobe XMP Toolkit: http://www.adobe.com/devnet/xmp/
> > JempBox: http://sourceforge.net/projects/jempbox
> > Apache XML Graphics Commons:
> >   http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/xmp/
> >
> > My questions:
> > - Any interest in converging on a unified model/approach?
> > - If yes, where shall we develop this? As part of Tika (although it's
> > still in incubation)? As a separate project (maybe as an Apache Commons
> > subproject)? If more than XML Graphics uses this, XML Graphics is
> > probably not the right home.
> > - Is Adobe's XMP toolkit interesting for adoption (!=incubation)? Is
> > the JempBox or XML Graphics Commons approach more interesting?
> > - Where's the best place to discuss this? We can't keep posting to
> > several mailing lists.
> >
> > At any rate, I would volunteer to spearhead this effort, especially
> > since I have an immediate need for complete XMP functionality. I've
> > almost finished mapping all XMP structures in XG Commons but I haven't
> > committed my latest changes (for structured properties) and I may still
> > not cover all details of XMP.
> >
> > Thanks for reading this far,
> > Jeremias Maerki
> >
> >
> -- 
> Antoni Myłka
> antoni.mylka@gmail.com
