tika-dev mailing list archives

From Nick Burch <nick.bu...@alfresco.com>
Subject A plan to improve the metadata property definitions
Date Wed, 16 May 2012 15:50:47 GMT
Hi All

I've just been brainstorming with Ray Gauss, and we think we've come up 
with a way to move towards cleaner and clearer metadata property 
definitions (prefixes, properties with types etc), whilst maintaining 
backwards compatibility and avoiding too much work for parsers during 
the migration. It'll hopefully also help with the larger plan of 
improving the metadata, and make life easier for people working on that.

I'll use DublinCore as an example, but it's not the only one this'll 
apply to.

Today, we have all the keys from DublinCore imported onto the Metadata
object, and the parsers all call eg Metadata.DESCRIPTION rather than
DublinCore.DESCRIPTION. This is a string key, not a property, so it
carries no information about type etc, and it's a raw key of
"description", so people outside of the Java space (eg tika-cli users)
can't tell which definition it comes from.

What I think we'd really like is for that to be a property, with type, 
with a key that includes our chosen prefix (so that tika-cli users etc 
know what it is), that doesn't break backwards compatibility until 2.0.
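
As a rough illustration of the kind of thing we mean, here's a minimal
sketch of a property that carries a type and a prefixed key. The class
name, the "dc" prefix and the ":" separator are all assumptions made up
for this example, not something that has been decided.

```java
// Hypothetical sketch only: TypedProperty is not an existing Tika class.
final class TypedProperty {
    enum ValueType { TEXT, DATE, INTEGER }

    private final String prefix;     // assumed prefix, eg "dc"
    private final String localName;  // eg "description"
    private final ValueType type;

    TypedProperty(String prefix, String localName, ValueType type) {
        this.prefix = prefix;
        this.localName = localName;
        this.type = type;
    }

    // The key a non-Java consumer (eg a tika-cli user) would see.
    String getKey() {
        return prefix + ":" + localName;
    }

    ValueType getType() {
        return type;
    }
}

final class TypedPropertyDemo {
    public static void main(String[] args) {
        TypedProperty description =
                new TypedProperty("dc", "description", TypedProperty.ValueType.TEXT);
        // The key now says where the definition comes from, and the type is known.
        System.out.println(description.getKey() + " (" + description.getType() + ")");
    }
}
```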

Additionally, we want to identify which properties are the common ones
that all parsers should be mapping their metadata onto (eg everything
should map the metadata that corresponds roughly to what Dublin Core
defines Description to be, no matter what the format calls it), in
addition to any format specific ones (which only advanced users want).

We think we have a plan!

To check that we can avoid breaking backwards compatibility, we've
looked, and basically nothing uses the metadata key interfaces directly.
Everything seems to use the Metadata one instead, eg
Metadata.DESCRIPTION rather than DublinCore.DESCRIPTION. So, we think we
can change the DublinCore one, provided that Metadata is unchanged.

Step one is therefore to change all the definitions in Dublin Core to be
proper properties. We copy over the old strings to Metadata, and
deprecate them there (until 2.0). Everything should still work.
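
Step one might look roughly like the sketch below. All the Sketch*
names are hypothetical stand-ins rather than the real Tika classes, and
the "dc:" prefix is just a placeholder. The point is only that the old
raw string stays on the Metadata side, deprecated, while the Dublin
Core definition becomes a real property.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these are Tika's real classes.
final class SketchProperty {
    private final String name;

    SketchProperty(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

// The Dublin Core definitions become proper properties.
// The "dc:" prefix is an assumed placeholder.
interface SketchDublinCore {
    SketchProperty DESCRIPTION = new SketchProperty("dc:description");
}

// The old raw string keys are copied here so existing callers still compile.
class SketchMetadata {
    /** @deprecated use {@link SketchDublinCore#DESCRIPTION}; to go in 2.0 */
    @Deprecated
    static final String DESCRIPTION = "description";

    private final Map<String, String> values = new HashMap<>();

    // Old style: raw string key.
    void set(String key, String value) {
        values.put(key, value);
    }

    // New style: property key.
    void set(SketchProperty property, String value) {
        values.put(property.getName(), value);
    }

    String get(String key) {
        return values.get(key);
    }

    String get(SketchProperty property) {
        return values.get(property.getName());
    }
}
```

Existing code that calls set(SketchMetadata.DESCRIPTION, ...) keeps
compiling (with a deprecation warning), which is the behaviour we're
after for the real classes.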

Next, we define a class to hold the common Tika metadata properties. 
These are the ones we consider to be common across all formats, which 
parsers should be trying to populate wherever they can. (Most parsers 
already do this, eg for title or description). We'll do a few of these, 
but we'll need others to contribute to help decide the rest. These will 
be delegated out to a standard property that someone else has already 
defined, as we do now.
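
To make the delegation idea concrete, here's a small sketch. The class
names (StandardProperty, DublinCoreDefs, CommonMetadataKeys) and the
"dc:" prefix are invented for the example. The common class doesn't
define anything new, it just points at properties that a standard
already defines.

```java
// Hypothetical sketch: class and field names are illustrative, not Tika's.
final class StandardProperty {
    private final String name;

    StandardProperty(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

// A standard that someone else has already defined (prefix assumed).
final class DublinCoreDefs {
    static final StandardProperty TITLE = new StandardProperty("dc:title");
    static final StandardProperty DESCRIPTION = new StandardProperty("dc:description");

    private DublinCoreDefs() {}
}

// The common set that every parser should try to populate.
// Each entry delegates to an existing standard property.
final class CommonMetadataKeys {
    static final StandardProperty TITLE = DublinCoreDefs.TITLE;
    static final StandardProperty DESCRIPTION = DublinCoreDefs.DESCRIPTION;

    private CommonMetadataKeys() {}
}
```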

With that done, we can also specify some aliases, so that when you set 
one property it can be defined to also set some others. This allows us 
to say "when you set the new dublin core description, for now also go 
and set the old style description". This support will also be helpful 
for mappings on XMP-aware (or similar) formats, to map between their 
custom properties and our common ones.
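
A minimal sketch of how the aliasing could behave, assuming a property
simply carries a list of extra keys to write to. AliasedProperty and
AliasingMetadata are hypothetical names, and the "dc:description" key is
a placeholder. The real design may well differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a property that knows which other keys to set too.
final class AliasedProperty {
    private final String key;
    private final List<String> aliasKeys;

    AliasedProperty(String key, String... aliasKeys) {
        this.key = key;
        this.aliasKeys =
                Collections.unmodifiableList(new ArrayList<>(Arrays.asList(aliasKeys)));
    }

    String getKey() {
        return key;
    }

    List<String> getAliasKeys() {
        return aliasKeys;
    }
}

// Hypothetical metadata store that honours the aliases on set.
final class AliasingMetadata {
    private final Map<String, String> values = new HashMap<>();

    void set(AliasedProperty property, String value) {
        values.put(property.getKey(), value);
        // Also write every alias, eg the old style raw key.
        for (String alias : property.getAliasKeys()) {
            values.put(alias, value);
        }
    }

    String get(String key) {
        return values.get(key);
    }
}
```

With a property declared as new AliasedProperty("dc:description",
"description"), a single set call fills in both the new key and the old
one, so callers of either style see the same value.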

Finally, we go through the parsers and update them to set the new 
properties, rather than the old strings. They'll maintain compatibility 
for all users (those using the Java lookups, and those using raw keys eg 
tika-cli), and when we drop that in 2.0 the parsers don't need to change.

We'll be opening issues for all of these, and doing the work in small 
chunks so everyone can follow. I believe this all fits with what 
everyone has been discussing for a while, doesn't break anything, and 
moves us forward. Despite the long email, the changes are actually quite small!

Nick
