Hi All
I've just been brainstorming with Ray Gauss, and we think we've come up
with a way to move towards cleaner and clearer metadata property
definitions (prefixes, properties with types etc), whilst maintaining
backwards compatibility and avoiding too much work for parsers during
the migration. It'll hopefully also help with the larger plan of
improving the metadata, and make life easier for people working on that.
I'll use DublinCore as an example, but it's not the only one this'll
apply to.
Today, we have all the keys from DublinCore imported onto the Metadata
object, and the parsers all call eg Metadata.DESCRIPTION rather than
DublinCore.DESCRIPTION. This is a string key, not a property, so there's
no information on it about type etc, and it's a raw key of "description"
so people outside of the Java space (eg tika-cli users) don't know what
it is defined as.
What I think we'd really like is for that to be a property, with type,
with a key that includes our chosen prefix (so that tika-cli users etc
know what it is), that doesn't break backwards compatibility until 2.0.
Additionally, we want to identify which properties are common, which all
parsers should be mapping their metadata onto (eg everything should map
the metadata that corresponds roughly to what Dublin Core defines
Description to be, no matter what the format calls it), in addition to
any format-specific ones (which only advanced users want).
We think we have a plan!
In order to avoid breaking backwards compatibility, we've looked and
basically nothing uses the metadata key interfaces directly. Everything
seems to use the Metadata one instead, eg Metadata.DESCRIPTION rather
than DublinCore.DESCRIPTION. So, we think we can change the DublinCore
interface, provided that Metadata is unchanged.
Step one is therefore to change all the definitions in Dublin Core to be
proper properties. We copy over the old strings to Metadata, and
mark them as @Deprecated (until 2.0). Everything should still work.
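As a rough sketch of what step one might look like: the classes below are
simplified stand-ins written for this email, not the real Tika classes, and
the "dc:" prefix is only an assumed example of "our chosen prefix".

```java
// Simplified stand-ins for illustration only, not the real Tika classes.
final class StepOneSketch {

    /** A typed metadata property: a prefixed key plus a value type. */
    static final class Property {
        enum ValueType { TEXT, DATE, INTEGER }

        private final String name;
        private final ValueType valueType;

        private Property(String name, ValueType valueType) {
            this.name = name;
            this.valueType = valueType;
        }

        /** Hypothetical factory for a plain text property. */
        static Property text(String name) {
            return new Property(name, ValueType.TEXT);
        }

        String getName() { return name; }
        ValueType getValueType() { return valueType; }
    }

    /** The Dublin Core definitions become proper properties. */
    static final class DublinCore {
        // "dc:" is an assumed prefix, chosen only for this sketch.
        static final Property DESCRIPTION = Property.text("dc:description");
        private DublinCore() {}
    }

    /** Metadata keeps the old raw string key, deprecated until 2.0. */
    static final class Metadata {
        /** @deprecated use {@link DublinCore#DESCRIPTION} instead */
        @Deprecated
        static final String DESCRIPTION = "description";
        private Metadata() {}
    }

    public static void main(String[] args) {
        System.out.println(DublinCore.DESCRIPTION.getName());      // new prefixed key
        System.out.println(DublinCore.DESCRIPTION.getValueType()); // type now travels with the key
        System.out.println(Metadata.DESCRIPTION);                  // old raw key, still compiles
    }
}
```

Existing code that refers to Metadata.DESCRIPTION keeps compiling against
the old string, it just picks up a deprecation warning.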
Next, we define a class to hold the common Tika metadata properties.
These are the ones we consider to be common across all formats, which
parsers should be trying to populate wherever they can. (Most parsers
already do this, eg for title or description). We'll do a few of these,
but we'll need others to contribute to help decide the rest. Each one
will delegate to a standard property that someone else has already
defined, as we do now.
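A minimal sketch of how such a common properties class could delegate to
existing standard definitions. The name CommonProperties, the stand-in
Property class and the "dc:" keys are all assumptions for illustration,
nothing here is decided yet.

```java
// Simplified stand-ins for illustration only, not the real Tika classes.
final class CommonPropertiesSketch {

    /** Minimal stand-in for a property; only the key is modelled here. */
    static final class Property {
        private final String name;
        Property(String name) { this.name = name; }
        String getName() { return name; }
    }

    /** Standard definitions that already exist elsewhere ("dc:" prefix is assumed). */
    static final class DublinCore {
        static final Property TITLE = new Property("dc:title");
        static final Property DESCRIPTION = new Property("dc:description");
        private DublinCore() {}
    }

    /**
     * Hypothetical holder for the common properties. Each constant simply
     * points at a standard property rather than defining a new key.
     */
    static final class CommonProperties {
        static final Property TITLE = DublinCore.TITLE;
        static final Property DESCRIPTION = DublinCore.DESCRIPTION;
        private CommonProperties() {}
    }

    public static void main(String[] args) {
        // The common property is the very same object as the standard one.
        System.out.println(CommonProperties.DESCRIPTION.getName());
        System.out.println(CommonProperties.TITLE == DublinCore.TITLE);
    }
}
```

Because the common constants are just references, a parser that sets the
common description and a user who reads the Dublin Core one end up on the
same key.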
With that done, we can also specify some aliases, so that when you set
one property it can be defined to also set some others. This allows us
to say "when you set the new Dublin Core description, for now also go
and set the old style description". This support will also be helpful
for mappings on XMP-aware (or similar) formats, to map between their
custom properties and our common ones.
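One way the alias idea could work, as a sketch only: the Property and
Metadata classes below are simplified stand-ins, the alias list on Property
is a hypothetical design, and "dc:description" is an assumed key.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for illustration only, not the real Tika classes.
final class AliasSketch {

    /** A property with a primary key and zero or more alias keys. */
    static final class Property {
        private final String name;
        private final List<String> aliases;

        Property(String name, String... aliases) {
            this.name = name;
            this.aliases = Collections.unmodifiableList(Arrays.asList(aliases));
        }

        String getName() { return name; }
        List<String> getAliases() { return aliases; }
    }

    /** Minimal metadata store: setting a property also sets its aliases. */
    static final class Metadata {
        private final Map<String, String> values = new HashMap<>();

        void set(Property property, String value) {
            values.put(property.getName(), value);
            for (String alias : property.getAliases()) {
                values.put(alias, value); // keep the old style key populated too
            }
        }

        String get(String key) { return values.get(key); }
        String get(Property property) { return values.get(property.getName()); }
    }

    // New style description, aliased to the old raw key for now.
    static final Property DC_DESCRIPTION = new Property("dc:description", "description");

    public static void main(String[] args) {
        Metadata metadata = new Metadata();
        metadata.set(DC_DESCRIPTION, "A test document");
        System.out.println(metadata.get(DC_DESCRIPTION)); // lookup via the new property
        System.out.println(metadata.get("description"));  // lookup via the old raw key
    }
}
```

With something like this, dropping the old key in 2.0 would just mean
removing the alias from the property definition.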
Finally, we go through the parsers and update them to set the new
properties, rather than the old strings. They'll maintain compatibility
for all users (those using the Java lookups, and those using raw keys eg
tika-cli), and when we drop that compatibility in 2.0 the parsers won't
need to change.
We'll be opening issues for all of these, and doing the work in small
chunks so everyone can follow. I believe this all fits with what
everyone has been discussing for a while, doesn't break anything, and
moves us forward. Despite the long email, the changes are actually quite small!
Nick