tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: ISO 19115 as a metadata model for Tika?
Date Thu, 15 Oct 2015 12:47:40 GMT
> So this email is for discussion only - not for immediate action.
Got it. As you can see from TIKA-1607 and [0], this has been an ongoing and important discussion,
and I appreciate your contributions. I'm not a standards person, and I was interested to learn
more about ISO 19115.

> But approach 1 or c suggests that different conceptual models (e.g. Dublin core versus
> ISO 19115) would co-exist.
Yes. As I think more about c) (and note that this is a personal and half-baked proposal,
not at all speaking for the Tika community), we could offer multiple models for advanced users.
If someone wanted to contribute code that would represent metadata in ISO 19115 for the appropriate
parsers, or if we could scrape ISO 19115 out of documents (as we might consider doing with
XMP streams), the advanced user could grab that node and go to town. To emphasize Nick's
point, we absolutely want to keep the basics easy to get to. No single standard is likely
to be sufficient for us, and yet we also don't want to create our very own.

Again, I can't emphasize enough the importance of Nick's point on keeping simple things simple.
As SOLR-7232 shows, even our current model is not being used correctly by very important
consumers. I really need to get to work on that one...



[0] http://wiki.apache.org/tika/MetadataRoadmap

-----Original Message-----
From: Martin Desruisseaux [mailto:martin.desruisseaux@geomatys.com] 
Sent: Thursday, October 15, 2015 6:10 AM
To: dev@tika.apache.org
Subject: Re: ISO 19115 as a metadata model for Tika?

Le 14/10/15 20:15, Allison, Timothy B. a écrit :
> On TIKA-1607, there are two (and a half) proposals:
> 1) move everything to DOM with helper classes for common elements
> 2) use POJOs as metadata values
> c) ;) keep current setup, perhaps add binary values, use DOM inputstreams for things
> that already have standards (e.g. Dublin core). This could be a transitional step to option
> 1 in Tika 2.0.
> If we went with 1 or c) we could embed ISO 19115, we could either embed the info within
> the DOM or add an ISO DOM stream that would include this information.

Thanks for explaining. But approach 1 or c suggests that different conceptual models (e.g.
Dublin core versus ISO 19115) would co-exist, regardless of the underlying data structure
(DOM or something else), is that right? For example, if someone wants to get the title of a
document, would they specify, for example, "I'm using the TITLE key from the Dublin core
model" or "the TITLE key from the ISO 19115 model"? Or does Tika plan to propose its own
"universal" model?

> (...snip...) However, once we move beyond Map<String, String[]> the 
> user is going to have to have some knowledge of the metadata structure 
> to extract information, whether that's POJO, DOM or Map<String, Node>.

Right, this is related to my question above. To avoid the need to know the metadata structure
of a specific data format, Tika (in my understanding) currently maps some metadata to the
Dublin core model, which is used as a "universal" conceptual model. So anyone can ask for the
title without knowing where the title is stored in the various data formats.
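
To make the "universal model" idea above concrete, here is a minimal, self-contained Java
sketch. It does not use Tika's actual API. The class name UniversalKeySketch, the key
"dc:title" and the native keys "pdf:Title" and "netcdf:title" are all hypothetical names
chosen for illustration; only the flat Map<String, String[]> shape and the idea of a shared
Dublin core title come from this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is not Tika's API.
final class UniversalKeySketch {
    // Assumed universal key, named after the Dublin core "title" element.
    static final String DC_TITLE = "dc:title";

    // Assumed mapping from format-specific keys to the universal key.
    private static final Map<String, String> NATIVE_TO_UNIVERSAL = new HashMap<>();
    static {
        NATIVE_TO_UNIVERSAL.put("pdf:Title", DC_TITLE);
        NATIVE_TO_UNIVERSAL.put("netcdf:title", DC_TITLE);
    }

    // Copies native metadata into a new flat map, renaming the keys that have a
    // universal equivalent and leaving all other keys unchanged.
    static Map<String, String[]> normalize(Map<String, String[]> nativeMetadata) {
        Map<String, String[]> result = new HashMap<>();
        for (Map.Entry<String, String[]> entry : nativeMetadata.entrySet()) {
            String key = NATIVE_TO_UNIVERSAL.getOrDefault(entry.getKey(), entry.getKey());
            result.put(key, entry.getValue());
        }
        return result;
    }

    // Returns the first title value, or null if no title was mapped.
    static String title(Map<String, String[]> normalized) {
        String[] values = normalized.get(DC_TITLE);
        return (values == null || values.length == 0) ? null : values[0];
    }
}
```

With such a mapping, a caller asks for the title through one key regardless of the source
format. The price is that anything without a universal equivalent stays format-specific.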

However, for some more advanced needs, the Dublin core model is not enough and cannot easily
be extended. A new conceptual model is needed.
ISO 19115 is one such conceptual model that could be used as a replacement for Dublin core, but
there are also other conceptual models that are more complex still than ISO 19115. Are there
any thoughts about what the compromise between simplicity and completeness would be in Tika?

> On your interest in ISO 19115, to echo Nick, what specifically do you need? What document
> formats do you see populating this information?

We do not need changes in the Tika model at this time, since Apache SIS has its own metadata
engine (but targeting only geospatial data formats like NetCDF, with no Word or PDF parsing, and
using ISO 19115 as its "universal model" instead of Dublin core). But we have seen talks about
geospatial metadata in Tika at recent ApacheCon events, and I was a little bit worried to see
that some proposed solutions (i.e. new properties) were Tika-specific instead of using
international standards (note: I'm not suggesting using Apache SIS, only considering the
international standard behind it).

So I'm not looking for a solution to a technical problem, but I'm trying to learn more about
the strategic direction that Tika wishes to take.
Would Tika consider moving to a richer metadata model than Dublin core? Would ISO 19115
be considered too geospatial-centric (which I could understand)? Would Tika support more
than one "universal model" if it wants to combine the simplicity of Dublin core with the
richness of other international standards?

About document formats populated with ISO 19115 metadata: standalone ISO 19115 files are
provided by various data producers, for example 1) from NASA, 2) from the Spanish mapping
agency or 3) from all French government agencies:

 1. http://podaac.jpl.nasa.gov/ws/metadata/dataset/?shortName=AVISO_L4_DYN_TOPO_1DEG_1MO&format=iso
 2. http://www.ign.es/csw-inspire/srv/spa/xml_iso19139?id=9584
 3. http://www.geocatalogue.fr/getMetadata?format=XML&id=1785

ISO 19115 information is also embedded in raster data, as in the "GML in JPEG2000" standard.
Equivalent information is embedded in NetCDF files and translated to the ISO 19115 model by
tools like "ncISO" from NOAA/NGDC. I saw that Tika has an
org.apache.tika.metadata.ClimateForcast interface, but it describes only the information at
the root of NetCDF files without describing the variables included in those files (which
would need a metadata tree structure).
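
As a rough illustration of why per-variable information calls for a tree rather than a flat
key/value list, here is a small Java sketch. It is not taken from Tika, Apache SIS or ISO
19115; the MetadataNode class and the node names used in the usage note ("variables",
"sea_surface_height", "units") are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tree node; a container node simply has a null value.
final class MetadataNode {
    final String name;
    final String value;
    final List<MetadataNode> children = new ArrayList<>();

    MetadataNode(String name, String value) {
        this.name = name;
        this.value = value;
    }

    // Appends a child and returns this node so that calls can be chained.
    MetadataNode add(MetadataNode child) {
        children.add(child);
        return this;
    }

    // Follows the given child names from this node and returns the value of the
    // node reached, or null if any step of the path does not exist.
    String find(String... path) {
        MetadataNode current = this;
        for (String step : path) {
            MetadataNode next = null;
            for (MetadataNode child : current.children) {
                if (child.name.equals(step)) {
                    next = child;
                    break;
                }
            }
            if (next == null) {
                return null;
            }
            current = next;
        }
        return current.value;
    }
}
```

A root node can then carry file-level attributes such as a title as direct children, while
each variable gets its own subtree, so that root.find("variables", "sea_surface_height",
"units") reaches an attribute that a root-only description has no place for.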

So this email is for discussion only - not for immediate action.


