nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: Thoughts on Parser design and dependencies
Date Sat, 19 Aug 2006 09:54:29 GMT
Jukka Zitting wrote:
> Hi,
> On 8/19/06, Sami Siren wrote:
>> So far Nutch has been built to deal mainly with text documents.
>> However, there is also a need to deal with non-textual objects, e.g. images,
>> movies, and sound, which provide content only in the form of metadata (OK,
>> perhaps also some text about the context of the object, if applicable), so
>> the metadata names we have today are only a subset of what might be.
>> I really would not want to restrict the metadata the interface can carry
>> to a fixed set.
> But if it's an open Map, how do you index and search using that, i.e.
> what is the mapping between the Map keys used by a parser component
> and the field names in the resulting Lucene index? How do we enforce
> that an MPEG parser uses the same Map keys as a JPEG parser when
> encountering metadata with the same semantics?
> I'm not opposed to using a Map for truly variable metadata, like HTML
> <meta/> tags with unknown names, but if we want common handling for
> example for Dublin Core metadata, it would be better to enforce that
> on the interface level.

Well, Nutch already does this in a way, but it's a "soft" endorsement 
rather than a hard enforcement .. ;) We define keys for all common 
metadata sets (DC, Office, HttpHeaders), and plugin writers are supposed 
to use them, unless they can't find any metadata key with matching 
semantics.

Then, other indexing plugins expect certain metadata to be available 
under these keys, and create appropriate Lucene fields, again using 
predefined field names.
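
To illustrate the convention, here is a minimal sketch, not actual Nutch 
code. The key constants, the two parser methods, and the key-to-field 
mapping are all hypothetical names made up for this example; the real 
key and field names in Nutch may differ. It only shows the idea: parsers 
put values into an open Map under shared, predefined keys, and the 
indexing side copies the keys it knows about into predefined fields.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataKeysSketch {

    // Hypothetical shared key constants (assumption: names chosen for
    // illustration only, not the keys Nutch actually defines).
    interface DublinCoreKeys {
        String TITLE = "dc.title";
        String CREATOR = "dc.creator";
    }

    // Hypothetical mapping from shared metadata keys to index field names.
    static final Map<String, String> FIELD_FOR_KEY = new LinkedHashMap<>();
    static {
        FIELD_FOR_KEY.put(DublinCoreKeys.TITLE, "title");
        FIELD_FOR_KEY.put(DublinCoreKeys.CREATOR, "author");
    }

    // Hypothetical JPEG parser: uses the shared keys where the semantics
    // match, and its own key where no shared key fits.
    static Map<String, String> parseJpeg() {
        Map<String, String> meta = new HashMap<>();
        meta.put(DublinCoreKeys.TITLE, "Sunset");
        meta.put(DublinCoreKeys.CREATOR, "A. Photographer");
        meta.put("exif.camera", "ExampleCam"); // no shared key for this one
        return meta;
    }

    // Hypothetical MPEG parser: same shared keys for the same semantics.
    static Map<String, String> parseMpeg() {
        Map<String, String> meta = new HashMap<>();
        meta.put(DublinCoreKeys.TITLE, "Holiday clip");
        meta.put(DublinCoreKeys.CREATOR, "A. Filmmaker");
        meta.put("mpeg.duration", "42s"); // format-specific key
        return meta;
    }

    // Indexing side: copy only the known keys into predefined field names.
    // Keys without a mapping are ignored here.
    static Map<String, String> toIndexFields(Map<String, String> meta) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : FIELD_FOR_KEY.entrySet()) {
            String value = meta.get(e.getKey());
            if (value != null) {
                fields.put(e.getValue(), value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(toIndexFields(parseJpeg()));
        System.out.println(toIndexFields(parseMpeg()));
    }
}
```

Nothing in this sketch stops a parser from inventing its own key for a 
title, which is exactly the "soft" part: the Map stays open, and 
consistency depends on plugin writers using the shared constants.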

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
