nutch-dev mailing list archives

From "Jeff Cocking (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1559) parse-metatags duplicates extracted metatags in combination with parse-tika
Date Thu, 30 Apr 2015 19:38:06 GMT


Jeff Cocking commented on NUTCH-1559:

While investigating this issue, I found that the metatag information appears to be loaded twice.
MetaTagsParser contains code to handle metatags that are not handled by Tika.

The Tika plugin copies the Tika metadata into the Nutch metadata (around line 184):
        // populate Nutch metadata with Tika metadata
        String[] TikaMDNames = tikamd.names();
        for (String tikaMDName : TikaMDNames) {
            if (tikaMDName.equalsIgnoreCase(Metadata.TITLE))
                continue;
            // TODO what if multivalued?
            nutchMetadata.add(tikaMDName, tikamd.get(tikaMDName));
        }

The MetaTagsParser is set up to parse both the Tika metadata and the Nutch metadata, which is the reason for the duplicate values (around line 104):
    // check in the metadata first : the tika-parser
    // might have stored the values there already
    for (String mdName : metadata.names()) {
      addIndexedMetatags(metadata, mdName, metadata.getValues(mdName));
    }

    Metadata generalMetaTags = metaTags.getGeneralTags();
    for (String tagName : generalMetaTags.names()) {
      addIndexedMetatags(metadata, tagName, generalMetaTags.getValues(tagName));
    }
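
To illustrate the effect of the two loops, here is a minimal standalone sketch. It does not use the Nutch or Tika classes: plain maps stand in for the parse metadata and the general tags, the class name and the helpers addAll, addDistinct and collect are hypothetical, and the "metatag." key prefix is an assumption made for the example. It only shows that feeding the same <name, value> pair through both loops yields two values, and that a duplicate check in the helper would avoid this. It is not a proposed patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetatagDuplicationSketch {

    // Hypothetical helper: appends every value under a prefixed key, with no duplicate check.
    static void addAll(Map<String, List<String>> target, String name, List<String> values) {
        List<String> bucket =
            target.computeIfAbsent("metatag." + name.toLowerCase(), k -> new ArrayList<>());
        bucket.addAll(values);
    }

    // Hypothetical variant: skips values that are already stored under the key.
    static void addDistinct(Map<String, List<String>> target, String name, List<String> values) {
        List<String> bucket =
            target.computeIfAbsent("metatag." + name.toLowerCase(), k -> new ArrayList<>());
        for (String value : values) {
            if (!bucket.contains(value)) {
                bucket.add(value);
            }
        }
    }

    // Mirrors the two loops: first the parse metadata, then the general tags.
    static Map<String, List<String>> collect(Map<String, List<String>> parseMeta,
                                             Map<String, List<String>> generalTags,
                                             boolean distinct) {
        Map<String, List<String>> indexed = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : parseMeta.entrySet()) {
            if (distinct) addDistinct(indexed, e.getKey(), e.getValue());
            else addAll(indexed, e.getKey(), e.getValue());
        }
        for (Map.Entry<String, List<String>> e : generalTags.entrySet()) {
            if (distinct) addDistinct(indexed, e.getKey(), e.getValue());
            else addAll(indexed, e.getKey(), e.getValue());
        }
        return indexed;
    }

    public static void main(String[] args) {
        // Assumed input: the same metatag is present in both sources.
        Map<String, List<String>> parseMeta = new LinkedHashMap<>();
        parseMeta.put("description", Arrays.asList("A page"));
        Map<String, List<String>> generalTags = new LinkedHashMap<>();
        generalTags.put("description", Arrays.asList("A page"));

        System.out.println(collect(parseMeta, generalTags, false)); // {metatag.description=[A page, A page]}
        System.out.println(collect(parseMeta, generalTags, true));  // {metatag.description=[A page]}
    }
}
```

Whether the duplicate check belongs in the metatags plugin or whether parse-tika should stop copying these values is the open question of this issue. The sketch only demonstrates the mechanism.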

> parse-metatags duplicates extracted metatags in combination with parse-tika
> ---------------------------------------------------------------------------
>                 Key: NUTCH-1559
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.11
> If the plugin parse-metatags is used in combination with parse-tika, the extracted metatags (the pairs <name, value>) are duplicated.
> The metatags are found twice in parse.getData().getParseMeta() and in metaTags.getGeneralTags(). Is this necessary? Maybe we should fix parse-tika in this point?

This message was sent by Atlassian JIRA
