nutch-dev mailing list archives

From "Ron van der Vegt (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1815) Metadata Parsed with parse-tika is Duplicated
Date Fri, 09 Jan 2015 09:44:34 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ron van der Vegt updated NUTCH-1815:
------------------------------------
    Attachment: NUTCH-1815-1.9.patch

A small patch for 1.9 that does not add a value to the prefixed metadata if a value has already
been added there under the same key.
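
The behaviour described above can be sketched roughly as follows. This is only an illustration of the "skip if the key is already present" idea, not the contents of the attached patch: the class PrefixedMetadataSketch, the method addIfAbsent, and the map-based store are all hypothetical stand-ins, and the "metatag." prefix is taken from the sample output quoted in the issue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multi-valued metadata store; not Nutch's own class.
public class PrefixedMetadataSketch {
    // Prefix as seen in the indexchecker output (metatag.description).
    static final String PREFIX = "metatag.";

    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Adds the value under the prefixed key only if nothing is stored there yet.
    // Returns true if the value was added, false if the key was already present.
    public boolean addIfAbsent(String name, String value) {
        String key = PREFIX + name;
        if (values.containsKey(key)) {
            return false; // a value with this key already exists, so skip it
        }
        List<String> list = new ArrayList<>();
        list.add(value);
        values.put(key, list);
        return true;
    }

    // Returns the values stored for the given (unprefixed) name, or an empty list.
    public List<String> get(String name) {
        return values.getOrDefault(PREFIX + name, Collections.emptyList());
    }

    public static void main(String[] args) {
        PrefixedMetadataSketch md = new PrefixedMetadataSketch();
        md.addIfAbsent("description", "example description");
        md.addIfAbsent("description", "example description"); // ignored
        System.out.println(md.get("description").size()); // prints 1
    }
}
```

Note that a guard like this keeps only the first value per key, so a page that legitimately carries two different values for the same metatag would lose the second one in this sketch.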


> Metadata Parsed with parse-tika is Duplicated
> ---------------------------------------------
>
>                 Key: NUTCH-1815
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1815
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer, parser
>    Affects Versions: 1.8
>            Reporter: Jonathan Cooper-Ellis
>            Assignee: Julien Nioche
>            Priority: Minor
>             Fix For: 1.10
>
>         Attachments: NUTCH-1815-1.9.patch
>
>
> When Nutch is configured to parse metatags and index metadata from HTML documents, disabling parse-html (and using parse-tika instead) causes each metadata field to be indexed twice with identical content.
> I only modified plugin.includes (description and keywords metatags are included in nutch-site.xml by default, so I did not modify those):
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>...</description>
> </property>
> Sample output:
> $ bin/nutch indexchecker http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
> fetching: http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
> parsing: http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
> contentType: text/html
> content :	Commonwealth Fund survey: Obamacare helped 9.5 million Americans get health insurance, thanks to exc
> title :	Commonwealth Fund survey: Obamacare helped 9.5 million Americans get health insurance, thanks to exc
> host :	www.bizjournals.com
> tstamp :	Thu Jul 10 17:34:56 UTC 2014
> metatag.description :	A new survey by the Commonwealth Fund found that 9.5 million previously uninsured Americans got cove
> metatag.description :	A new survey by the Commonwealth Fund found that 9.5 million previously uninsured Americans got cove
> url :	http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-
> In this case, metatag.description appears twice. If parse-html is added back to plugin.includes and the same command is run, metatag.description will only appear once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
