nutch-dev mailing list archives

From "Fengtan (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1553) Property 'indexer.delete.robots.noindex' not working when using parser-html.
Date Fri, 27 May 2016 02:17:12 GMT


Fengtan commented on NUTCH-1553:

Also confirmed.
The cause seems to lie in HtmlParser: it parses the meta robots directives and stores them in
a variable named 'metaTags', but it never copies them into the Nutch parse metadata.
Attached is a suggested patch.
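
To illustrate the idea (this is a minimal sketch, not the attached patch): once the robots
directives have been parsed, they would need to be written into the parse metadata under the
"robots" key so that the indexer can see them. The class RobotsFlags and the method
copyRobotsToMetadata below are hypothetical names, and a plain Map stands in for the Nutch
metadata object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class RobotsMetaPropagation {

    /** Hypothetical stand-in for the directives held in the 'metaTags' variable. */
    static final class RobotsFlags {
        final boolean noIndex;
        final boolean noFollow;

        RobotsFlags(boolean noIndex, boolean noFollow) {
            this.noIndex = noIndex;
            this.noFollow = noFollow;
        }
    }

    /**
     * Copies the parsed directives into the metadata map under the "robots" key.
     * The map is an assumption standing in for the real parse metadata.
     * Nothing is written when no directive is set.
     */
    static void copyRobotsToMetadata(RobotsFlags flags, Map<String, String> parseMeta) {
        List<String> directives = new ArrayList<>();
        if (flags.noIndex) {
            directives.add("noindex");
        }
        if (flags.noFollow) {
            directives.add("nofollow");
        }
        if (!directives.isEmpty()) {
            parseMeta.put("robots", String.join(",", directives));
        }
    }
}
```

With something along these lines in place, a later lookup of "robots" in the metadata would
find the directive without depending on any other plugin.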

> Property 'indexer.delete.robots.noindex' not working when using parser-html.
> ----------------------------------------------------------------------------
>                 Key: NUTCH-1553
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer, parser
>    Affects Versions: 1.6
>            Reporter: Alfonso Presa
>            Priority: Minor
>         Attachments: NUTCH-1553-trunk-1.patch
> Maybe I'm doing something wrong, but it seems to me that the NUTCH-1434 patch only works
> when using Tika's parser. When using parser-html, the "robots" metatag is only populated if
> the parse-metatags plugin is enabled, and then only with the prefix "metatag.". So
> parseData.getMeta("robots") returns nothing when not using Tika.
> I guess the simplest solution would be to provide a fallback: if parseData.getMeta("robots")
> is null, read parseData.getMeta("metatag.robots") instead.
> Also, the dependency of this property on the parse-metatags plugin when using parser-html
> would be worth documenting somewhere (nutch-default.xml?).
> Thanks!
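
The fallback suggested in the issue description could look roughly like the sketch below. It
is an illustration only: a plain Map stands in for the parse metadata read through
parseData.getMeta(...), and the class and method names (RobotsMetaFallback, robotsDirective,
isNoIndex) are hypothetical.

```java
import java.util.Locale;
import java.util.Map;

final class RobotsMetaFallback {

    /**
     * Returns the value stored under "robots", falling back to
     * "metatag.robots" when the first key is absent. Returns null
     * when neither key is present.
     */
    static String robotsDirective(Map<String, String> parseMeta) {
        String robots = parseMeta.get("robots");
        if (robots == null) {
            robots = parseMeta.get("metatag.robots");
        }
        return robots;
    }

    /** True when the resolved directive contains "noindex", ignoring case. */
    static boolean isNoIndex(Map<String, String> parseMeta) {
        String robots = robotsDirective(parseMeta);
        return robots != null
            && robots.toLowerCase(Locale.ROOT).contains("noindex");
    }
}
```

The fallback keeps the unprefixed key as the primary source, so a value found under "robots"
is used as before and the prefixed key is consulted only when "robots" is missing.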

This message was sent by Atlassian JIRA