nutch-dev mailing list archives

From "Dmitry Cherniachenko (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1525) Generator to record external links even when db.ignore.external.links set to true
Date Fri, 14 Feb 2014 09:26:19 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Cherniachenko updated NUTCH-1525:
----------------------------------------

    Attachment: nutch-logExternal.patch

Attached the patch for Nutch 1.7.

With it applied, you can add the following to log4j.properties:
{code}
log4j.logger.org.apache.nutch.parse.ParseOutputFormat.externalLinks=INFO,extlinks

log4j.appender.extlinks=org.apache.log4j.DailyRollingFileAppender
log4j.appender.extlinks.File=${hadoop.log.dir}/external-links.log
log4j.appender.extlinks.DatePattern=.yyyy-MM-dd
log4j.appender.extlinks.layout=org.apache.log4j.PatternLayout
log4j.appender.extlinks.layout.ConversionPattern=%m%n
{code}

All the ignored external links will then be logged cleanly (message only, per the %m%n pattern) to external-links.log under ${hadoop.log.dir}.
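
For readers who have not opened the patch, here is a rough sketch of the kind of check that decides which outlinks count as "external" and would therefore reach such a logger. This is not the code in nutch-logExternal.patch. The class and method names (ExternalLinkSketch, hostOf, isExternal, ignoredLinks) are hypothetical, and it assumes that "external" means the outlink's host differs from the host of the page it was found on. It collects the ignored links in a list, where a real implementation might pass each one to the externalLinks logger configured above.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration; not taken from the attached patch. */
class ExternalLinkSketch {

    /** Lower-cased host of a URL, or null if it cannot be parsed or has no host. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null; // malformed URL
        }
    }

    /** Assumption: a link is external when its host differs from the source page's host. */
    static boolean isExternal(String fromUrl, String toUrl) {
        String fromHost = hostOf(fromUrl);
        String toHost = hostOf(toUrl);
        return toHost == null || !toHost.equals(fromHost);
    }

    /** Returns the outlinks that would be ignored, in the order they were seen. */
    static List<String> ignoredLinks(String fromUrl, List<String> outlinks) {
        List<String> ignored = new ArrayList<>();
        for (String to : outlinks) {
            if (isExternal(fromUrl, to)) {
                // A real implementation might log `to` here instead of collecting it.
                ignored.add(to);
            }
        }
        return ignored;
    }
}
```

With the %m%n pattern above, logging only the target URL as the message would give a file with one ignored link per line, which is easy to feed back into a later inject step.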

> Generator to record external links even when db.ignore.external.links set to true
> ---------------------------------------------------------------------------------
>
>                 Key: NUTCH-1525
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1525
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: nutch-logExternal.patch
>
>
> When fetching pages from specific domains we have various options, e.g. use urlfilters, or set the above property to true before injecting urls into the webdb. However, with the former it is recognised that complex regexes can slow down processing, and with the latter we disregard a number of urls which could potentially become useful in the future.
> Unfortunately there is no way to record external links encountered for future processing, although the wiki suggests that a very small patch to the generator code can allow you to log these links to hadoop.log. Although this is better, a more robust storage mechanism would be preferred. This may tie in with custom counters we've already specified or may require new counters to be implemented.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
