nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2335) Injector not to filter and normalize existing URLs in CrawlDb
Date Thu, 17 Aug 2017 08:46:03 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130105#comment-16130105 ]

Markus Jelsma commented on NUTCH-2335:
--------------------------------------

Sebastian, there is a problem with either this patch or the 1.13 sources we applied it to.
Without any settings, by default this patch will NOT filterNormalize the URLs being injected,
but it WILL filterNormalize existing URLs. That is the opposite of what this patch is supposed
to do.

I have checked the code many times and it looks fine; I don't see the problem, but there
is one. I even went so far as to put an out.println in filterNormalize to prove it to myself.
This is the console output of two consecutive runs, starting from an empty crawldb:

{code}
markus@midas:~/projects/openindex/nutch/trunk/scripts/apache-nutch-1.13/runtime/local$ bin/nutch inject crawl/crawldb urls/
Injector: starting at 2017-08-17 10:39:36
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 1
Injector: finished at 2017-08-17 10:39:37, elapsed: 00:00:01
markus@midas:~/projects/openindex/nutch/trunk/scripts/apache-nutch-1.13/runtime/local$ 
markus@midas:~/projects/openindex/nutch/trunk/scripts/apache-nutch-1.13/runtime/local$ bin/nutch inject crawl/crawldb urls/
Injector: starting at 2017-08-17 10:40:31
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
normalize/filter: http://example.org/
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 1
Injector: Total new urls injected: 0
Injector: finished at 2017-08-17 10:40:33, elapsed: 00:00:01
{code}

I'll attach our local sources, but they are based on this latest patch without modification.
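
For reference, the behaviour the issue asks for can be sketched roughly as follows. This is only an illustration and not the Nutch Injector source: the class InjectorSketch, the methods shouldFilterNormalize and process, and the filterNormalizeAll flag are hypothetical names chosen for this example. Injected URLs always go through filterNormalize, while URLs already in the CrawlDb do so only when explicitly requested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch of the intended default, NOT the actual Nutch code.
class InjectorSketch {

  /** Returns true if a URL should be passed through filters and normalizers. */
  static boolean shouldFilterNormalize(boolean fromSeedList, boolean filterNormalizeAll) {
    // Injected (seed) URLs are always processed; existing CrawlDb
    // entries are processed only when explicitly requested.
    return fromSeedList || filterNormalizeAll;
  }

  /** Applies filterNormalize where required; a null result means "rejected". */
  static String process(String url, boolean fromSeedList, boolean filterNormalizeAll,
      UnaryOperator<String> filterNormalize) {
    if (shouldFilterNormalize(fromSeedList, filterNormalizeAll)) {
      return filterNormalize.apply(url);
    }
    return url; // existing entry, passed through untouched
  }

  public static void main(String[] args) {
    // Record every call, similar in spirit to the println used for debugging above.
    List<String> calls = new ArrayList<>();
    UnaryOperator<String> logging = url -> {
      calls.add(url);
      return url;
    };

    process("http://example.org/", true, false, logging);   // URL from the seed list
    process("http://example.org/", false, false, logging);  // URL already in the CrawlDb

    // With the intended default only the seed URL is filtered and normalized.
    System.out.println("filterNormalize calls: " + calls.size());
  }
}
```

Under these assumptions the sketch reports a single filterNormalize call, made for the injected URL. The console output above shows no call at all on the first run, which is where the observed behaviour differs from the intended one.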

> Injector not to filter and normalize existing URLs in CrawlDb
> -------------------------------------------------------------
>
>                 Key: NUTCH-2335
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2335
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb, injector
>    Affects Versions: 1.12
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>             Fix For: 1.14
>
>         Attachments: Injector.java
>
>
> With NUTCH-1712 the behavior of the Injector has changed in case new URLs are added to
> an existing CrawlDb:
> - before only injected URLs were filtered and normalized
> - now filters and normalizers are applied to all URLs including those already in the
>   CrawlDb
> The default should be as before not to filter existing URLs. Filtering and normalizing
> may take long for large CrawlDbs and/or complex URL filters. If URL filter or normalizer rules
> are not changed there is no need to apply them anew every time new URLs are added. Of course,
> injected URLs should be filtered and normalized by default.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
