nutch-dev mailing list archives

From "Uros Gruber (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-381) Ignore external link not work as expected
Date Thu, 05 Oct 2006 19:35:41 GMT
Ignore external link not work as expected
-----------------------------------------

                 Key: NUTCH-381
                 URL: http://issues.apache.org/jira/browse/NUTCH-381
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.8.1
            Reporter: Uros Gruber
            Priority: Critical


Currently there is no way to properly limit the fetcher without regexp rules, so we use the ignore.external.link
option, but it seems that it doesn't work in all cases.
Below are example URLs I'm seeing in the fetcher output, even though

cat urls1 urls2 urls3 urls/urls | grep yahoo.com

doesn't return any hits.

fetching http://help.yahoo.com/help/sports
fetching http://www.turkish-xxx.com/adult-traffic-trade.php
fetching http://help.yahoo.com/help/us/astr/
fetching http://www.polish-xxx.com/de-index.html
fetching http://www.driversplanet.com/Articles/Software/SpareBackup2.4.aspx
fetching http://help.yahoo.com/help/groups
fetching http://help.yahoo.com/help/fin/
fetching http://www.driversplanet.com/Articles/Software/WindowsStorageServer2003R2.aspx
fetching http://help.yahoo.com/help/us/edit/
fetching http://www.polish-xxx.com/es-index.html
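
The grep check above can be generalized to every host in the log at once. What follows is only a minimal sketch, assuming the seed lists are plain text files with one URL per line and that the fetcher output consists of lines of the form "fetching <url>" as shown above. The function names hosts_of and unexpected_fetches are hypothetical helpers for illustration and are not part of Nutch.

```python
from urllib.parse import urlsplit


def hosts_of(urls):
    """Return the set of lowercased hostnames found in an iterable of URLs."""
    hosts = set()
    for url in urls:
        host = urlsplit(url.strip()).hostname
        if host:
            hosts.add(host)
    return hosts


def unexpected_fetches(seed_urls, fetch_log_lines):
    """List fetched URLs whose host does not appear in any seed URL."""
    seed_hosts = hosts_of(seed_urls)
    unexpected = []
    for line in fetch_log_lines:
        parts = line.split()
        # Only consider lines of the form "fetching <url>".
        if len(parts) == 2 and parts[0] == "fetching":
            host = urlsplit(parts[1]).hostname
            if host and host not in seed_hosts:
                unexpected.append(parts[1])
    return unexpected
```

Feeding it the lines of the seed files (urls1, urls2, urls3, urls/urls) and the fetcher output would list every fetched URL whose host is absent from the seeds, which is the same evidence the grep gives for yahoo.com.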

Has anyone else noticed this?

I assume this has something to do with expired domains whose pages are generated randomly.
But that still doesn't explain why URLs from other domains were added. Maybe the urlregexp filter rule +* is involved.
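
For reference, the behavior expected from an "ignore external links" option can be sketched as a host comparison between the page a link was found on and the link target. This is an illustrative assumption about the intended semantics, not Nutch's actual code; is_external and keep_internal_outlinks are hypothetical names, and example.com is a placeholder host.

```python
from urllib.parse import urlsplit


def is_external(from_url, to_url):
    """True when to_url's host differs from from_url's host, or is missing."""
    from_host = urlsplit(from_url).hostname
    to_host = urlsplit(to_url).hostname
    return to_host is None or to_host != from_host


def keep_internal_outlinks(from_url, outlinks):
    """Drop every outlink that points to a different host than from_url."""
    return [url for url in outlinks if not is_external(from_url, url)]
```

Under this sketch the check is relative to the page being parsed, not to the seed list. If a URL on a foreign host enters the crawl by some other path, its own same-host links would still pass the filter.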

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
