nutch-dev mailing list archives

From "Yogendra Kumar Soni (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2124) redirect following same link again and again , max redirect exceed and went db_gone
Date Mon, 05 Oct 2015 13:55:27 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943382#comment-14943382
] 

Yogendra Kumar Soni commented on NUTCH-2124:
--------------------------------------------

Hello Sebastian,
I applied the patch, but the problem is still there. I have not investigated yet; I will get back
after finding the cause.
There are some more issues: some sites use redirection to obtain a session ID (cookies).
The request may be redirected to a domain that we don't know in advance and then redirected back to the
original URL with session cookies. If, when follow redirect is enabled, we follow redirects until
HTTP status 200 and only then apply the URL filters, these kinds of sites can be crawled.
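
As a rough sketch of that idea (this is not the Nutch Fetcher code; the class RedirectSketch, the method resolve, and the example URLs are all hypothetical), the redirect chain is followed first and the URL filter is applied only to the final URL. A plain map stands in for the HTTP responses, a URL with no entry stands in for a status 200 response, and cookies are not modeled, so the session shows up as a query parameter here:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical illustration only; names and structure are assumptions. */
class RedirectSketch {

    /**
     * Follows redirects from start until a URL that does not redirect
     * (standing in for HTTP 200), then applies urlFilter to that final
     * URL only. Intermediate hops are never filtered.
     *
     * Returns null if the chain revisits a URL, needs more than
     * maxRedirects hops, or ends on a URL the filter rejects.
     */
    static String resolve(String start, Map<String, String> redirects,
                          int maxRedirects, Predicate<String> urlFilter) {
        Set<String> seen = new HashSet<>();
        String current = start;
        int hops = 0;
        while (redirects.containsKey(current)) {
            // Stop on a repeated URL instead of refetching it until
            // the redirect limit is reached.
            if (!seen.add(current) || hops >= maxRedirects) {
                return null;
            }
            current = redirects.get(current);
            hops++;
        }
        // Filter only the URL that finally answered without a redirect.
        return urlFilter.test(current) ? current : null;
    }
}
```

With a chain such as http://site.example/page, then http://login.example.net/session, then http://site.example/page?sid=1, a filter that accepts only site.example still lets the page through, because the unknown login domain is just an intermediate hop. A URL that redirects to itself is given up on the second visit.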

> redirect following same link again and again , max redirect exceed and went db_gone
> -----------------------------------------------------------------------------------
>
>                 Key: NUTCH-2124
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2124
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.11
>            Reporter: Yogendra Kumar Soni
>            Priority: Blocker
>              Labels: db_gone, fetcher, redirect
>             Fix For: 1.11
>
>         Attachments: NUTCH-2124.patch
>
>
> Hello, followredirect is not working in trunk. Please see the log below.
> Fetcher: throughput threshold retries: 5
> fetcher.maxNum.threads can't be < than 50 : using 50 instead
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
> {color:red}
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl delay=5000ms)
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl delay=5000ms)
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=2
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl delay=5000ms)
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=2
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl delay=5000ms)
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl delay=5000ms)
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=2
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=2
>  - redirect count exceeded http://www.wikipedia.com/wiki/URL_redirection
> {color}
> Thread FetcherThread has no more work available
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=2
> -activeThreads=0
> Fetcher: finished at 2015-09-28 19:32:05, elapsed: 00:00:09
> Parsing : 20150928193153



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
