nutch-dev mailing list archives

From "Sujen Shah (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher
Date Mon, 18 May 2015 10:05:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547812#comment-14547812 ]

Sujen Shah commented on NUTCH-2011:
-----------------------------------

Hi [~wastl-nagel], 
Just to add a little to Asitang's reply:

- "fetch round" means one fetch job; this corresponds to a single "bin/nutch fetch ...."
invocation in the crawl script.

- "greater depth fetch rounds" means longer fetch lists, corresponding to higher iteration
numbers as specified in the noOfRounds parameter when running the bin/crawl script.

As Asitang mentioned, the FetchNodeDb is used to make a D3 graph (currently a BFS tree) that
shows the progress of the crawl. To build the graph, we need the "round" (or iteration) in
which each node was fetched.
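
To illustrate why the round is needed, here is a minimal sketch of turning fetched nodes into
the nested structure a D3 tree layout typically consumes. The record fields ("url", "parent",
"round") and the function name are hypothetical and are not the actual FetchNodeDb schema; the
sketch also assumes every URL has exactly one parent.

```python
from collections import defaultdict


def build_crawl_tree(nodes, root_url):
    """Nest fetched nodes under their parents, keeping the round on each node.

    `nodes` is a list of dicts with the hypothetical keys
    'url', 'parent' (None for the seed) and 'round'.
    """
    children = defaultdict(list)
    rounds = {}
    for node in nodes:
        rounds[node["url"]] = node["round"]
        if node["parent"] is not None:
            children[node["parent"]].append(node["url"])

    def subtree(url):
        # Each level of the tree corresponds to one fetch round in a BFS crawl.
        return {
            "name": url,
            "round": rounds[url],
            "children": [subtree(child) for child in children[url]],
        }

    return subtree(root_url)


# Hypothetical sample data: one seed, two round-2 pages, one round-3 page.
sample = [
    {"url": "seed", "parent": None, "round": 1},
    {"url": "a", "parent": "seed", "round": 2},
    {"url": "b", "parent": "seed", "round": 2},
    {"url": "c", "parent": "a", "round": 3},
]
tree = build_crawl_tree(sample, "seed")
```

The resulting dict can be serialized with json.dumps and handed to the client; without the
"round" value there is no way to tell at which iteration a node entered the tree.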

There was some initial discussion about modifying the CrawlDb to hold one more parameter, the
round number. But since the FetchNodeDb was created to store real-time information, the idea
of modifying the CrawlDb was dropped.

One point regarding your comment on NUTCH-2015: the FetchNodes are stored in an enumerated
manner so that the client can paginate its requests and reduce the amount of bandwidth used.
This was done to handle client-side failures in large crawls. This option is not currently
supported by any of the persistent databases used (CrawlDb, LinkDb, etc.).
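
As a rough sketch of the pagination idea (not the actual endpoint code), assume the fetched
nodes are kept in a map keyed by a running integer index. The names get_page, iter_pages and
sample_nodes below are hypothetical.

```python
def get_page(fetch_nodes, start, end):
    """Return the entries whose index lies in [start, end], inclusive."""
    return {i: fetch_nodes[i] for i in range(start, end + 1) if i in fetch_nodes}


def iter_pages(fetch_nodes, page_size, start=1):
    """Yield successive pages, stopping at the first empty one."""
    while True:
        page = get_page(fetch_nodes, start, start + page_size - 1)
        if not page:
            break
        yield page
        start += page_size


# Hypothetical store of seven fetched nodes, indexed from 1.
sample_nodes = {i: {"url": f"http://example.com/{i}"} for i in range(1, 8)}
pages = list(iter_pages(sample_nodes, page_size=3))
```

Because the indices are stable, a client that fails partway through can resume from the last
index it received instead of requesting everything again.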


> Endpoint to support realtime JSON output from the fetcher
> ---------------------------------------------------------
>
>                 Key: NUTCH-2011
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2011
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher, REST_api
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>
> This fix will create an endpoint to query the Nutch REST service and get a real-time
> JSON response of the current/past fetched URLs.
> This endpoint also includes pagination of the output to reduce data transfer bandwidth in
> large crawls.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
