nutch-dev mailing list archives

From "Sujen Shah (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher
Date Sat, 16 May 2015 08:59:59 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546650#comment-14546650 ]

Sujen Shah commented on NUTCH-2011:
-----------------------------------

Thank you for your input, [~wastl-nagel].

I agree the feature should be optional and off by default and have submitted a PR to handle
that - https://issues.apache.org/jira/browse/NUTCH-2015

The FetchNodeDb is required as part of getting real-time information from the fetcher while
it is running. Currently the CrawlDb, LinkDb and segments give page-related information only when the
entire fetch round is complete. Fetch rounds at greater depths could go on for hours without giving
out any information about which pages are being crawled until the round is complete. To address this
issue, real-time reporting functionality was needed, hence the FetchNodeDb.

Currently the database is in-memory, but we are brainstorming ways to make it persistent
to reduce memory usage. Some ideas are:
1. Write the FetchNodeDb to a file after each fetch round (i.e. at each depth level).
2. Keep a threshold on the number of FetchNodes within the Db and then dump them to a file
(similar to crawldb/linkdb).
What would be your suggestions to achieve the above?
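
To make the two ideas above concrete, here is a minimal sketch of a store that spills to a file once a threshold is reached (idea 2) and can also be flushed explicitly at the end of a fetch round (idea 1). Everything in it is an assumption made for illustration: the class name SpillingFetchNodeStore, the tab-separated line format, and the use of a local file are hypothetical, and this is not the actual FetchNodeDb code. A real implementation would more likely write to the Hadoop filesystem.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for an in-memory node store; not the real Nutch FetchNodeDb.
class SpillingFetchNodeStore {
  private final int threshold;   // assumed tunable: max nodes held in memory
  private final Path spillFile;  // assumed local file used for persisted nodes
  private final List<String> buffer = new ArrayList<>();
  private long spilled = 0;

  SpillingFetchNodeStore(int threshold, Path spillFile) {
    if (threshold < 1) {
      throw new IllegalArgumentException("threshold must be >= 1");
    }
    this.threshold = threshold;
    this.spillFile = spillFile;
  }

  // Record one fetched URL; spill to disk as soon as the threshold is reached (idea 2).
  synchronized void put(String url, int status) {
    buffer.add(url + "\t" + status);
    if (buffer.size() >= threshold) {
      flush();
    }
  }

  // Append all buffered nodes to the file and clear memory.
  // Calling this at the end of each fetch round corresponds to idea 1.
  synchronized void flush() {
    if (buffer.isEmpty()) {
      return;
    }
    try (BufferedWriter out = Files.newBufferedWriter(spillFile, StandardCharsets.UTF_8,
        StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
      for (String line : buffer) {
        out.write(line);
        out.newLine();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    spilled += buffer.size();
    buffer.clear();
  }

  // Read back the nodes that were already written to the file.
  synchronized List<String> readSpilled() {
    if (!Files.exists(spillFile)) {
      return Collections.emptyList();
    }
    try {
      return Files.readAllLines(spillFile, StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  synchronized int inMemoryCount() { return buffer.size(); }

  synchronized long spilledCount() { return spilled; }
}
```

With this shape, memory use is bounded by the threshold, while a real-time endpoint would have to combine the in-memory buffer with the spilled file to report past URLs.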

Regarding reporting to the FetchNodeDb when fetcher.parse is false (off): this was initially
intended, as the output from the FetchNodeDb is going to be used to create a D3 graph which
would update dynamically. For the graph, we require the outlinks from a URL, which we can
get only after parsing it.
But you have correctly pointed out that the construction of FetchNodes is useless work when
fetcher.parse is off (default config). I will create a patch for this and submit a PR. 
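
As a rough illustration of what such a patch might do, the sketch below skips all FetchNode construction unless fetcher.parse is enabled. It is only an assumption about the shape of the fix: FetchNodeReporter is a hypothetical class, and java.util.Properties stands in for the Hadoop Configuration object that the fetcher actually reads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch; the real fetcher reads a Hadoop Configuration, not Properties.
class FetchNodeReporter {
  private final boolean parsing;
  private final List<String> nodes = new ArrayList<>();

  FetchNodeReporter(Properties conf) {
    // fetcher.parse is off in the default config, so default to false here too.
    this.parsing = Boolean.parseBoolean(conf.getProperty("fetcher.parse", "false"));
  }

  // Called once per fetched page. Outlinks are only known after parsing,
  // so nothing is built or stored when parsing is off.
  void report(String url, List<String> outlinks) {
    if (!parsing) {
      return;
    }
    nodes.add(url + " -> " + outlinks);
  }

  int size() { return nodes.size(); }
}
```

Checking the flag once in the constructor keeps the per-page cost of the disabled case to a single boolean test.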

Thanks

> Endpoint to support realtime JSON output from the fetcher
> ---------------------------------------------------------
>
>                 Key: NUTCH-2011
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2011
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher, REST_api
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>
> This fix will create an endpoint to query the Nutch REST service and get a real-time
> JSON response of the current/past Fetched URLs.
> This endpoint also includes pagination of the output to reduce data transfer bw in large
> crawls.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
