nutch-dev mailing list archives

From "Asitang Mishra (JIRA)" <>
Subject [jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher
Date Mon, 18 May 2015 09:43:00 GMT


Asitang Mishra commented on NUTCH-2011:

Hi [~wastl-nagel],
- The answer to your first two questions is yes; your interpretations are correct.
- Third question: the FetchNodeDb info will be used to build a D3 graph that shows, in real
time, which page is being fetched and, if it was fetched successfully, which outlinks it
generated. We need to output this as a visualization before the data is written into
the segments.
- I agree that we don't need an extra persistence layer, as all the data is already stored
segment-wise, which is the same as "round wise"; [~chrismattmann] and I had discussed this before.
- Although a buffer queue is an appealing idea, we are not using it because we wanted
to make things more RESTful (so the user/graph can request pages from any index to any index
from the temporary store/NodeDb, or all the data from any previously updated specific segment).
Also, in case of a failure, if the program requests the nodes again and the buffer queue no
longer has them, we will have to wait for the round to end and read them from the segment.
But I guess we can delve into [~wastl-nagel]'s idea if some strict or cautionary measures
are taken at the client side :). What do you think, [~chrismattmann] and [~sujenshah]?
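
To make the trade-off above concrete, here is a minimal sketch of an index-addressable in-memory store, as opposed to a queue that drains on read. It is only an illustration under assumptions: the class name FetchNodeStoreSketch and the methods add and page are hypothetical and are not the actual FetchNodeDb or Nutch REST API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not Nutch code: an append-only store that lets a
// client request any index range, and request it again after a failure.
class FetchNodeStoreSketch {
    private final List<String> nodes = new ArrayList<>();

    // Record a fetched URL and return the index it was stored at.
    public synchronized int add(String url) {
        nodes.add(url);
        return nodes.size() - 1;
    }

    // Return a copy of the entries in [from, to), clamped to the current size.
    // Reading does not remove anything, unlike polling a buffer queue.
    public synchronized List<String> page(int from, int to) {
        int start = Math.max(0, from);
        int end = Math.min(nodes.size(), to);
        if (start >= end) {
            return Collections.emptyList();
        }
        return new ArrayList<>(nodes.subList(start, end));
    }

    public static void main(String[] args) {
        FetchNodeStoreSketch store = new FetchNodeStoreSketch();
        store.add("http://example.com/a");
        store.add("http://example.com/b");
        store.add("http://example.com/c");

        // First request for the page.
        System.out.println(store.page(0, 2));
        // A retry of the same range still succeeds because nothing was drained.
        System.out.println(store.page(0, 2));
        // A range past the end is clamped rather than failing.
        System.out.println(store.page(2, 10));
    }
}
```

With a draining queue, the second request for the same range would find nothing, and the client would have to wait for the round to end and read the segment instead.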


> Endpoint to support realtime JSON output from the fetcher
> ---------------------------------------------------------
>                 Key: NUTCH-2011
>                 URL:
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher, REST_api
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
> This fix will create an endpoint to query the Nutch REST service and get a real-time
> JSON response of the current/past Fetched URLs.
> This endpoint also includes pagination of the output to reduce data transfer bw in large

This message was sent by Atlassian JIRA
