nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Reopened] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher
Date Fri, 15 May 2015 22:17:00 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel reopened NUTCH-2011:
------------------------------------

Sorry, but this needs some rework:
- after 35,000+ fetched pages with the default max. heap size of 1000M, the fetcher becomes slow
and throws mainly parser timeouts and caught OOM exceptions. Only small HTML pages with few
outlinks per page were crawled - the limit is reached sooner if there are many overlong
outlinks or big PDF documents.
- why an in-memory "database" of page-related information (URL, title, outlinks + anchor texts)?
-- all information is available in CrawlDb, LinkDb, segments
-- MapReduce job counters provide instant progress information (e.g., number of fetched pages)
-- if required, a queue of limited total size should be used
- in any case, this feature should be optional and off by default if NutchServer is not used
- "reporting" to FetchNodeDb is off if fetcher.parse is false (the default)? Is this intended?
In that case, constructing FetchNodes is useless work.
- no trace output to System.out: "FetchNodeDb : putting node ..."
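
The suggestion above to use a queue of limited total size could look roughly like the
following sketch. It is not Nutch code: the class name BoundedFetchLog and its methods are
hypothetical, and "limited size" is assumed here to mean a maximum number of entries rather
than a byte budget. Once the buffer is full, the oldest entry is dropped, so heap usage stays
bounded regardless of how many pages are fetched.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical bounded buffer of recently fetched URLs (illustration only, not part of Nutch). */
class BoundedFetchLog {
    private final int capacity;
    private final Deque<String> urls = new ArrayDeque<>();

    BoundedFetchLog(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Adds a URL, evicting the oldest entry first if the buffer is already full. */
    synchronized void put(String url) {
        if (urls.size() == capacity) {
            urls.removeFirst();
        }
        urls.addLast(url);
    }

    /** Returns a copy of the current contents, oldest entry first. */
    synchronized List<String> snapshot() {
        return new ArrayList<>(urls);
    }

    synchronized int size() {
        return urls.size();
    }
}
```

Synchronizing the methods is the simplest way to make the buffer safe for several fetcher
threads; a real implementation might prefer a concurrent collection.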

> Endpoint to support realtime JSON output from the fetcher
> ---------------------------------------------------------
>
>                 Key: NUTCH-2011
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2011
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher, REST_api
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>
> This fix will create an endpoint to query the Nutch REST service and get a real-time
> JSON response of the current/past fetched URLs.
> This endpoint also includes pagination of the output to reduce data transfer bandwidth in large
> crawls.
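
As a rough illustration of the pagination mentioned in the issue description, the sketch
below returns one page of a result list. The class Pager and the parameter names start and
pageSize are hypothetical; the actual endpoint's parameters are not given in this message.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical pagination helper (illustration only, not the NUTCH-2011 implementation). */
class Pager {
    /** Returns up to pageSize items beginning at index start, or an empty list past the end. */
    static <T> List<T> page(List<T> items, int start, int pageSize) {
        if (start < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("start must be >= 0 and pageSize > 0");
        }
        if (start >= items.size()) {
            return Collections.emptyList();
        }
        // Compute the end index in long arithmetic to avoid int overflow.
        int end = (int) Math.min((long) items.size(), (long) start + pageSize);
        return new ArrayList<>(items.subList(start, end));
    }
}
```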



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
