nutch-dev mailing list archives

From "Prasanth Iyer (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-1892) Update the FileDumper tool to fetch only those URLs with status db_fetched in nutch
Date Wed, 26 Nov 2014 21:48:12 GMT
Prasanth Iyer created NUTCH-1892:
------------------------------------

             Summary: Update the FileDumper tool to fetch only those URLs with status db_fetched in nutch
                 Key: NUTCH-1892
                 URL: https://issues.apache.org/jira/browse/NUTCH-1892
             Project: Nutch
          Issue Type: Improvement
          Components: nutchNewbie
    Affects Versions: 2.2.1
            Reporter: Prasanth Iyer


The FileDumper tool reads the crawled data from Nutch and dumps it as raw files. It currently
dumps every file regardless of fetch status, duplicates, etc. As a result, files that were
fetched in error, or that were never fetched because the server made them unavailable, are
also dumped.

The fix should be to dump only those files that Nutch fetched with status db_fetched.
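
A minimal sketch of the proposed filtering, for illustration only. It does not use the
Nutch API: the class DumpFilterSketch, the method selectDumpable, and the URL-to-status map
are all hypothetical, and the status strings other than db_fetched are example values. The
idea is just that the dump step skips any entry whose recorded status is not db_fetched.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from the Nutch code base.
class DumpFilterSketch {

    // Status label taken from the issue text; how Nutch stores it internally is not assumed here.
    static final String DB_FETCHED = "db_fetched";

    // Returns the URLs whose recorded status is db_fetched, in the iteration order of the input map.
    static List<String> selectDumpable(Map<String, String> statusByUrl) {
        List<String> dumpable = new ArrayList<>();
        for (Map.Entry<String, String> entry : statusByUrl.entrySet()) {
            if (DB_FETCHED.equals(entry.getValue())) {
                dumpable.add(entry.getKey());
            }
        }
        return dumpable;
    }

    public static void main(String[] args) {
        // Example data only; a real tool would read statuses from the crawl data.
        Map<String, String> statusByUrl = new LinkedHashMap<>();
        statusByUrl.put("http://example.com/a.pdf", "db_fetched");
        statusByUrl.put("http://example.com/b.pdf", "db_gone");
        statusByUrl.put("http://example.com/c.pdf", "db_unfetched");
        statusByUrl.put("http://example.com/d.pdf", "db_fetched");

        for (String url : selectDumpable(statusByUrl)) {
            System.out.println("would dump: " + url);
        }
    }
}
```

With the example data above, only a.pdf and d.pdf would be selected for dumping; the entries
marked db_gone and db_unfetched are skipped.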



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
