nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <>
Subject [jira] Updated: (NUTCH-451) Tool to recover partial fetcher output
Date Mon, 26 Feb 2007 19:44:06 GMT


Andrzej Bialecki  updated NUTCH-451:


> Tool to recover partial fetcher output
> --------------------------------------
>                 Key: NUTCH-451
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
>         Attachments:
> This class may help you to recover partial data from a failed Fetcher run. 
> NOTE 1: this works ONLY if you ran Fetcher on the "local" file system, i.e. without DFS.
> Partial output written to DFS is permanently lost if a process fails to close its
> output streams properly.
> NOTE 2: if Fetcher was stopped abruptly (killed or crashed), the partial SequenceFile-s
> will be corrupted at the end. Not all data can be recovered from them; most likely only
> the data up to the last sync marker is recoverable.
> The recovery process requires some preparation: 
> * determine the map directories that hold the map task outputs of the failed job.
>   These directories contain SequenceFile-s of <Text, FetcherOutput> pairs,
>   named e.g. part-0.out, file.out, or spill0.out.
> * create a new input directory, say input/, and copy all the SequenceFile-s into it,
>   renaming them sequentially like this: 
>   input/part-00000
>   input/part-00001
>   input/part-00002
>   input/part-00003
>   ...
> * specify the "input" directory as the input to this tool. 
> If all goes well, a new segment will be created as a subdirectory of the output dir.
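
The copy-and-rename step described above can be scripted. The following is a minimal sketch in Python, not part of the attached tool: the function name stage_sequence_files and the file-name patterns are assumptions for illustration (the patterns are taken from the "e.g." names in the message and may not match what your job actually produced).

```python
import shutil
from pathlib import Path

# Assumed file-name patterns, taken from the examples in the message
# (part-0.out, file.out, spill0.out); adjust them to what your job produced.
PATTERNS = ("part-*.out", "file.out", "spill*.out")


def stage_sequence_files(map_dirs, input_dir, patterns=PATTERNS):
    """Copy partial map outputs into input_dir as part-00000, part-00001, ..."""
    input_dir = Path(input_dir)
    input_dir.mkdir(parents=True, exist_ok=True)

    # Collect matching files in a stable order so the numbering is repeatable.
    sources = []
    for map_dir in map_dirs:
        for pattern in patterns:
            sources.extend(sorted(Path(map_dir).glob(pattern)))

    staged = []
    for index, source in enumerate(sources):
        target = input_dir / f"part-{index:05d}"
        shutil.copyfile(source, target)  # copy, so the originals stay untouched
        staged.append(target)
    return staged
```

Calling stage_sequence_files with the list of map directories and "input" as the target would produce input/part-00000, input/part-00001, and so on. The files are copied rather than moved so that the original map outputs remain available if the recovery has to be repeated.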

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
