nutch-dev mailing list archives

From Markus Jelsma
Subject Don't use segments dir for crawl_generate /tmp file?
Date Tue, 08 Nov 2011 21:08:02 GMT

We have a cron job that checks for segments ready to fetch, but we cannot 
reliably start fetching a generated segment without checking whether its 
crawl_generate dir contains a tmp file. This means first checking for the 
presence of a segment dir and then checking for the tmp file. From bash with 
hadoop this takes quite a while, so we would prefer to check only for the 
presence of a dir and then start the fetch.

We can either:
- modify the generator to move finished segments to another directory in which 
we know only fully generated segments are present;
- stop writing the tmp file in the segment dir's crawl_generate; keep the tmp 
file in ~ and move it to the target dir only once it's actually finished.

Any thoughts? I prefer the latter approach - it's not uncommon for Nutch to 
write tmp files in ~.
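
To make the second option concrete, here is a rough sketch of the 
write-elsewhere-then-rename idea. It is not the Generator's code: it uses the 
local filesystem as a stand-in for HDFS, the names publish_segment and 
ready_segments are hypothetical, and it assumes the staging area and the 
segments dir are on the same filesystem so that the rename is a single atomic 
step.

```python
import os
import tempfile
from pathlib import Path
from typing import List


def publish_segment(staging_dir: Path, segments_dir: Path, name: str) -> Path:
    """Move a fully written staging dir into segments_dir with one rename.

    Hypothetical helper: assumes both dirs are on the same filesystem, so a
    reader never observes a half-written segment under segments_dir.
    """
    target = segments_dir / name
    os.rename(staging_dir, target)
    return target


def ready_segments(segments_dir: Path) -> List[str]:
    """List segments that are safe to fetch: every dir present is complete."""
    return sorted(p.name for p in segments_dir.iterdir() if p.is_dir())


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    segments = root / "segments"
    staging = root / "staging" / "20111108210802"  # made-up segment name
    segments.mkdir()
    (staging / "crawl_generate").mkdir(parents=True)

    # While the generator is still writing, the segments dir stays empty.
    (staging / "crawl_generate" / "part-00000").write_text("fetch list\n")
    print(ready_segments(segments))

    # Only a finished segment becomes visible to the cron check.
    publish_segment(staging, segments, staging.name)
    print(ready_segments(segments))
```

With this layout the cron job only has to test whether a segment dir exists, 
which is the single check we would like to be left with.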

