nutch-dev mailing list archives

From "Gabriele Kahlout (JIRA)" <>
Subject [jira] [Created] (NUTCH-972) Mergedb doesn't merge with empty directory, as is the case with merge (for indexes)
Date Sun, 27 Mar 2011 09:12:05 GMT
Mergedb doesn't merge with empty directory, as is the case with merge (for indexes)

                 Key: NUTCH-972
             Project: Nutch
          Issue Type: Bug
          Components: storage
    Affects Versions: 1.2
            Reporter: Gabriele Kahlout
            Priority: Minor
             Fix For: 1.3

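This is a case of unexpected behavior. The series of commands below works with bin/nutch merge
(for indexes) but fails with bin/nutch mergedb (for crawldbs) when one of the input
directories ($allcrawldb here) is empty or does not exist yet.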

merge_dbs="$it_crawldb $allcrawldb"
# Workaround: uncomment the following block and mergedb will work fine.
# if [[ ! -d $allcrawldb ]]
# then
# 	merge_dbs="$it_crawldb"
# fi
bin/nutch mergedb $temp_crawldb $merge_dbs
rm -r $it_crawldb $allcrawldb crawl/segments crawl/linkdb
mv $temp_crawldb $allcrawldb

This is the exception that occurs:

bin/nutch mergedb crawl/temp_crawldb crawl/crawldb crawl/allcrawldb
CrawlDb merge: starting at 2011-03-27 10:13:06
Adding crawl/crawldb
Adding crawl/allcrawldb
CrawlDb merge: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(
	at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(
	at org.apache.hadoop.mapred.JobClient.writeOldSplits(
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(
	at org.apache.hadoop.mapred.JobClient.submitJob(
	at org.apache.hadoop.mapred.JobClient.runJob(
	at org.apache.nutch.crawl.CrawlDbMerger.merge(
	at org.apache.nutch.crawl.CrawlDbMerger.main(

Besides the scripting workaround, I've attached a patch that skips adding the empty folder
to the collection of dbs to merge. It also logs which dbs actually get added, consistent
with the merge interface.
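
For anyone who cannot apply the patch, the same skip-and-log idea can be sketched on the
scripting side. This is not the attached patch (which changes the Java code), only a rough
shell approximation of it. The function name filter_dbs is hypothetical, and treating
"missing" and "empty" directories alike is an assumption on my part.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Nutch: print only those candidate
# crawldb paths that exist as non-empty directories, one per line.
# Each decision is logged to stderr so stdout stays machine-readable.
filter_dbs() {
  local db
  for db in "$@"; do
    if [ -d "$db" ] && [ -n "$(ls -A "$db")" ]; then
      echo "Adding $db" >&2
      printf '%s\n' "$db"
    else
      echo "Skipping $db (missing or empty)" >&2
    fi
  done
}

# Possible usage in the script above (assumes paths without whitespace,
# since the result is deliberately word-split when passed to mergedb):
#   merge_dbs=$(filter_dbs "$it_crawldb" "$allcrawldb")
#   bin/nutch mergedb $temp_crawldb $merge_dbs
```

Doing the check in a function keeps the calling script unchanged apart from how merge_dbs
is built, and it extends to any number of input dbs.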
