nutch-dev mailing list archives

From Stefan Groschupf
Subject Re: MapReduce WebDB writer
Date Sat, 26 Mar 2005 14:12:03 GMT

> Actually I have already partially implemented a MapReduce WebDB
> writer. I don't know whether anyone else is working on this now.
Well, I'm not working on it, but I have also spent some days reading the
code and playing around.
I'm very interested in helping as well, though I'm sure I'm some
steps behind you.
> Comments?
In general, I don't clearly understand the idea behind a "master" and
the MapredWebDBCommitter.
Isn't this handled by the jobtracker and the job itself?
If you browse the Grep job, you can see that it contains both a
grepJob and a sortJob, so you are able to manage 'flows' in the
job itself.
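
To illustrate what I mean by managing the flow in the job itself, here is a rough single-process sketch in plain Python. It is not the Nutch MapReduce API: `run_job`, `grep_driver` and the helper functions are names I made up, and `run_job` only imitates the map, group and reduce steps of one job. The point is that the driver runs a grep stage and then feeds its output into a sort stage, without a separate master.

```python
import re
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Toy stand-in for one MapReduce job (hypothetical, not the Nutch API)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    # Keys reach the reducer in sorted order, as in a real reduce phase.
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def grep_driver(lines, pattern):
    """Driver that chains two jobs: count the matches, then sort by count."""
    regex = re.compile(pattern)

    def grep_map(line):
        for match in regex.finditer(line):
            yield match.group(0), 1

    def sum_reduce(key, values):
        yield key, sum(values)

    def invert_map(pair):
        text, count = pair
        # Negate the count so that sorted keys give descending counts.
        yield -count, text

    def emit_reduce(key, values):
        for text in sorted(values):
            yield text, -key

    counts = run_job(lines, grep_map, sum_reduce)      # first job: grep
    return run_job(counts, invert_map, emit_reduce)    # second job: sort


if __name__ == "__main__":
    print(grep_driver(["ab abb", "abb x", "ab abb"], r"ab+"))
```

The second stage only starts once the first has returned, so the ordering of the two jobs lives entirely in the driver.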

Wouldn't it make sense to do the MapReduce WebDB in a similar way?
As mentioned, I just played around and may have missed something, however
I was thinking of doing it like this:

* create an input format for the segment file(s).
* write a mapper that creates several small unsorted webdbs.
* write a combiner that merges these small webdbs with the existing
webdb into a temp webdb.
* write a reducer that is able to sort and merge the entries of the
temp webdb.
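
The steps above could be sketched roughly as follows, again in plain Python and not with the real WebDB classes. Everything here is an assumption for illustration: a webdb is modelled as a list of `(url, page)` pairs, a segment as a list of records with `url` and `fetched` fields, and duplicates are resolved by keeping the most recently fetched page. The function names are made up.

```python
def map_segment(segment_records):
    """Mapper: turn one segment into a small, unsorted partial webdb."""
    return [(rec["url"], {"fetched": rec["fetched"]}) for rec in segment_records]


def combine(partial_dbs, existing_db):
    """Combiner: put the partial webdbs and the existing webdb into one
    temp webdb, still unsorted and possibly containing duplicate URLs."""
    temp_db = list(existing_db)
    for partial in partial_dbs:
        temp_db.extend(partial)
    return temp_db


def reduce_temp(temp_db):
    """Reducer: merge duplicate URLs and return the entries sorted by URL.
    Keeping the newest page per URL is an assumed merge rule."""
    merged = {}
    for url, page in temp_db:
        current = merged.get(url)
        if current is None or page["fetched"] > current["fetched"]:
            merged[url] = page
    return sorted(merged.items())


def update_webdb(existing_db, segments):
    """Driver: run the three stages in order."""
    partials = [map_segment(segment) for segment in segments]
    return reduce_temp(combine(partials, existing_db))


if __name__ == "__main__":
    existing = [("http://b/", {"fetched": 1})]
    segments = [
        [{"url": "http://c/", "fetched": 5}, {"url": "http://b/", "fetched": 7}],
        [{"url": "http://a/", "fetched": 3}],
    ]
    for url, page in update_webdb(existing, segments):
        print(url, page)
```

In a real job the sorting would come from the framework's sort between map and reduce rather than from a call to `sorted`, so this only shows the data flow.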

As mentioned, maybe I missed something, but since the job itself is a
kind of master, the processes can be managed from the job.
Since all files would be written to an NDFS folder that is unique, it
may not be necessary to have any kind of id.

Anyway I would love to see the code you mentioned to understand your 

