nutch-dev mailing list archives

From Stefan Groschupf ...@media-style.com>
Subject Re: [Nutch-dev] Re: MapReduce WebDB writer
Date Sun, 27 Mar 2005 16:49:54 GMT
Hi Feng,

After reading your code, I think I understand your idea better and better.
Sorry for being so slow. :-)
To summarize in my own words: tools write edits tagged with a commit id, 
the edits are reduced via MapReduce and then merged into the webdb.
Right?
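
To check that I read this correctly, here is a minimal Java sketch of that flow. It is only my interpretation, not your code: the Edit class, the reduce() and merge() methods and the plain maps standing in for the webdb are all made up for illustration, and I am assuming that when two edits touch the same URL, the one with the higher commit id wins.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical edit record; not a class from Nutch or from the posted tarball.
final class Edit {
    final long commitId;   // id of the commit that produced this edit
    final String url;      // key of the affected page
    final String content;  // new page data, or null to delete the page

    Edit(long commitId, String url, String content) {
        this.commitId = commitId;
        this.url = url;
        this.content = content;
    }
}

final class EditMergeSketch {
    // "Reduce" step: per URL, keep only the edit with the highest commit id
    // (assumption: a later commit overrides an earlier one).
    static Map<String, Edit> reduce(List<Edit> edits) {
        Map<String, Edit> latest = new TreeMap<>();
        for (Edit e : edits) {
            Edit seen = latest.get(e.url);
            if (seen == null || e.commitId > seen.commitId) {
                latest.put(e.url, e);
            }
        }
        return latest;
    }

    // "Merge" step: apply the reduced edits to a sorted copy of the db.
    // The input db is left untouched.
    static TreeMap<String, String> merge(Map<String, String> db,
                                         Map<String, Edit> reduced) {
        TreeMap<String, String> out = new TreeMap<>(db);
        for (Edit e : reduced.values()) {
            if (e.content == null) {
                out.remove(e.url);
            } else {
                out.put(e.url, e.content);
            }
        }
        return out;
    }
}
```

In this toy version, two edits for the same URL with commit ids 1 and 2 collapse to the second one before the merge, and a delete edit removes the page from the merged copy.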

However, wouldn't it be useful to have tools write small, independent edit 
files that can be merged with the webdb at any time?
That would couple the tools and the webdb less tightly. The disk space used 
is the same either way.

Anyway, as far as my limited knowledge goes, your code makes a lot of sense. 
I suggest not changing the webdb interface, but duplicating it and changing 
the copy.
I don't think breaking the code of the existing tools is a good idea.
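
Roughly what I have in mind, as a sketch only: the interface names below are stand-ins I invented, addPage() is simplified to take a String, and the extra commitId parameter and the adapter are just one possible way to keep old tools working against a changed copy.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the existing writer interface, left exactly as it is.
// Simplified: only addPage() is shown and it takes a plain String.
interface LegacyWebDBWriter {
    void addPage(String url);
}

// Hypothetical duplicate that is free to change, here by adding a commit id.
interface NewWebDBWriter {
    void addPage(String url, long commitId);
}

// Adapter: existing tools keep calling the old interface, and every call is
// forwarded to the new one with a fixed commit id.
final class LegacyWriterAdapter implements LegacyWebDBWriter {
    private final NewWebDBWriter target;
    private final long commitId;

    LegacyWriterAdapter(NewWebDBWriter target, long commitId) {
        this.target = target;
        this.commitId = commitId;
    }

    @Override
    public void addPage(String url) {
        target.addPage(url, commitId);
    }
}

// Toy implementation of the new interface that only records what it receives.
final class RecordingWriter implements NewWebDBWriter {
    final List<String> log = new ArrayList<>();

    @Override
    public void addPage(String url, long commitId) {
        log.add(commitId + ":" + url);
    }
}
```

A tool written against the old interface would then be handed a LegacyWriterAdapter and would not need to change at all.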

Stefan




On 26 Mar 2005, at 23:55, Feng Zhou wrote:

> Hi Stefan,
>
> I've posted the code at
> http://security-gate.cs.berkeley.edu/~zf/nutch/mrdb-test.tar.gz. It
> won't compile because I changed a few other bits of the mapreduce
> code. But it should be enough for explanatory purposes.
>
>> In general I do not clearly understand the idea behind a "master" and
>> the MapredWebDBCommitter.
>> Isn't this handled by the jobtracker and the job itself?
>> If you browse the Grep job, you can see that the job itself contains
>> both the grepJob and the sortJob, so you are able to manage 'flows'
>> within the job itself.
>
> By "master" I mean the node starting the MapReduce process, i.e.
> calling JobClient.runJob(). Sorry I didn't explain it (it's from the
> mapreduce paper). The reason to add another class is that both the
> master and the workers need a way to reference the generic webdb writer.
> In its current form, the master will access the committer and the workers
> will access their respective writers. Certainly this breaks the IWebDB
> contract. But it still seems close enough.
>
>>
>> * create an inputformat for the segment file(s).
>> * write a mapper that creates several small unsorted webdbs.
>> * write a combiner that merges these small webdbs with the existing
>> webdb into a temp webdb.
>> * write a reducer that is able to sort and merge the entries of the
>> temp webdb.
>
> To understand you better, the "segment files" that the inputformat reads
> refer to fetch results, right? If yes, you are referring to what the
> "updatedb" tool will do, right? I'm thinking a little bit differently,
> keeping as much of the current WebDBWriter interface as possible. That
> is, the tool will not read/write the DB all by itself. It will still call
> methods like dbwriter.addPage() to write to the DB. This way you don't
> have to write the whole MapReduce process all over again to do another
> kind of mutation of the DB. Apart from that difference, my code kinda
> does the same thing, although I didn't use a combiner. All merging work
> is done in reduction.
>
> - Feng
>
>>
>> As mentioned, maybe I missed something, but since the job itself is a
>> kind of master, the processes can be managed from the job.
>> Since all files would be written to a unique ndf folder, it may not
>> be necessary to have any kind of id.
>>
>> Anyway I would love to see the code you mentioned to understand your
>> ideas.
>>
>> Stefan
>>
>>
>
>
---------------------------------------------------------------
company:		http://www.media-style.com
forum:		http://www.text-mining.org
blog:			http://www.find23.net

