nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Fetcher2 Reduce Phase Question
Date Fri, 11 Apr 2008 22:32:40 GMT
Sandeep Tata wrote:
> Hi Folks,
> 
> I was just wondering what computation really happens in the reduce
> phase for Fetcher2 ?

If Fetcher is running in parsing mode, then in the reduce phase the
Outlinks are separated from the Parse output and stored in crawl_parse,
and the other data goes to parse_text and parse_data. This actually
happens in FetcherOutputFormat / ParseOutputFormat, so there is no need
for any Reducer apart from the IdentityReducer (the default).
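
To make the split concrete, here is a minimal sketch in plain Java. It
is not the Nutch code: ParsedPage and OutputSplitSketch are hypothetical
stand-ins, in-memory lists replace the real files, and the record layout
(what is used as key and value in each directory) is an assumption made
only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one parsed page; not a Nutch class.
class ParsedPage {
    final String text;
    final String title;
    final List<String> outlinks;

    ParsedPage(String text, String title, List<String> outlinks) {
        this.text = text;
        this.title = title;
        this.outlinks = outlinks;
    }
}

// Hypothetical stand-in for the output format: one incoming record is
// fanned out to several segment subdirectories (lists here, not files).
class OutputSplitSketch {
    final Map<String, List<String>> dirs = new LinkedHashMap<>();

    OutputSplitSketch() {
        for (String d : new String[] {"crawl_parse", "parse_text", "parse_data"}) {
            dirs.put(d, new ArrayList<String>());
        }
    }

    void write(String url, ParsedPage page) {
        // Plain text and parse metadata go to their own outputs.
        dirs.get("parse_text").add(url + "\t" + page.text);
        dirs.get("parse_data").add(url + "\t" + page.title);
        // Outlinks are separated out into crawl_parse
        // (assumed layout: one entry per outlink).
        for (String link : page.outlinks) {
            dirs.get("crawl_parse").add(link + "\tlinked from " + url);
        }
    }
}
```

The point is only that the fan-out lives in the writer, so the reducer
itself has nothing to do but pass records through.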

> 
> I know that it is implemented as a MapRunnable -- but I see no
> explicit reducer being set for the job. Is the identity reducer being
> used ? Why can't we simply use job.setNumReduceTasks(0) ?
> Wouldn't this be faster?

First, when Fetcher / Fetcher2 were written there was no such option in
Hadoop. Second, with this setting the output from the maps becomes the
final output, and that won't work here: map outputs are always plain
SequenceFiles, whereas we need to split the FetcherOutput into a set of
SequenceFiles and MapFiles (which have to be sorted) ...
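
The sorting requirement can be shown with a small sketch in plain Java.
SortedFileSketch is a hypothetical stand-in, not a Hadoop class: it only
mimics the rule that a MapFile-style writer refuses keys appended out of
order. The URLs are made-up sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a sorted, indexed file. It keeps only the
// keys and enforces that they arrive in non-decreasing order.
class SortedFileSketch {
    private final List<String> keys = new ArrayList<>();
    private String lastKey = null;

    void append(String key) {
        if (lastKey != null && key.compareTo(lastKey) < 0) {
            throw new IllegalStateException(
                "key out of order: " + key + " after " + lastKey);
        }
        keys.add(key);
        lastKey = key;
    }

    // Returns true if every key could be appended in the given order.
    static boolean writeAll(List<String> input) {
        SortedFileSketch file = new SortedFileSketch();
        try {
            for (String k : input) {
                file.append(k);
            }
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Maps emit records in whatever order pages were fetched.
        List<String> fetchOrder =
            Arrays.asList("http://c/", "http://a/", "http://b/");
        System.out.println("fetch order accepted:  " + writeAll(fetchOrder));

        // The sort before the reduce is what puts keys in order.
        List<String> sorted = new ArrayList<>(fetchOrder);
        Collections.sort(sorted);
        System.out.println("sorted order accepted: " + writeAll(sorted));
    }
}
```

Run as written, the first call reports false and the second true: the
same records are only writable once something has sorted them, which is
what the pass through the reduce phase gives us.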


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

