mahout-user mailing list archives

From Pradhuman Jhala <Pradhuman.Jh...@fox.com>
Subject mahout & hadoop compatibility
Date Fri, 05 Dec 2008 01:11:23 GMT
 
Just wondering whether Mahout is compatible with hadoop-0.18 and later versions. From hadoop
version 0.18 onwards, the combiner execution policy has changed, and the combiner can now be
executed twice: first on the Mapper side (on the output of the Mapper) and then again on the
Reducer side (on the output of the first Combiner).
 
For more details: http://issues.apache.org/jira/browse/HADOOP-3226

 
It seems to me that the k-means and canopy clustering in Mahout assume that the combiner gets
executed on the Mapper side only. This is a major source of error: when the Combiner gets
executed on the Reducer side, it cannot parse the output of the first Combiner correctly.
 
As a fix for hadoop-0.18.* only, if you want the combiner to run only on the output of the
mapper (as in earlier hadoop versions), add the following to your job config:
 
job.setCombineOnlyOnce(true); 
  
This method (setCombineOnlyOnce) is not available in the hadoop-0.19 release, so I think the
Mahout code needs to be changed to take care of this issue.
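
One possible direction for such a change is to give the combiner the same value type for its
input and its output, so that running it zero, one, or two times does not change what the
reducer computes. Below is a minimal plain-Java sketch of that idea. It is not Mahout's actual
code and does not use the Hadoop API: the class RepeatableCombinerSketch, the PartialSum type,
and the combine/reduce helpers are all hypothetical names made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Mahout or Hadoop code: a combiner whose output has
// the same shape as its input, so a second pass can read the first pass's output.
class RepeatableCombinerSketch {

    /** Partial result for one cluster: points seen so far and their coordinate-wise sum. */
    static final class PartialSum {
        final long count;
        final double[] sum;

        PartialSum(long count, double[] sum) {
            this.count = count;
            this.sum = sum;
        }

        /** Wraps a single point, as a mapper might emit it. */
        static PartialSum ofPoint(double... point) {
            return new PartialSum(1, point.clone());
        }
    }

    /** Combiner step: takes PartialSum values and returns a PartialSum. */
    static PartialSum combine(List<PartialSum> values) {
        long count = 0;
        double[] sum = new double[values.get(0).sum.length];
        for (PartialSum v : values) {
            count += v.count;
            for (int i = 0; i < sum.length; i++) {
                sum[i] += v.sum[i];
            }
        }
        return new PartialSum(count, sum);
    }

    /** Reducer step: merges whatever it receives, then divides to get the centroid. */
    static double[] reduce(List<PartialSum> values) {
        PartialSum total = combine(values);
        double[] centroid = new double[total.sum.length];
        for (int i = 0; i < centroid.length; i++) {
            centroid[i] = total.sum[i] / total.count;
        }
        return centroid;
    }

    public static void main(String[] args) {
        PartialSum a = PartialSum.ofPoint(1.0, 2.0);
        PartialSum b = PartialSum.ofPoint(3.0, 4.0);
        PartialSum c = PartialSum.ofPoint(5.0, 6.0);

        // Combiner never runs: the reducer sees raw mapper output.
        double[] none = reduce(Arrays.asList(a, b, c));

        // Combiner runs once, on the Mapper side.
        PartialSum mapSide1 = combine(Arrays.asList(a, b));
        PartialSum mapSide2 = combine(Arrays.asList(c));
        double[] once = reduce(Arrays.asList(mapSide1, mapSide2));

        // Combiner runs again on the Reducer side, over the first combiner's output.
        PartialSum reduceSide = combine(Arrays.asList(mapSide1, mapSide2));
        double[] twice = reduce(Arrays.asList(reduceSide));

        System.out.println(Arrays.toString(none));
        System.out.println(Arrays.toString(once));
        System.out.println(Arrays.toString(twice));
    }
}
```

All three cases print the same centroid, because every stage carries (count, sum) pairs and only
the reducer divides. A combiner that emitted a different format than it consumed would break in
the third case, which is the failure described above.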
 
Pradhuman
     
  
