mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Issue: Canopy is processing extremely slow, what goes wrong?
Date Tue, 13 Nov 2012 14:01:59 GMT
Canopy is very sensitive to the value of T2: too small a value will
cause the creation of very many canopies in each mapper, and these will
swamp the reducer. I suggest you begin with T1 = T2 = <a larger value>
until you get enough canopies. With CosineDistanceMeasure, a value of 1
ought to produce only a single canopy, and you can go smaller until you
get a reasonable number. There are also T3 and T4 arguments that allow
you to specify the T1 and T2 values used by the reducer.
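
For example, here is a minimal sketch of a re-run against the command you
posted below, assuming a hypothetical starting threshold of 0.9 for both T1
and T2 (the exact value is something to tune down until you see a reasonable
number of canopies), and assuming your Mahout build exposes the reducer-side
-t3/-t4 options:

   # Hypothetical starting point: lower -t1/-t2 gradually once canopies appear.
   hadoop jar $MAHOUT_HOME/mahout-core-0.5-job.jar \
     org.apache.mahout.clustering.canopy.CanopyDriver \
     -Dmapred.max.split.size=4000000 \
     -i /mahout/vectors/tbvideo-vectors/tfidf-vectors \
     -o /mahout/output/tbvideo-canopy-centroids/ \
     -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
     -t1 0.9 -t2 0.9
   # If your version supports them, adding -t3 0.9 -t4 0.9 would set
   # separate T1/T2 values for the reducer.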

On 11/13/12 7:01 AM, Phoenix Bai wrote:
> Hi All,
>
> 1) data size:
> environment: company's Hadoop cluster.
> Raw data: 12M
> tfidf vectors: 25M (ng is set to 2)
>
> 2) running command:
> The tfidf vectors are fed to canopy with the command below:
>
> hadoop jar $MAHOUT_HOME/mahout-core-0.5-job.jar
> org.apache.mahout.clustering.canopy.CanopyDriver
> -Dmapred.max.split.size=4000000 \
> -i /mahout/vectors/tbvideo-vectors/tfidf-vectors \
> -o /mahout/output/tbvideo-canopy-centroids/ \
> -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
> -t1 0.70 -t2 0.3
>
> 3) canopy running status:
> The MR job runs seemingly forever: the map tasks finish very quickly, while
> the reduce task always hangs at 66%, like below:
>
> 12/11/13 16:29:00 INFO mapred.JobClient:  map 96% reduce 0%
> 12/11/13 16:29:07 INFO mapred.JobClient:  map 96% reduce 30%
> 12/11/13 16:29:26 INFO mapred.JobClient:  map 100% reduce 30%
> 12/11/13 16:29:41 INFO mapred.JobClient:  map 100% reduce 66%
> 12/11/13 19:34:39 INFO mapred.JobClient:  map 100% reduce 0%
> 12/11/13 19:34:47 INFO mapred.JobClient: Task Id :
> attempt_201210311519_1936030_r_000000_0, Status : FAILED
> java.io.IOException: Task process exit with nonzero status of 137.
>   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:456)
> 12/11/13 19:35:06 INFO mapred.JobClient:  map 100% reduce 66%
>
> Or sometimes an error like this:
>
> 000000_0, Status : FAILED
> Task attempt_201210311519_1900983_r_000000_0 failed to report status
> for 600 seconds. Killing!
>
> Here is the jstack dump when it gets to 66%:
>
> "main" prio=10 tid=0x000000005071a000 nid=0x7ab8 runnable [0x0000000040a3a000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.mahout.math.OrderedIntDoubleMapping.find(OrderedIntDoubleMapping.java:83)
>         at org.apache.mahout.math.OrderedIntDoubleMapping.get(OrderedIntDoubleMapping.java:88)
>         at org.apache.mahout.math.SequentialAccessSparseVector.getQuick(SequentialAccessSparseVector.java:184)
>         at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:138)
>         at org.apache.mahout.clustering.AbstractCluster.formatVector(AbstractCluster.java:301)
>         at org.apache.mahout.clustering.canopy.CanopyClusterer.addPointToCanopies(CanopyClusterer.java:163)
>         at org.apache.mahout.clustering.canopy.CanopyReducer.reduce(CanopyReducer.java:44)
>         at org.apache.mahout.clustering.canopy.CanopyReducer.reduce(CanopyReducer.java:29)
>         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>         at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:544)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:407)
>         at org.apache.hadoop.mapred.Child.main(Child.java:167)
>
> 4) So, my questions are:
>
> What is wrong? Why does it always hang at 66%?
> I thought canopy was a faster algorithm than k-means, but in this case
> k-means runs a whole lot faster than canopy.
> I have run canopy several times over two days and have never seen it
> finish; it always throws errors when it gets to 66% of the reduce phase.
>
> Please enlighten me, or point me toward what the problem could be and
> how I could fix it.
> It is only 30M of data, so it can't be the size, right?
>
> Thanks all in advance!
>

