mahout-user mailing list archives

From DAN HELM
Subject Distributed processing and CVB clustering
Date Mon, 04 Jun 2012 15:51:59 GMT
This may be more of a Hadoop question, but I have a question about distributed processing
in CVB clustering.

Previously I created a derivative rowid program to generate multiple “matrix” files (i.e.,
one for each input “part” file generated by seq2sparse).  For my testing, the new rowid
generates 3 “matrix” files, matrix-0, matrix-1, and matrix-2.

When running CVB against these multiple “matrix” files I am getting (possibly) odd behavior. 
I am running on a 3 node cluster and noticed, as expected, the 3 matrix files are copied/reside
on 3 separate nodes (3 input split locations).

But when running CVB, where I specify the HDFS folder containing the matrix files as input,
it seems to run all 3 mappers on one node for each iteration.  For the first iteration of CVB,
the 3 mappers ran on the machine I submitted the job from (our namenode machine); for the
second iteration a different node was selected to run the 3 mappers; for iteration 3, a different
node was selected again, etc.

Each node in our cluster is quite high-end and very underutilized, so I’m wondering if Hadoop
is running all the mappers on the same machine because that machine has lots of available cores?

Thanks, Dan