I am implementing a matrix factorisation technique for matrices that do not
fit in the memory of a single node. I have checked the documentation and the
book Mahout in Action for the distributed matrix operations in
DistributedRowMatrix. I need to carry out some distributed matrix operations,
and I have designed the algorithm as follows.
There are three matrices: A, B and C
Divide the matrix A into chunks
Divide C into chunks
Map chunks of A, C and the matrix B
Compute the updates
Reduce matrix C, then compute matrix B
Repeat the above steps for Maxiterations iterations
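
The loop above could be sketched roughly as follows. This is only an in-memory illustration in plain Java, not Mahout or Hadoop code. The actual update rule is not specified here, so the sketch assumes A is approximated by C times B and uses a simple gradient step. All names (ChunkedFactorizationSketch, mapChunk, sum, factorize, squaredError, chunkRows, lr) are hypothetical.

```java
import java.util.stream.IntStream;

class ChunkedFactorizationSketch {

    /** "Map" step for rows [from, to): updates those rows of C in place and
     *  returns this chunk's partial gradient for B (k x n).
     *  The gradient update is an assumption made for illustration. */
    static double[][] mapChunk(double[][] a, double[][] c, double[][] b,
                               int from, int to, double lr) {
        int k = b.length, n = b[0].length;
        double[][] gradB = new double[k][n];
        for (int i = from; i < to; i++) {
            // Residual of row i: r = A_i - C_i * B
            double[] r = new double[n];
            for (int j = 0; j < n; j++) {
                double pred = 0;
                for (int f = 0; f < k; f++) pred += c[i][f] * b[f][j];
                r[j] = a[i][j] - pred;
            }
            for (int f = 0; f < k; f++) {
                double gradC = 0;
                for (int j = 0; j < n; j++) {
                    gradB[f][j] += c[i][f] * r[j]; // uses C_i before its update
                    gradC += r[j] * b[f][j];
                }
                c[i][f] += lr * gradC;
            }
        }
        return gradB;
    }

    /** "Reduce" step: element-wise sum of two partial gradients. */
    static double[][] sum(double[][] x, double[][] y) {
        double[][] out = new double[x.length][x[0].length];
        for (int f = 0; f < x.length; f++)
            for (int j = 0; j < x[0].length; j++)
                out[f][j] = x[f][j] + y[f][j];
        return out;
    }

    /** Repeats map (per row chunk of A and C) and reduce (for B). */
    static void factorize(double[][] a, double[][] c, double[][] b,
                          int chunkRows, int maxIterations, double lr) {
        int m = a.length;
        int numChunks = (m + chunkRows - 1) / chunkRows;
        for (int it = 0; it < maxIterations; it++) {
            // Chunks cover disjoint rows of C, so they can run in parallel.
            double[][] gradB = IntStream.range(0, numChunks).parallel()
                .mapToObj(ch -> mapChunk(a, c, b, ch * chunkRows,
                                         Math.min(m, (ch + 1) * chunkRows), lr))
                .reduce(new double[b.length][b[0].length],
                        ChunkedFactorizationSketch::sum);
            // B is shared by all chunks and is only updated after the reduce.
            for (int f = 0; f < b.length; f++)
                for (int j = 0; j < b[0].length; j++)
                    b[f][j] += lr * gradB[f][j];
        }
    }

    /** Sum of squared entries of A - C * B. */
    static double squaredError(double[][] a, double[][] c, double[][] b) {
        double err = 0;
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++) {
                double pred = 0;
                for (int f = 0; f < b.length; f++) pred += c[i][f] * b[f][j];
                err += (a[i][j] - pred) * (a[i][j] - pred);
            }
        return err;
    }
}
```

In this sketch only B has to be visible to every chunk, while each chunk of A and its matching chunk of C stay together. On a cluster the map and reduce steps would be Hadoop tasks reading from HDFS rather than a parallel stream.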
1. Do I need to distribute the matrices on the cluster myself if I am carrying
out these operations?
2. How can I control the amount of parallelism, for example the number of
mappers?
3. When I used the constructor of DistributedRowMatrix
    DistributedRowMatrix m = new DistributedRowMatrix("path/to/vector/sequenceFile", "tmp/path", 10000000, 250000);
from the example found at
https://hudson.apache.org/hudson/job/MahoutQuality/javadoc/org/apache/mahout/math/hadoop/DistributedRowMatrix.html#getOutputTempPath()
the compiler reports: The constructor DistributedRowMatrix(String, String, int,
int) is undefined
I dug a bit and found that the example passes the first two parameters as
Strings, whereas the constructor expects them to be of type Path. So I tried
to initialise them like this:
    Path in = new Path("path/to/vector/sequenceFile");
    Path out = new Path("/tmp/path");
and then passed in and out as parameters:
    DistributedRowMatrix m = new DistributedRowMatrix(in, out, 10000000, 250000);
4. Another point is that m.configure(new JobConf()); produces a warning that
JobConf is deprecated.
5. Is there any side effect from using the deprecated JobConf?
6. Could anybody point me to how to package this job and run it on a
cluster?
7. Also, I am not sure how to pass the sequence file when it resides on
HDFS.
Sorry if some of the questions look naive.
I appreciate any insights.
Regards
Ahmed Nagy

