Hi All,
Relatively new to Spark and Scala here.
I'm trying to solve a simple linear system of A*x = b where A is a sparse
matrix, and x and b are dense vectors. Standard fare, really. But the
solutions I've found from examples and tests don't seem efficient; I think
the use case I need is falling between the cracks (or I'm missing
something).
We have a barely parallel problem of medium-sized data: N = 10k-100k elements,
and the A matrix is a sparse NxN matrix (density of 0.001-0.003).
Currently, these can be solved on single-processor machines. Ideally, I'm
looking for a reasonably fast method that can handle larger data (N will
grow with time), and scales with more cores / machines.
I've looked at many examples on the web and archives, and apologies if I
missed something. I've already tried the LinearRegressionWithSGD as well as
an SVD approach, and both were slow and required arbitrary tuning
(step size, k principal vectors, etc.). (This wasn't unexpected, but gave me
a baseline. Both approaches do more work than is necessary in this case.)
I hand-rolled a simple conjugate gradient as well:
https://gist.github.com/cmcbride/7b8551400da179f24345
(all comments / headkicks / improvements are welcome)
This is one of several approaches I tried; it uses an IndexedRowMatrix with
sparse vectors for the rows and a hand-rolled matrix-vector multiplication.
I'm being a little pedantic in using tuples to ensure proper ordering of the
output.
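For anyone following along, the indexed-rows pattern I mean looks roughly like this. This is a minimal local sketch with plain collections standing in for the RDD of rows, not the actual code from the gist: each row of A travels as a (rowIndex, entries) tuple so the result can be assembled in the right order.

```scala
// Sparse matrix-vector product y = A*x, with A stored as indexed rows:
// each tuple is (rowIndex, Seq of (columnIndex, value) for that row).
// In Spark this would be an RDD[(Long, ...)]; here it's a plain Seq.
def spmv(rows: Seq[(Int, Seq[(Int, Double)])], x: Array[Double]): Array[Double] = {
  val y = new Array[Double](x.length)
  for ((i, entries) <- rows; (j, aij) <- entries)
    y(i) += aij * x(j) // the row index carried in the tuple fixes the output slot
  y
}
```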
In any case, all of these are far slower (10x-100x or more) than the
single-threaded Scala solution (a conjugate gradient approach) currently in
use. I'd obviously like to improve on that.
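For reference, the single-threaded baseline I'm comparing against is just plain conjugate gradient, which only touches A through a matrix-vector product. A minimal sketch (not the exact code in use; the `matvec` parameter is a stand-in for whatever sparse product you have):

```scala
// Conjugate gradient for A*x = b, where A is symmetric positive definite
// and is applied only via the supplied matrix-vector product.
def cg(matvec: Array[Double] => Array[Double],
       b: Array[Double],
       tol: Double = 1e-10,
       maxIter: Int = 1000): Array[Double] = {
  val n = b.length
  var x = Array.fill(n)(0.0)
  var r = b.clone()                    // residual r = b - A*x, with x = 0
  var p = r.clone()                    // search direction
  var rsOld = r.map(v => v * v).sum
  var iter = 0
  while (math.sqrt(rsOld) > tol && iter < maxIter) {
    val ap = matvec(p)
    val alpha = rsOld / p.zip(ap).map { case (pi, ai) => pi * ai }.sum
    x = x.zip(p).map  { case (xi, pi) => xi + alpha * pi }
    r = r.zip(ap).map { case (ri, ai) => ri - alpha * ai }
    val rsNew = r.map(v => v * v).sum
    p = r.zip(p).map  { case (ri, pi) => ri + (rsNew / rsOld) * pi }
    rsOld = rsNew
    iter += 1
  }
  x
}
```

The point of this shape is that only `matvec` needs to be distributed; the vector updates are cheap, which is exactly why the per-iteration Spark overhead dominates at these sizes.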
Thanks!
Cameron
