spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Cameron McBride <cameron.mcbr...@gmail.com>
Subject spark distributed linear system with sparse data
Date Tue, 29 Sep 2015 19:58:04 GMT
Hi All,

Relatively new to spark and scala, here.

I'm trying to solve a simple linear system of A*x = b where A is a sparse
matrix, and x and b are dense vectors. Standard fare, really. But the
solutions I've found from examples and tests don't seem efficient; I think
the use case I need is falling between the cracks (or I'm missing
something).

We have a barely parallel problem of medium sized data: N = 10k-100k
elements
and the A matrix is a sparse NxN matrix (density of 0.001-0.003).
Currently, these can be solved on single processor machines. Ideally, I'm
looking for a reasonably fast method that can handle larger data (N will
grow with time), and scales with more cores / machines.

I've looked at many examples on the web and archives, and apologies if I
missed something. I've already tried the LinearRegressionWithSGD as well as
an SVD approach, and both were slow and required arbitrary tuning
(stepsize, k principle vectors, etc). (This wasn't unexpected, but gave me
a baseline. Both approaches do more work than is necessary in this case.)

I hand rolled a simple conjugate gradient as well:
https://gist.github.com/cmcbride/7b8551400da179f24345
(all comments / headkicks / improvements are welcome)

This is one of several approaches I tried, but utilizes an IndexedRowMatrix
that uses sparse vectors for the rows, and a hand rolled Matrix-vector
multiplication. I'm being a little pedantic using tuples to ensure proper
ordering of the output.

In any case, all of these are waaay slower (10x-100x slower or more) than a
single threaded scala solution (conjugate gradient approach) currently in
use. I'd obviously like to improve that.

Thanks!

Cameron

Mime
View raw message