I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in
many cases.
You might consider taking a look at the code paths that BIDMat (
https://github.com/BIDData/BIDMat) takes and comparing them to
netlib-java/Breeze. John Canny et al. have done a bunch of work optimizing
BIDMat to make it really fast from Scala. I've run it on my laptop and
compared it to MKL, and in certain cases it's 10x faster at matrix multiply.
There are a lot of layers of indirection here and you really want to avoid
data copying as much as possible.
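To make the copying point concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: the throughput and bus bandwidth figures are made-up placeholders, not measurements from any machine in this thread, and the function names are hypothetical. It compares a GEMM on the CPU with the same GEMM on the GPU, with and without copying the operands across the bus for every call.

```python
def gemm_flops(m, k, n):
    """Floating-point operations for an (m x k) times (k x n) multiply."""
    return 2 * m * k * n


def gemm_bytes(m, k, n, elem_bytes=8):
    """Bytes crossing the bus per call: copy A and B in, copy C out (doubles)."""
    return elem_bytes * (m * k + k * n + m * n)


def cpu_time(m, k, n, cpu_flops):
    """Estimated seconds for the multiply on the CPU (compute only)."""
    return gemm_flops(m, k, n) / cpu_flops


def gpu_time(m, k, n, gpu_flops, bus_bytes_per_s, resident=False):
    """Estimated seconds on the GPU; resident=True means no host/device copies."""
    compute = gemm_flops(m, k, n) / gpu_flops
    copy = 0.0 if resident else gemm_bytes(m, k, n) / bus_bytes_per_s
    return compute + copy


# ASSUMED, illustrative hardware numbers (not measured anywhere in this thread).
CPU_FLOPS = 100e9   # sustained CPU BLAS throughput, flop/s
GPU_FLOPS = 1000e9  # sustained GPU BLAS throughput, flop/s
BUS = 6e9           # effective host<->device bandwidth, bytes/s

if __name__ == "__main__":
    # A 1000x780 matrix times batches of different widths.
    for batch in (1, 100, 10000):
        t_cpu = cpu_time(1000, 780, batch, CPU_FLOPS)
        t_copy = gpu_time(1000, 780, batch, GPU_FLOPS, BUS)
        t_res = gpu_time(1000, 780, batch, GPU_FLOPS, BUS, resident=True)
        print(f"batch={batch}: cpu={t_cpu:.2e}s "
              f"gpu+copy={t_copy:.2e}s gpu resident={t_res:.2e}s")
```

Under these assumed numbers, the per-call copy dominates whenever the multiply does little work per byte moved, which is why keeping intermediates on the device matters. It is not meant to explain the specific measurements quoted below.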
We could also consider replacing Breeze with BIDMat, but that would be a
big project, and if we can figure out how to get Breeze+cuBLAS to comparable
performance, that would be a big win.
On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <alexander.ulanov@hp.com>
wrote:
> Dear Spark developers,
>
> I am exploring how to make linear algebra operations faster within Spark.
> One way of doing this is to use the Scala Breeze library that is bundled with
> Spark. For matrix operations, it employs netlib-java, a Java wrapper
> for BLAS (Basic Linear Algebra Subprograms) and LAPACK native
> binaries, if they are available on the worker node. It also has its own
> optimized Java implementation of BLAS. It is worth mentioning that native
> binaries provide better performance only for BLAS level 3, i.e.
> matrix-matrix operations or general matrix multiplication (GEMM). This is
> confirmed by the GEMM test on the netlib-java page
> https://github.com/fommil/netlib-java. I also confirmed it with my
> experiments with training an artificial neural network
> https://github.com/apache/spark/pull/1290#issuecomment-70313952. However,
> I would like to boost performance further.
>
> GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA
> implementation of BLAS called cuBLAS. I have one Linux server with an Nvidia
> GPU and I was able to do the following. I linked cuBLAS (instead of a
> CPU-based BLAS) with the netlib-java wrapper and put it into Spark, so
> Breeze/netlib-java is using it. Then I did some performance measurements of
> artificial neural network batch learning in Spark MLlib, which
> involves matrix-matrix multiplications. It turns out that for matrices of
> size less than ~1000x780, GPU cuBLAS has the same speed as CPU BLAS. cuBLAS
> becomes slower for bigger matrices. It is worth mentioning that this was not
> a test of ONLY multiplication, since there are other operations involved.
> One of the reasons for the slowdown might be the overhead of copying the
> matrices from main memory to graphics card memory and back.
>
> So, a few questions:
> 1) Do these results with CUDA make sense?
> 2) If the problem is the copy overhead, are there any libraries that
> allow forcing intermediate results to stay in graphics card memory, thus
> removing the overhead?
> 3) Are there any other options to speed up linear algebra in Spark?
>
> Thank you, Alexander
