spark-dev mailing list archives

From "Ulanov, Alexander" <alexander.ula...@hp.com>
Subject Using CUDA within Spark / boosting linear algebra
Date Thu, 05 Feb 2015 19:55:21 GMT
Dear Spark developers,

I am exploring how to make linear algebra operations faster within Spark. One way of doing
this is to use the Scala Breeze library that is bundled with Spark. For matrix operations, it
employs Netlib-java, a Java wrapper for the native BLAS (Basic Linear Algebra Subprograms) and
LAPACK binaries, which are used if they are available on the worker node. Netlib-java also has
its own optimized Java implementation of BLAS. It is worth mentioning that the native binaries
provide better performance only for BLAS level 3, i.e. matrix-matrix operations such as general
matrix multiplication (GEMM).
This is confirmed by the GEMM test on the Netlib-java page: https://github.com/fommil/netlib-java
I also confirmed it in my experiments with training an artificial neural network: https://github.com/apache/spark/pull/1290#issuecomment-70313952
However, I would like to boost performance further.
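
A rough way to see why level 3 stands apart is to count floating-point operations per element
touched for each BLAS level. The sketch below is only a back-of-the-envelope illustration, not
something measured in the experiments above; the function name flops_per_element is hypothetical
and the operation counts are the usual textbook approximations for square problems of size n.

```python
def flops_per_element(level: int, n: int) -> float:
    """Approximate floating-point operations per element touched for size n.

    Illustrative textbook counts only (an assumption of this sketch):
      level 1: dot product x . y         -> 2n flops over 2n elements
      level 2: matrix-vector y = A x     -> 2n^2 flops over n^2 + 2n elements
      level 3: matrix-matrix C = A B     -> 2n^3 flops over 3n^2 elements
    """
    if level == 1:
        flops, elements = 2 * n, 2 * n
    elif level == 2:
        flops, elements = 2 * n * n, n * n + 2 * n
    elif level == 3:
        flops, elements = 2 * n ** 3, 3 * n * n
    else:
        raise ValueError("BLAS level must be 1, 2 or 3")
    return flops / elements


if __name__ == "__main__":
    for level in (1, 2, 3):
        print(level, flops_per_element(level, 1000))
```

Under these counts only level 3 performs an amount of work per element that grows with n, which
is one plausible reason an optimized native kernel pays off there and much less at levels 1 and 2.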

GPUs are supposed to be fast at linear algebra, and Nvidia provides a CUDA implementation of
BLAS called cuBLAS. I have one Linux server with an Nvidia GPU, and I was able to do the following.
I linked cuBLAS (instead of a CPU-based BLAS) with the Netlib-java wrapper and put it into Spark,
so Breeze/Netlib uses it. Then I took some performance measurements of artificial
neural network batch learning in Spark MLlib, which involves matrix-matrix multiplications.
It turns out that for matrices smaller than ~1000x780, GPU cuBLAS has the same speed as
CPU BLAS, and cuBLAS becomes slower for bigger matrices. It is worth mentioning that this was not
a test of multiplication ONLY, since other operations are involved. One of the reasons
for the slowdown might be the overhead of copying the matrices from main memory to graphics
card memory and back.
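
To make the copy-overhead hypothesis concrete, here is a minimal sketch of a cost model for a
single offloaded GEMM call. It assumes double-precision values, that both inputs and the result
cross the bus on every call, and that compute and transfer do not overlap. The function name
gemm_offload_estimate and all throughput figures in the usage example are hypothetical
placeholders, not measurements from the server described above.

```python
def gemm_offload_estimate(m, k, n, gpu_flops, cpu_flops, bus_bytes_per_s,
                          bytes_per_value=8):
    """Estimate times (seconds) for C = A * B with A of size m x k, B of size k x n.

    Assumptions of this sketch: 2*m*k*n floating-point operations, every
    matrix is copied once over the bus, and there is no compute/copy overlap.
    """
    flops = 2.0 * m * k * n
    # A and B go to the device, C comes back.
    transfer_bytes = bytes_per_value * (m * k + k * n + m * n)
    transfer_s = transfer_bytes / bus_bytes_per_s
    gpu_compute_s = flops / gpu_flops
    return {
        "cpu_s": flops / cpu_flops,
        "gpu_compute_s": gpu_compute_s,
        "transfer_s": transfer_s,
        "gpu_total_s": gpu_compute_s + transfer_s,
    }


if __name__ == "__main__":
    # Placeholder hardware numbers, chosen only for illustration.
    estimate = gemm_offload_estimate(
        1000, 780, 1000,
        gpu_flops=1e12, cpu_flops=1e11, bus_bytes_per_s=6e9,
    )
    for name, seconds in estimate.items():
        print(f"{name}: {seconds:.6f}")
```

The model covers only one multiplication. In a training loop with other operations between
multiplications, each round trip adds its own transfer term, so the copy cost can dominate
unless intermediate results stay on the device.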

So, a few questions:
1) Do these results with CUDA make sense?
2) If the problem is the copy overhead, are there any libraries that allow forcing intermediate
results to stay in graphics card memory, thus removing the overhead?
3) Are there any other options to speed up linear algebra in Spark?

Thank you, Alexander


