Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the comment that BIDMat
0.9.7 uses Float matrices in GPU (although I see the support of Double in the current source
code), did the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with
netlib MKL.
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
Best regards, Alexander
Original Message
From: Sam Halliday [mailto:sam.halliday@gmail.com]
Sent: Tuesday, March 03, 2015 1:54 PM
To: Xiangrui Meng; Joseph Bradley
Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra
BTW, is anybody on this list going to the London Meetup in a few weeks?
https://skillsmatter.com/meetups/6987apachesparklivingthepostmapreduceworld#community
Would be nice to meet other people working on the guts of Spark! :)
Xiangrui Meng <mengxr@gmail.com> writes:
> Hey Alexander,
>
> I don't quite understand the part where netlibcublas is about 20x
> slower than netlibopenblas. What is the overhead of using a GPU BLAS
> with netlibjava?
>
> CC'ed Sam, the author of netlibjava.
>
> Best,
> Xiangrui
>
> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <joseph@databricks.com> wrote:
>> Better documentation for linking would be very helpful! Here's a JIRA:
>> https://issues.apache.org/jira/browse/SPARK6019
>>
>>
>> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
>> <evan.sparks@gmail.com>
>> wrote:
>>
>>> Thanks for compiling all the data and running these benchmarks,
>>> Alex. The big takeaways here can be seen with this chart:
>>>
>>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZ
>>> Hl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>>
>>> 1) A properly configured GPU matrix multiply implementation (e.g.
>>> BIDMat+GPU) can provide substantial (but less than an order of
>>> BIDMat+magnitude)
>>> benefit over a welltuned CPU implementation (e.g. BIDMat+MKL or
>>> netlibjava+openblascompiled).
>>> 2) A poorly tuned CPU implementation can be 12 orders of magnitude
>>> worse than a welltuned CPU implementation, particularly for larger matrices.
>>> (netlibf2jblas or netlibref) This is not to pick on netlib  this
>>> basically agrees with the authors own benchmarks (
>>> https://github.com/fommil/netlibjava)
>>>
>>> I think that most of our users are in a situation where using GPUs
>>> may not be practical  although we could consider having a good GPU
>>> backend available as an option. However, *ALL* users of MLlib could
>>> benefit (potentially tremendously) from using a welltuned CPUbased
>>> BLAS implementation. Perhaps we should consider updating the mllib
>>> guide with a more complete section for enabling high performance
>>> binaries on OSX and Linux? Or better, figure out a way for the
>>> system to fetch these automatically.
>>>
>>>  Evan
>>>
>>>
>>>
>>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
>>> alexander.ulanov@hp.com> wrote:
>>>
>>>> Just to summarize this thread, I was finally able to make all
>>>> performance comparisons that we discussed. It turns out that:
>>>> BIDMatcublas>>BIDMat
>>>> MKL==netlibmkl==netlibopenblascompiled>netlibopenblasyumrepo=
>>>> =netlibcublas>netlibblas>f2jblas
>>>>
>>>> Below is the link to the spreadsheet with full results.
>>>>
>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx
>>>> 378T9J5r7kwKSPkY/edit?usp=sharing
>>>>
>>>> One thing still needs exploration: does BIDMatcublas perform
>>>> copying to/from machine’s RAM?
>>>>
>>>> Original Message
>>>> From: Ulanov, Alexander
>>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>>> To: Evan R. Sparks
>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Thanks, Evan! It seems that ticket was marked as duplicate though
>>>> the original one discusses slightly different topic. I was able to
>>>> link netlib with MKL from BIDMat binaries. Indeed, MKL is
>>>> statically linked inside a 60MB library.
>>>>
>>>> A*B size  BIDMat MKL  Breeze+NetlibMKL from BIDMat
>>>> Breeze+NetlibOpenBlas(native system) Breeze+Netlibf2jblas 
>>>> ++
>>>> 100x100*100x100  0,00205596  0,000381  0,03810324  0,002556 
>>>> 1000x1000*1000x1000  0,018320947  0,038316857  0,51803557
>>>> 1,638475459 
>>>> 10000x10000*10000x10000  23,78046632  32,94546697 445,0935211 
>>>> 1569,233228 
>>>>
>>>> It turn out that precompiled MKL is faster than precompiled
>>>> OpenBlas on my machine. Probably, I’ll add two more columns with
>>>> locally compiled openblas and cuda.
>>>>
>>>> Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:evan.sparks@gmail.com]
>>>> Sent: Monday, February 09, 2015 6:06 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Great  perhaps we can move this discussion offlist and onto a
>>>> JIRA ticket? (Here's one:
>>>> https://issues.apache.org/jira/browse/SPARK5705)
>>>>
>>>> It seems like this is going to be somewhat exploratory for a while
>>>> (and there's probably only a handful of us who really care about
>>>> fast linear
>>>> algebra!)
>>>>
>>>>  Evan
>>>>
>>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
>>>> alexander.ulanov@hp.com<mailto:alexander.ulanov@hp.com>> wrote:
>>>> Hi Evan,
>>>>
>>>> Thank you for explanation and useful link. I am going to build
>>>> OpenBLAS, link it with Netlibjava and perform benchmark again.
>>>>
>>>> Do I understand correctly that BIDMat binaries contain statically
>>>> linked Intel MKL BLAS? It might be the reason why I am able to run
>>>> BIDMat not having MKL BLAS installed on my server. If it is true, I
>>>> wonder if it is OK because Intel sells this library. Nevertheless,
>>>> it seems that in my case precompiled MKL BLAS performs better than
>>>> precompiled OpenBLAS given that BIDMat and Netlibjava are supposed to be
on par with JNI overheads.
>>>>
>>>> Though, it might be interesting to link Netlibjava with Intel MKL,
>>>> as you suggested. I wonder, are John Canny (BIDMat) and Sam
>>>> Halliday
>>>> (Netlibjava) interested to compare their libraries.
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:evan.sparks@gmail.com<mailto:
>>>> evan.sparks@gmail.com>]
>>>> Sent: Friday, February 06, 2015 5:58 PM
>>>>
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley;
>>>> dev@spark.apache.org<mailto:dev@spark.apache.org>
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I would build OpenBLAS yourself, since good BLAS performance comes
>>>> from getting cache sizes, etc. set up correctly for your particular
>>>> hardware  this is often a very tricky process (see, e.g. ATLAS),
>>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
>>>> quickly and yields performance competitive with MKL.
>>>>
>>>> To make sure the right library is getting used, you have to make
>>>> sure it's first on the search path  export
>>>> LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.
>>>>
>>>> For some examples of getting netlibjava setup on an ec2 node and
>>>> some example benchmarking code we ran a while back, see:
>>>> https://github.com/shivaram/matrixbench
>>>>
>>>> In particular  buildopenblasec2.sh shows you how to build the
>>>> library and set up symlinks correctly, and scala/runnetlib.sh
>>>> shows you how to get the path setup and get that library picked up by netlibjava.
>>>>
>>>> In this way  you could probably get cuBLAS set up to be used by
>>>> netlibjava as well.
>>>>
>>>>  Evan
>>>>
>>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
>>>> alexander.ulanov@hp.com<mailto:alexander.ulanov@hp.com>> wrote:
>>>> Evan, could you elaborate on how to force BIDMat and netlibjava to
>>>> force loading the right blas? For netlib, I there are few JVM
>>>> flags, such as
>>>> Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
>>>> so I can force it to use Java implementation. Not sure I understand how to
force use a specific blas (not specific wrapper for blas).
>>>>
>>>> Btw. I have installed openblas (yum install openblas), so I suppose
>>>> that netlib is using it.
>>>>
>>>> From: Evan R. Sparks [mailto:evan.sparks@gmail.com<mailto:
>>>> evan.sparks@gmail.com>]
>>>> Sent: Friday, February 06, 2015 5:19 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley;
>>>> dev@spark.apache.org<mailto:dev@spark.apache.org>
>>>>
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Getting breeze to pick up the right blas library is critical for
>>>> performance. I recommend using OpenBLAS (or MKL, if you already have it).
>>>> It might make sense to force BIDMat to use the same underlying BLAS
>>>> library as well.
>>>>
>>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
>>>> alexander.ulanov@hp.com<mailto:alexander.ulanov@hp.com>> wrote:
>>>> Hi Evan, Joseph
>>>>
>>>> I did few matrix multiplication test and BIDMat seems to be ~10x
>>>> faster than netlibjava+breeze (sorry for weird table formatting):
>>>>
>>>> A*B size  BIDMat MKL  Breeze+Netlibjava
>>>> native_system_linux_x8664
>>>> Breeze+Netlibjava f2jblas 
>>>> ++
>>>> 100x100*100x100  0,00205596  0,03810324  0,002556 
>>>> 1000x1000*1000x1000  0,018320947  0,51803557 1,638475459 
>>>> 10000x10000*10000x10000  23,78046632  445,0935211  1569,233228
>>>> 
>>>>
>>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora
>>>> 19 Linux, Scala 2.11.
>>>>
>>>> Later I will make tests with Cuda. I need to install new Cuda
>>>> version for this purpose.
>>>>
>>>> Do you have any ideas why breezenetlib with native blas is so much
>>>> slower than BIDMat MKL?
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Joseph Bradley [mailto:joseph@databricks.com<mailto:
>>>> joseph@databricks.com>]
>>>> Sent: Thursday, February 05, 2015 5:29 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Evan R. Sparks;
>>>> dev@spark.apache.org<mailto:dev@spark.apache.org>
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting. Small comment:
>>>> Concerning your question earlier about keeping data stored on the
>>>> GPU rather than having to move it between main memory and GPU
>>>> memory on each iteration, I would guess this would be critical to
>>>> getting good performance. If you could do multiple local
>>>> iterations before aggregating results, then the cost of data
>>>> movement to the GPU could be amortized (and I believe that is done
>>>> in practice). Having Spark be aware of the GPU and using it as another part
of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph
>>>>
>>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <
>>>> alexander.ulanov@hp.com<mailto:alexander.ulanov@hp.com>> wrote:
>>>> Thank you for explanation! I’ve watched the BIDMach presentation by
>>>> John Canny and I am really inspired by his talk and comparisons with Spark
MLlib.
>>>>
>>>> I am very interested to find out what will be better within Spark:
>>>> BIDMat or netlibjava with CPU or GPU natives. Could you suggest a
>>>> fair way to benchmark them? Currently I do benchmarks on artificial
>>>> neural networks in batch mode. While it is not a “pure” test of
>>>> linear algebra, it involves some other things that are essential to machine
learning.
>>>>
>>>> From: Evan R. Sparks [mailto:evan.sparks@gmail.com<mailto:
>>>> evan.sparks@gmail.com>]
>>>> Sent: Thursday, February 05, 2015 1:29 PM
>>>> To: Ulanov, Alexander
>>>> Cc: dev@spark.apache.org<mailto:dev@spark.apache.org>
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I'd be surprised of BIDMat+OpenBLAS was significantly faster than
>>>> netlibjava+OpenBLAS, but if it is much faster it's probably due to
>>>> netlibjava+data
>>>> layout and fewer levels of indirection  it's definitely a
>>>> worthwhile experiment to run. The main speedups I've seen from
>>>> using it come from highly optimized GPU code for linear algebra. I
>>>> know that in the past Canny has gone as far as to write custom GPU
>>>> kernels for performancecritical regions of code.[1]
>>>>
>>>> BIDMach is highly optimized for single node performance or
>>>> performance on small clusters.[2] Once data doesn't fit easily in
>>>> GPU memory (or can be batched in that way) the performance tends to
>>>> fall off. Canny argues for hardware/software codesign and as such
>>>> prefers machine configurations that are quite different than what
>>>> we find in most commodity cluster nodes  e.g. 10 disk cahnnels and 4 GPUs.
>>>>
>>>> In contrast, MLlib was designed for horizontal scalability on
>>>> commodity clusters and works best on very big datasets  order of terabytes.
>>>>
>>>> For the most part, these projects developed concurrently to address
>>>> slightly different use cases. That said, there may be bits of
>>>> BIDMach we could repurpose for MLlib  keep in mind we need to be
>>>> careful about maintaining crosslanguage compatibility for our Java
>>>> and Pythonusers, though.
>>>>
>>>>  Evan
>>>>
>>>> [1]  http://arxiv.org/abs/1409.5402 [2] 
>>>> http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>>>>
>>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <
>>>> alexander.ulanov@hp.com<mailto:alexander.ulanov@hp.com><mailto:
>>>> alexander.ulanov@hp.com<mailto:alexander.ulanov@hp.com>>> wrote:
>>>> Hi Evan,
>>>>
>>>> Thank you for suggestion! BIDMat seems to have terrific speed. Do
>>>> you know what makes them faster than netlibjava?
>>>>
>>>> The same group has BIDMach library that implements machine
>>>> learning. For some examples they use Caffe convolutional neural
>>>> network library owned by another group in Berkeley. Could you
>>>> elaborate on how these all might be connected with Spark Mllib? If
>>>> you take BIDMat for linear algebra why don’t you take BIDMach for optimization
and learning?
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:evan.sparks@gmail.com<mailto:
>>>> evan.sparks@gmail.com><mailto:evan.sparks@gmail.com<mailto:
>>>> evan.sparks@gmail.com>>]
>>>> Sent: Thursday, February 05, 2015 12:09 PM
>>>> To: Ulanov, Alexander
>>>> Cc: dev@spark.apache.org<mailto:dev@spark.apache.org><mailto:
>>>> dev@spark.apache.org<mailto:dev@spark.apache.org>>
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I'd expect that we can make GPUaccelerated BLAS faster than CPU
>>>> blas in many cases.
>>>>
>>>> You might consider taking a look at the codepaths that BIDMat (
>>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>>>> netlibjava/breeze. John Canny et. al. have done a bunch of work
>>>> optimizing to make this work really fast from Scala. I've run it on
>>>> my laptop and compared to MKL and in certain cases it's 10x faster at matrix
multiply.
>>>> There are a lot of layers of indirection here and you really want
>>>> to avoid data copying as much as possible.
>>>>
>>>> We could also consider swapping out BIDMat for Breeze, but that
>>>> would be a big project and if we can figure out how to get
>>>> breeze+cublas to comparable performance that would be a big win.
>>>>
>>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
>>>> alexander.ulanov@hp.com<mailto:alexander.ulanov@hp.com><mailto:
>>>> alexander.ulanov@hp.com<mailto:alexander.ulanov@hp.com>>> wrote:
>>>> Dear Spark developers,
>>>>
>>>> I am exploring how to make linear algebra operations faster within Spark.
>>>> One way of doing this is to use Scala Breeze library that is
>>>> bundled with Spark. For matrix operations, it employs Netlibjava
>>>> that has a Java wrapper for BLAS (basic linear algebra subprograms)
>>>> and LAPACK native binaries if they are available on the worker
>>>> node. It also has its own optimized Java implementation of BLAS. It
>>>> is worth mentioning, that native binaries provide better performance only
for BLAS level 3, i.e.
>>>> matrixmatrix operations or general matrix multiplication (GEMM).
>>>> This is confirmed by GEMM test on Netlibjava page
>>>> https://github.com/fommil/netlibjava. I also confirmed it with my
>>>> experiments with training of artificial neural network
>>>> https://github.com/apache/spark/pull/1290#issuecomment70313952.
>>>> However, I would like to boost performance more.
>>>>
>>>> GPU is supposed to work fast with linear algebra and there is
>>>> Nvidia CUDA implementation of BLAS, called cublas. I have one Linux
>>>> server with Nvidia GPU and I was able to do the following. I linked
>>>> cublas (instead of cpubased blas) with Netlibjava wrapper and put
>>>> it into Spark, so Breeze/Netlib is using it. Then I did some
>>>> performance measurements with regards to artificial neural network
>>>> batch learning in Spark MLlib that involves matrixmatrix
>>>> multiplications. It turns out that for matrices of size less than
>>>> ~1000x780 GPU cublas has the same speed as CPU blas. Cublas becomes
>>>> slower for bigger matrices. It worth mentioning that it is was not a test
for ONLY multiplication since there are other operations involved.
>>>> One of the reasons for slowdown might be the overhead of copying
>>>> the matrices from computer memory to graphic card memory and back.
>>>>
>>>> So, few questions:
>>>> 1) Do these results with CUDA make sense?
>>>> 2) If the problem is with copy overhead, are there any libraries
>>>> that allow to force intermediate results to stay in graphic card
>>>> memory thus removing the overhead?
>>>> 3) Any other options to speedup linear algebra in Spark?
>>>>
>>>> Thank you, Alexander
>>>>
>>>> 
>>>>  To unsubscribe, email: devunsubscribe@spark.apache.org<mailto:
>>>> devunsubscribe@spark.apache.org><mailto:devunsubscribe@spark.apac
>>>> he.org <mailto:devunsubscribe@spark.apache.org>>
>>>> For additional commands, email: devhelp@spark.apache.org<mailto:
>>>> devhelp@spark.apache.org><mailto:devhelp@spark.apache.org<mailto:
>>>> devhelp@spark.apache.org>>
>>>>
>>>>
>>>>
>>>>
>>>

Best regards,
Sam

To unsubscribe, email: devunsubscribe@spark.apache.org
For additional commands, email: devhelp@spark.apache.org
