spark-dev mailing list archives

From Max Grossman <j...@rice.edu>
Subject Re: Using CUDA within Spark / boosting linear algebra
Date Thu, 04 Feb 2016 15:13:05 GMT
Hi all,

I’m jumping on this thread to point out another Spark+GPU project for people to take a look
at: https://github.com/agrippa/spark-swat

SWAT (Spark with Accelerated Tasks) is a third-party JAR sitting on top of Spark that uses
runtime code generation to convert user-written transformations into OpenCL kernels. SWAT’s
lightweight runtime supports multi-GPU systems, managing each device and its memory automatically.
You write your own Spark programs, and the runtime takes care of offloading your transformations
to the GPUs in your system:

// The element type Int is assumed here for illustration.
val rdd = CLWrapper.cl(sc.objectFile[Int](inputPath))
val next = rdd.map(i => 2 * i).collect

SWAT primarily distinguishes itself in programmability: an explicit goal of this project is
to have as few user-visible API changes as possible from what people have come to know and
love in Spark. There are a number of fixed-function GPU libraries out there now, so we wanted
to look instead at something that could be used to build new but still well-performing Spark
apps.

SWAT is currently more of a research project than a production-ready system, so there’s
a chance it won’t work out-of-the-box on some systems. With that said, it does have fairly
comprehensive functional and code generation testing. If you’re interested in trying it
out and having trouble setting up, feel free to contact me directly. And of course, any questions
or feedback from the community are always welcome.

Thanks,

Max

> On Jan 22, 2016, at 3:42 AM, Kazuaki Ishizaki <ISHIZAKI@jp.ibm.com> wrote:
> 
> Hi Alexander,
> The goal of our columnar storage is to drive GPUs in Spark effectively. One important item is
to enable highly tuned GPU libraries such as BIDMach effectively and easily.
> 
> We will enable BIDMach with our columnar storage. On the other hand, scaling BIDMach with
current Spark is not an easy task. I expect that this talk will help us:
> http://conferences.oreilly.com/strata/hadoop-big-data-ca/public/schedule/detail/47565
> 
> We appreciate your great feedback.
> 
> Best Regards,
> Kazuaki Ishizaki, Ph.D., Senior research staff member, IBM Research - Tokyo
> 
> 
> 
> From:        "Ulanov, Alexander" <alexander.ulanov@hpe.com>
> To:        Kazuaki Ishizaki/Japan/IBM@IBMJP, "dev@spark.apache.org" <dev@spark.apache.org>,
Joseph Bradley <joseph@databricks.com>
> Cc:        John Canny <canny@berkeley.edu>, "Evan R. Sparks" <evan.sparks@gmail.com>,
Xiangrui Meng <mengxr@gmail.com>, Sam Halliday <sam.halliday@gmail.com>
> Date:        2016/01/22 04:20
> Subject:        RE: Using CUDA within Spark / boosting linear algebra
> 
> 
> 
> Hi Kazuaki,
>  
> Indeed, moving data to/from the GPU is costly, and this benchmark summarizes the costs of
moving different data sizes with regard to matrix multiplication. These costs are paid
for the convenience of using the standard BLAS API that Nvidia NVBLAS provides. The point
is that no code changes are required (in Spark); one just needs to reference the BLAS implementation
through a system variable. Naturally, a hardware-specific implementation will always be faster
than the default. The benchmark results show this by comparing jCuda (by means of BIDMat)
and NVBLAS. However, they also show that it is worth using NVBLAS for large matrices because it
can take advantage of several GPUs and will be faster despite the copying overhead. That
is also a known property advertised by Nvidia.
>  
> By the way, I don’t think that the column/row friendly format is an issue, because
one can use transposed matrices to fit the required format. I believe that is just a software
preference.
>  
> My suggestion with regard to your prototype would be to compare it with Spark’s
implementation of logistic regression (which does not take advantage of GPUs) and also with
BIDMach’s (which does). That will give users a better understanding
of your implementation’s performance. Currently you compare it with Spark’s example logistic
regression implementation, which is meant as a reference for learning Spark rather than
for benchmarking its performance.
>  
> Best regards, Alexander
>  
> From: Kazuaki Ishizaki [mailto:ISHIZAKI@jp.ibm.com]
> Sent: Thursday, January 21, 2016 3:34 AM
> To: dev@spark.apache.org; Ulanov, Alexander; Joseph Bradley
> Cc: John Canny; Evan R. Sparks; Xiangrui Meng; Sam Halliday
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>  
> Dear all,
> 
> >>>> Hi Alexander,
> >>>>
> >>>> Using GPUs with Spark would be very exciting.  Small comment:
> >>>> Concerning your question earlier about keeping data stored on the
> >>>> GPU rather than having to move it between main memory and GPU
> >>>> memory on each iteration, I would guess this would be critical to
> >>>> getting good performance.  If you could do multiple local
> >>>> iterations before aggregating results, then the cost of data
> >>>> movement to the GPU could be amortized (and I believe that is done
> >>>> in practice).  Having Spark be aware of the GPU and using it as another
part of memory sounds like a much bigger undertaking.
> >>>>
> >>>> Joseph
> 
> As Joseph pointed out before, there are two potential obstacles to exploiting GPUs efficiently
in Spark:
> (1) the cost of data movement between CPU and GPU
> (2) the cost of encoding/decoding between current row-format and GPU-friendly column
format
> 
> Our prototype http://kiszk.github.io/spark-gpu/ addresses
these two issues by supporting data partition caching in GPU device memory and by providing
binary column storage for data partitions. We would really appreciate any comments,
suggestions, or feedback.
> 
> Best Regards
> Kazuaki Ishizaki
> 
> 
> 
> From:        "Ulanov, Alexander" <alexander.ulanov@hpe.com>
> To:        Sam Halliday <sam.halliday@gmail.com>, John Canny <canny@berkeley.edu>
> Cc:        Xiangrui Meng <mengxr@gmail.com>, "dev@spark.apache.org" <dev@spark.apache.org>,
Joseph Bradley <joseph@databricks.com>, "Evan R. Sparks" <evan.sparks@gmail.com>
> Date:        2016/01/21 11:07
> Subject:        RE: Using CUDA within Spark / boosting linear algebra
> 
> 
> 
> 
> Hi Everyone,
> 
> I’ve updated the benchmark and run experiments on new hardware with 2x Nvidia Tesla
K80 (physically 4x Tesla K40) and 2x modern Haswell CPUs (Intel E5-2650 v3 @ 2.30GHz).
> 
> This time I computed the average and median of 10 runs for each experiment and approximated
the FLOPS.
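
For reference, one rough way to approximate FLOPS from a timing: multiplying two n x n matrices takes about 2*n^3 floating-point operations, and dividing that count by the wall-clock time gives the rate. The sketch below is only an illustration. The helper name gflops is made up, the timings are the 10000x10000 figures quoted further down in this thread (written there with decimal commas and assumed here to be in seconds), and the spreadsheet may well use a different approximation.

```shell
# gflops N SECONDS: approximate GFLOPS for one N x N by N x N multiply,
# using the conventional 2*N^3 operation count (an assumption, see above).
gflops() {
  awk -v n="$1" -v t="$2" 'BEGIN { printf "%.2f\n", (2 * n * n * n) / t / 1e9 }'
}

gflops 10000 23.78046632   # BIDMat MKL timing from the thread: prints 84.10
gflops 10000 445.0935211   # Breeze+Netlib-OpenBlas (native system) timing: prints 4.49
```

The same helper can be applied to the smaller sizes, though for 100x100 matrices call overhead probably dominates and the figure says little about the BLAS itself.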
> 
> Results are available in Google Docs (old experiments are in the other 2 sheets):
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
> Benchmark code:
> https://github.com/avulanov/scala-blas
> 
> Best regards, Alexander
> 
> 
> From: Sam Halliday [mailto:sam.halliday@gmail.com]
> Sent: Thursday, March 26, 2015 9:27 AM
> To: John Canny
> Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks; Ulanov, Alexander
> Subject: Re: Using CUDA within Spark / boosting linear algebra
> John, I have to disagree with you there. Dense matrices come up a lot in industry, although
your personal experience may be different.
> On 26 Mar 2015 16:20, "John Canny" <canny@berkeley.edu> wrote:
> I mentioned this earlier in the thread, but I'll put it out again. Dense BLAS are not
very important for most machine learning workloads, at least for non-image workloads in industry
(and for image processing you would probably want a deep learning/SGD solution with convolution
kernels). For example, it was only relevant for 1/7 of our recent benchmarks, which should be a reasonable
sample. What really matters is sparse BLAS performance. BIDMat is still an order of magnitude
faster there. Those kernels are only in BIDMat, since NVIDIA's sparse BLAS don't perform well
on power-law data.
> 
> It's also the case that the overall performance of an algorithm is determined by the slowest
kernel, not the fastest. If the goal is to get closer to BIDMach's performance on typical
problems, you need to make sure that every kernel runs at comparable speed. So the real question
is how much faster MLlib routines run on a complete problem with/without GPU acceleration.
For BIDMach, it's close to a factor of 10. But that required running entirely on the GPU, and
making sure every kernel is close to its limit.
> 
> -John
> 
> If you think nvblas would be helpful, you should try it in some end-to-end benchmarks.

> On 3/25/15, 6:23 PM, Evan R. Sparks wrote:
> Yeah, much more reasonable - nice to know that we can get full GPU performance from breeze/netlib-java
- meaning there's no compelling performance reason to switch out our current linear algebra
library (at least as far as this benchmark is concerned). 
> 
> Instead, it looks like a user guide for configuring Spark/MLlib to use the right BLAS
library will get us most of the way there. Or, would it make sense to finally ship openblas
compiled for some common platforms (64-bit linux, windows, mac) directly with Spark - hopefully
eliminating the jblas warnings once and for all for most users? (Licensing is BSD) Or am I
missing something?
> 
> On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander <alexander.ulanov@hp.com> wrote:
> As everyone suggested, the results were too good to be true, so I double-checked them.
It turns out that nvblas did not do the multiplication because of the NVBLAS_TILE_DIM parameter in "nvblas.conf"
and returned a zero matrix. My previously posted nvblas results therefore measure matrix copying only.
The default NVBLAS_TILE_DIM==2048 is too big for my graphics card/matrix size. I handpicked
other values that worked. As a result, netlib+nvblas is on par with BIDMat-cuda. As promised,
I am going to post a how-to for nvblas configuration.
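
Until that how-to exists, here is a minimal sketch of what such a configuration could look like. The keyword names and the NVBLAS_CONFIG_FILE environment variable come from NVIDIA's NVBLAS documentation; the tile size 1024 and the CPU BLAS path are placeholders chosen for illustration, not the values that were handpicked in this thread.

```shell
# Write a hypothetical nvblas.conf into a scratch directory.
conf_dir="$(mktemp -d)"
cat > "$conf_dir/nvblas.conf" <<'EOF'
# CPU BLAS that NVBLAS falls back to (placeholder path, adjust to your system)
NVBLAS_CPU_BLAS_LIB /usr/lib64/libopenblas.so
# Use every GPU the driver reports
NVBLAS_GPU_LIST ALL
# Smaller than the 2048 default; 1024 is only an example value
NVBLAS_TILE_DIM 1024
NVBLAS_LOGFILE nvblas.log
EOF

# Point NVBLAS at this file instead of ./nvblas.conf
export NVBLAS_CONFIG_FILE="$conf_dir/nvblas.conf"
```

Given the zero-matrix result described above, it seems prudent to check the product against a CPU result after changing the tile size, rather than trusting the timing alone.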
> 
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
> 
> 
> 
> -----Original Message-----
> From: Ulanov, Alexander
> Sent: Wednesday, March 25, 2015 2:31 PM
> To: Sam Halliday
> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; jfcanny
> Subject: RE: Using CUDA within Spark / boosting linear algebra
> 
> Hi again,
> 
> I finally managed to use nvblas within Spark+netlib-java. It has exceptional performance
for big matrices with Double, faster than BIDMat-cuda with Float. But for smaller matrices,
if you have to copy them to/from the GPU, OpenBLAS or MKL might be a better choice. This agrees
with the original nvblas presentation at GPU conf 2013 (slide 21): http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
> 
> My results:
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
> 
> To be clear, these tests are not meant to generalize about the performance of different libraries.
I just want to pick the library that does dense matrix multiplication best for my task.
> 
> P.S. My previous issue with nvblas was the following: it provides the Fortran blas functions,
whereas netlib-java uses the C cblas functions. So one needs a cblas shared library to
use nvblas through netlib-java. Fedora does not have cblas (but Debian and Ubuntu do), so
I needed to compile it. I could not use cblas from ATLAS or OpenBLAS because they link to
their own implementation and not to the Fortran blas.
> 
> Best regards, Alexander
> 
> -----Original Message-----
> From: Ulanov, Alexander
> Sent: Tuesday, March 24, 2015 6:57 PM
> To: Sam Halliday
> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
> Subject: RE: Using CUDA within Spark / boosting linear algebra
> 
> Hi,
> 
> I am trying to use nvblas with netlib-java from Spark. According to http://docs.nvidia.com/cuda/nvblas/#Usage,
the nvblas functions should replace the current blas function calls once the library is loaded with LD_PRELOAD,
without any changes to netlib-java. It seems
to work for a simple Java example, but I cannot make it work with Spark. I run the following:
> export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
> env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G
> In nvidia-smi I observe that Java is using the GPU:
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID  Type  Process name                               Usage      |
> |=============================================================================|
> |    0      8873    C   bash                                            39MiB |
> |    0      8910    C   /usr/lib/jvm/java-1.7.0/bin/java                39MiB |
> +-----------------------------------------------------------------------------+
> 
> In Spark shell I do matrix multiplication and see the following:
> 15/03/25 06:48:01 INFO JniLoader: successfully loaded /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
> So I am sure that netlib-native is loaded and cblas is supposedly used. However, the matrix
multiplication still executes on the CPU, since I see 16% CPU usage and 0% GPU usage. I also
checked different matrix sizes, from 100x100 to 12000x12000.
> 
> Could you suggest why LD_PRELOAD might not affect the Spark shell?
> 
> Best regards, Alexander
> 
> 
> 
> From: Sam Halliday [mailto:sam.halliday@gmail.com]
> Sent: Monday, March 09, 2015 6:01 PM
> To: Ulanov, Alexander
> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
> Subject: RE: Using CUDA within Spark / boosting linear algebra
> 
> 
> Thanks so much for following up on this!
> 
> Hmm, I wonder if we should have a concerted effort to chart performance on various pieces
of hardware...
> On 9 Mar 2015 21:08, "Ulanov, Alexander" <alexander.ulanov@hp.com> wrote:
> Hi Everyone, I've updated the benchmark as Xiangrui suggested. I added a comment that
BIDMat 0.9.7 uses Float matrices on the GPU (although I see support for Double in the current
source code) and ran the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par
with netlib MKL.
> 
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
> 
> Best regards, Alexander
> 
> -----Original Message-----
> From: Sam Halliday [mailto:sam.halliday@gmail.com]
> Sent: Tuesday, March 03, 2015 1:54 PM
> To: Xiangrui Meng; Joseph Bradley
> Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
> 
> BTW, is anybody on this list going to the London Meetup in a few weeks?
> 
> https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community
> 
> Would be nice to meet other people working on the guts of Spark! :-)
> 
> 
> Xiangrui Meng <mengxr@gmail.com> writes:
> 
> > Hey Alexander,
> >
> > I don't quite understand the part where netlib-cublas is about 20x
> > slower than netlib-openblas. What is the overhead of using a GPU BLAS
> > with netlib-java?
> >
> > CC'ed Sam, the author of netlib-java.
> >
> > Best,
> > Xiangrui
> >
> > On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <joseph@databricks.com> wrote:
> >> Better documentation for linking would be very helpful!  Here's a JIRA:
> >> https://issues.apache.org/jira/browse/SPARK-6019
> >>
> >>
> >> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
> >> <evan.sparks@gmail.com>
> >> wrote:
> >>
> >>> Thanks for compiling all the data and running these benchmarks,
> >>> Alex. The big takeaways here can be seen with this chart:
> >>>
> >>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
> >>>
> >>> 1) A properly configured GPU matrix multiply implementation (e.g.
> >>> BIDMat+GPU) can provide substantial (but less than an order of
> >>> magnitude) benefit over a well-tuned CPU implementation (e.g.
> >>> BIDMat+MKL or netlib-java+openblas-compiled).
> >>> 2) A poorly tuned CPU implementation (netlib-f2jblas or netlib-ref)
> >>> can be 1-2 orders of magnitude worse than a well-tuned CPU
> >>> implementation, particularly for larger matrices. This is not to
> >>> pick on netlib - this basically agrees with the author's own
> >>> benchmarks (https://github.com/fommil/netlib-java)
> >>>
> >>> I think that most of our users are in a situation where using GPUs
> >>> may not be practical - although we could consider having a good GPU
> >>> backend available as an option. However, *ALL* users of MLlib could
> >>> benefit (potentially tremendously) from using a well-tuned CPU-based
> >>> BLAS implementation. Perhaps we should consider updating the mllib
> >>> guide with a more complete section for enabling high performance
> >>> binaries on OSX and Linux? Or better, figure out a way for the
> >>> system to fetch these automatically.
> >>>
> >>> - Evan
> >>>
> >>>
> >>>
> >>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
> >>> alexander.ulanov@hp.com> wrote:
> >>>
> >>>> Just to summarize this thread, I was finally able to make all
> >>>> performance comparisons that we discussed. It turns out that:
> >>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled
> >>>> > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
> >>>>
> >>>> Below is the link to the spreadsheet with full results.
> >>>>
> >>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
> >>>>
> >>>> One thing still needs exploration: does BIDMat-cublas copy
> >>>> data to/from the machine’s RAM?
> >>>>
> >>>> -----Original Message-----
> >>>> From: Ulanov, Alexander
> >>>> Sent: Tuesday, February 10, 2015 2:12 PM
> >>>> To: Evan R. Sparks
> >>>> Cc: Joseph Bradley;
> >>>> dev@spark.apache.org
> >>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Thanks, Evan! It seems that ticket was marked as a duplicate, though
> >>>> the original one discusses a slightly different topic. I was able to
> >>>> link netlib with MKL from the BIDMat binaries. Indeed, MKL is
> >>>> statically linked inside a 60MB library.
> >>>>
> >>>> | A*B size | BIDMat MKL | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
> >>>> +-----------------------------------------------------------------------+
> >>>> | 100x100*100x100 | 0,00205596 | 0,000381 | 0,03810324 | 0,002556 |
> >>>> | 1000x1000*1000x1000 | 0,018320947 | 0,038316857 | 0,51803557 | 1,638475459 |
> >>>> | 10000x10000*10000x10000 | 23,78046632 | 32,94546697 | 445,0935211 | 1569,233228 |
> >>>>
> >>>> It turns out that pre-compiled MKL is faster than pre-compiled
> >>>> OpenBLAS on my machine. I’ll probably add two more columns with
> >>>> locally compiled openblas and cuda.
> >>>>
> >>>> Alexander
> >>>>
> >>>> From: Evan R. Sparks
> >>>> [mailto:evan.sparks@gmail.com]
> >>>> Sent: Monday, February 09, 2015 6:06 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley;
> >>>> dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Great - perhaps we can move this discussion off-list and onto a
> >>>> JIRA ticket? (Here's one:
> >>>> https://issues.apache.org/jira/browse/SPARK-5705)
> >>>>
> >>>> It seems like this is going to be somewhat exploratory for a while
> >>>> (and there's probably only a handful of us who really care about
> >>>> fast linear
> >>>> algebra!)
> >>>>
> >>>> - Evan
> >>>>
> >>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
> >>>> alexander.ulanov@hp.com> wrote:
> >>>> Hi Evan,
> >>>>
> >>>> Thank you for the explanation and the useful link. I am going to build
> >>>> OpenBLAS, link it with Netlib-java and run the benchmark again.
> >>>>
> >>>> Do I understand correctly that BIDMat binaries contain statically
> >>>> linked Intel MKL BLAS? It might be the reason why I am able to run
> >>>> BIDMat not having MKL BLAS installed on my server. If it is true, I
> >>>> wonder if it is OK because Intel sells this library. Nevertheless,
> >>>> it seems that in my case precompiled MKL BLAS performs better than
> >>>> precompiled OpenBLAS given that BIDMat and Netlib-java are supposed
to be on par with JNI overheads.
> >>>>
> >>>> Still, it might be interesting to link Netlib-java with Intel MKL,
> >>>> as you suggested. I wonder whether John Canny (BIDMat) and Sam
> >>>> Halliday (Netlib-java) would be interested in comparing their libraries.
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Evan R. Sparks [mailto:evan.sparks@gmail.com]
> >>>> Sent: Friday, February 06, 2015 5:58 PM
> >>>>
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley;
> >>>> dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I would build OpenBLAS yourself, since good BLAS performance comes
> >>>> from getting cache sizes, etc. set up correctly for your particular
> >>>> hardware - this is often a very tricky process (see, e.g. ATLAS),
> >>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
> >>>> quickly and yields performance competitive with MKL.
> >>>>
> >>>> To make sure the right library is getting used, you have to make
> >>>> sure it's first on the search path - export
> >>>> LD_LIBRARY_PATH=/path/to/blas/library/dir (the directory that
> >>>> contains the .so) will do the trick here.
> >>>>
> >>>> For some examples of getting netlib-java setup on an ec2 node and
> >>>> some example benchmarking code we ran a while back, see:
> >>>> https://github.com/shivaram/matrix-bench
> >>>>
> >>>> In particular - build-openblas-ec2.sh shows you how to build the
> >>>> library and set up symlinks correctly, and scala/run-netlib.sh
> >>>> shows you how to get the path setup and get that library picked up by
netlib-java.
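
The symlink and search-path setup described above can be sketched roughly as follows. This is not the content of build-openblas-ec2.sh or run-netlib.sh. It assumes that netlib-java's native_system backend resolves libblas.so.3 and liblapack.so.3 through the dynamic loader, and that the OpenBLAS build also provides LAPACK; the OpenBLAS path is a placeholder.

```shell
# Hypothetical location of a locally built OpenBLAS; override via OPENBLAS_LIB.
openblas_lib="${OPENBLAS_LIB:-/opt/OpenBLAS/lib/libopenblas.so}"

# Directory holding the names the loader is assumed to look for.
blas_dir="$(mktemp -d)"
ln -sf "$openblas_lib" "$blas_dir/libblas.so.3"
ln -sf "$openblas_lib" "$blas_dir/liblapack.so.3"

# LD_LIBRARY_PATH takes directories, so prepend the directory, not the .so file.
export LD_LIBRARY_PATH="$blas_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

echo "BLAS symlinks in $blas_dir point to $openblas_lib"
```

Any JVM started from the same shell afterwards (for example spark-shell) would inherit this search path. A scratch directory is used here only to keep the sketch side-effect free; a real setup would more likely use a fixed directory or the distribution's alternatives mechanism.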
> >>>>
> >>>> In this way - you could probably get cuBLAS set up to be used by
> >>>> netlib-java as well.
> >>>>
> >>>> - Evan
> >>>>
> >>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
> >>>> alexander.ulanov@hp.com> wrote:
> >>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
> >>>> load the right blas? For netlib, there are a few JVM
> >>>> flags, such as
> >>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
> >>>> so I can force it to use the Java implementation. I am not sure I understand
how to force the use of a specific blas (as opposed to a specific wrapper for blas).
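
To make the wrapper-versus-library distinction concrete: the JVM property only selects which netlib-java wrapper class is used, while the native library behind the system wrapper is resolved separately by the operating system's loader. The sketch below just builds the property string. The function name netlib_blas_flag is made up; F2jBLAS and NativeSystemBLAS are class names from the netlib-java README.

```shell
# netlib_blas_flag CLASS: print the JVM property that selects a netlib-java
# BLAS wrapper class (this picks the wrapper, not the native library).
netlib_blas_flag() {
  printf '%s%s\n' '-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.' "$1"
}

netlib_blas_flag F2jBLAS            # pure-Java fallback
netlib_blas_flag NativeSystemBLAS   # JNI wrapper around the system BLAS

# Example launch, shown as a comment because it needs a Spark install:
#   ./spark-shell --driver-java-options "$(netlib_blas_flag NativeSystemBLAS)"
```

With the NativeSystemBLAS wrapper selected, which BLAS actually runs then depends on what the loader finds first, which is what the LD_LIBRARY_PATH advice elsewhere in this thread addresses.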
> >>>>
> >>>> Btw. I have installed openblas (yum install openblas), so I suppose
> >>>> that netlib is using it.
> >>>>
> >>>> From: Evan R. Sparks [mailto:evan.sparks@gmail.com]
> >>>> Sent: Friday, February 06, 2015 5:19 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley;
> >>>> dev@spark.apache.org
> >>>>
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Getting breeze to pick up the right blas library is critical for
> >>>> performance. I recommend using OpenBLAS (or MKL, if you already have
it).
> >>>> It might make sense to force BIDMat to use the same underlying BLAS
> >>>> library as well.
> >>>>
> >>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
> >>>> alexander.ulanov@hp.com> wrote:
> >>>> Hi Evan, Joseph
> >>>>
> >>>> I ran a few matrix multiplication tests, and BIDMat seems to be ~10x
> >>>> faster than netlib-java+breeze (sorry for the weird table formatting):
> >>>>
> >>>> | A*B size | BIDMat MKL | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
> >>>> +-----------------------------------------------------------------------+
> >>>> | 100x100*100x100 | 0,00205596 | 0,03810324 | 0,002556 |
> >>>> | 1000x1000*1000x1000 | 0,018320947 | 0,51803557 | 1,638475459 |
> >>>> | 10000x10000*10000x10000 | 23,78046632 | 445,0935211 | 1569,233228 |
> >>>>
> >>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora
> >>>> 19 Linux, Scala 2.11.
> >>>>
> >>>> Later I will run tests with CUDA. I need to install a new CUDA
> >>>> version for this purpose.
> >>>>
> >>>> Do you have any ideas why breeze-netlib with native blas is so much
> >>>> slower than BIDMat MKL?
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Joseph Bradley [mailto:joseph@databricks.com]
> >>>> Sent: Thursday, February 05, 2015 5:29 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Evan R. Sparks;
> >>>> dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Hi Alexander,
> >>>>
> >>>> Using GPUs with Spark would be very exciting.  Small comment:
> >>>> Concerning your question earlier about keeping data stored on the
> >>>> GPU rather than having to move it between main memory and GPU
> >>>> memory on each iteration, I would guess this would be critical to
> >>>> getting good performance.  If you could do multiple local
> >>>> iterations before aggregating results, then the cost of data
> >>>> movement to the GPU could be amortized (and I believe that is done
> >>>> in practice).  Having Spark be aware of the GPU and using it as another
part of memory sounds like a much bigger undertaking.
> >>>>
> >>>> Joseph
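
The amortization argument above can be illustrated with back-of-the-envelope arithmetic. All numbers below are invented for illustration (10 iterations, 0.1 s of GPU compute per iteration, 0.5 s per host-to-device copy), and the helper name total_time is hypothetical.

```shell
# total_time ITERS COMPUTE_S TRANSFER_S TRANSFERS:
# total seconds = ITERS * COMPUTE_S + TRANSFERS * TRANSFER_S
total_time() {
  awk -v k="$1" -v c="$2" -v x="$3" -v m="$4" \
    'BEGIN { printf "%.2f\n", k * c + m * x }'
}

iters=10; compute=0.1; transfer=0.5

total_time "$iters" "$compute" "$transfer" "$iters"   # copy on every iteration: prints 6.00
total_time "$iters" "$compute" "$transfer" 1          # copy once, keep data on the GPU: prints 1.50
```

With these made-up figures, keeping the data resident on the GPU makes the run four times shorter; the real ratio depends entirely on how transfer time compares with compute time per iteration.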
> >>>>
> >>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <
> >>>> alexander.ulanov@hp.com> wrote:
> >>>> Thank you for the explanation! I’ve watched the BIDMach presentation by
> >>>> John Canny and I am really inspired by his talk and comparisons with
Spark MLlib.
> >>>>
> >>>> I am very interested to find out what will be better within Spark:
> >>>> BIDMat or netlib-java with CPU or GPU natives. Could you suggest a
> >>>> fair way to benchmark them? Currently I do benchmarks on artificial
> >>>> neural networks in batch mode. While it is not a “pure” test of
> >>>> linear algebra, it involves some other things that are essential to
machine learning.
> >>>>
> >>>> From: Evan R. Sparks [mailto:evan.sparks@gmail.com]
> >>>> Sent: Thursday, February 05, 2015 1:29 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc:
> >>>> dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
> >>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to
> >>>> data layout and fewer levels of indirection - it's definitely a
> >>>> worthwhile experiment to run. The main speedups I've seen from
> >>>> using it come from highly optimized GPU code for linear algebra. I
> >>>> know that in the past Canny has gone as far as to write custom GPU
> >>>> kernels for performance-critical regions of code.[1]
> >>>>
> >>>> BIDMach is highly optimized for single node performance or
> >>>> performance on small clusters.[2] Once data doesn't fit easily in
> >>>> GPU memory (or can't be batched to fit) the performance tends to
> >>>> fall off. Canny argues for hardware/software codesign and as such
> >>>> prefers machine configurations that are quite different from what
> >>>> we find in most commodity cluster nodes - e.g. 10 disk channels and
> >>>> 4 GPUs.
> >>>>
> >>>> In contrast, MLlib was designed for horizontal scalability on
> >>>> commodity clusters and works best on very big datasets - order of terabytes.
> >>>>
> >>>> For the most part, these projects developed concurrently to address
> >>>> slightly different use cases. That said, there may be bits of
> >>>> BIDMach we could repurpose for MLlib - keep in mind we need to be
> >>>> careful about maintaining cross-language compatibility for our Java
> >>>> and Python users, though.
> >>>>
> >>>> - Evan
> >>>>
> >>>> [1] - http://arxiv.org/abs/1409.5402
> >>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
> >>>>
> >>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander
> >>>> <alexander.ulanov@hp.com> wrote:
> >>>> Hi Evan,
> >>>>
> >>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do
> >>>> you know what makes it faster than netlib-java?
> >>>>
> >>>> The same group has the BIDMach library, which implements machine
> >>>> learning. For some examples they use the Caffe convolutional neural
> >>>> network library, owned by another group in Berkeley. Could you
> >>>> elaborate on how these all might be connected with Spark MLlib? If
> >>>> you take BIDMat for linear algebra, why don’t you take BIDMach for
> >>>> optimization and learning?
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Evan R. Sparks [mailto:evan.sparks@gmail.com]
> >>>> Sent: Thursday, February 05, 2015 12:09 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
> >>>> BLAS in many cases.
> >>>>
> >>>> You might consider taking a look at the codepaths that BIDMat
> >>>> (https://github.com/BIDData/BIDMat) takes and comparing them to
> >>>> netlib-java/breeze. John Canny et al. have done a bunch of work
> >>>> optimizing to make this work really fast from Scala. I've run it on
> >>>> my laptop and compared to MKL and in certain cases it's 10x faster at
> >>>> matrix multiply. There are a lot of layers of indirection here and
> >>>> you really want to avoid data copying as much as possible.
> >>>>
> >>>> We could also consider swapping Breeze out for BIDMat, but that
> >>>> would be a big project, and if we can figure out how to get
> >>>> breeze+cublas to comparable performance that would be a big win.
> >>>>
> >>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander
> >>>> <alexander.ulanov@hp.com> wrote:
> >>>> Dear Spark developers,
> >>>>
> >>>> I am exploring how to make linear algebra operations faster within Spark.
> >>>> One way of doing this is to use the Scala Breeze library that is
> >>>> bundled with Spark. For matrix operations, it employs netlib-java,
> >>>> which provides a Java wrapper for BLAS (Basic Linear Algebra
> >>>> Subprograms) and LAPACK native binaries if they are available on the
> >>>> worker node. It also has its own optimized Java implementation of
> >>>> BLAS. It is worth mentioning that native binaries provide better
> >>>> performance only for BLAS level 3, i.e. matrix-matrix operations or
> >>>> general matrix multiplication (GEMM). This is confirmed by the GEMM
> >>>> test on the netlib-java page, https://github.com/fommil/netlib-java.
> >>>> I also confirmed it with my experiments with training of an
> >>>> artificial neural network:
> >>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952
> >>>> However, I would like to boost performance more.
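
One common explanation for why only level 3 benefits from native code is arithmetic intensity: the number of floating-point operations performed per matrix or vector element touched. The sketch below is a rough illustration, not a measurement. `BlasIntensity` and its methods are hypothetical names, and the counts assume square n x n matrices, length-n vectors, and each operand element counted once.

```scala
object BlasIntensity {
  /** Level 1, y := a*x + y: 2n flops over 2n elements (x and y). */
  def axpy(n: Long): Double = (2.0 * n) / (2.0 * n)

  /** Level 2, y := A*x + y: 2n^2 flops over n^2 + 2n elements (A, x and y). */
  def gemv(n: Long): Double = (2.0 * n * n) / (1.0 * n * n + 2.0 * n)

  /** Level 3, C := A*B + C: 2n^3 flops over 3n^2 elements (A, B and C). */
  def gemm(n: Long): Double = (2.0 * n * n * n) / (3.0 * n * n)
}
```

Under these assumptions, the intensity stays near 1 to 2 flops per element for levels 1 and 2 regardless of size, while for GEMM it grows linearly with n. That leaves room for cache blocking and vectorization in a tuned native library to pay off only at level 3.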
> >>>>
> >>>> GPUs are supposed to be fast at linear algebra, and there is an
> >>>> Nvidia CUDA implementation of BLAS called cuBLAS. I have one Linux
> >>>> server with an Nvidia GPU and I was able to do the following. I
> >>>> linked cuBLAS (instead of CPU-based BLAS) with the netlib-java
> >>>> wrapper and put it into Spark, so Breeze/netlib-java is using it.
> >>>> Then I did some performance measurements of artificial neural
> >>>> network batch learning in Spark MLlib, which involves matrix-matrix
> >>>> multiplications. It turns out that for matrices smaller than
> >>>> ~1000x780, GPU cuBLAS has the same speed as CPU BLAS. cuBLAS becomes
> >>>> slower for bigger matrices. It is worth mentioning that this was not
> >>>> a test of ONLY multiplication, since there are other operations
> >>>> involved. One of the reasons for the slowdown might be the overhead
> >>>> of copying the matrices from main memory to graphics card memory and
> >>>> back.
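
The copy-overhead hypothesis can be explored with a back-of-the-envelope model. The sketch below assumes that a wrapper copies both operands to the device and the result back on every call, and that compute time is simply flops divided by sustained throughput. `OffloadModel`, `Hardware`, and every figure passed to them are hypothetical placeholders to be replaced with measured values. The model is not an explanation of the numbers reported above.

```scala
object OffloadModel {
  /** Sustained throughput figures for one machine; all values are assumptions supplied by the caller. */
  final case class Hardware(cpuGflops: Double, gpuGflops: Double, pcieGBps: Double)

  /** Seconds to multiply two n x n double matrices on the CPU (2n^3 flops). */
  def cpuSeconds(n: Long, hw: Hardware): Double =
    2.0 * n * n * n / (hw.cpuGflops * 1e9)

  /** Seconds on the GPU when all operands already reside in device memory. */
  def gpuSecondsResident(n: Long, hw: Hardware): Double =
    2.0 * n * n * n / (hw.gpuGflops * 1e9)

  /** Seconds on the GPU when A and B are copied in and C is copied out on every call. */
  def gpuSecondsWithCopy(n: Long, hw: Hardware): Double = {
    val bytesMoved = 3.0 * n * n * 8 // three n x n matrices of 8-byte doubles
    bytesMoved / (hw.pcieGBps * 1e9) + gpuSecondsResident(n, hw)
  }

  /** Fraction of the per-call GPU time spent on host-device transfers. */
  def copyFraction(n: Long, hw: Hardware): Double =
    1.0 - gpuSecondsResident(n, hw) / gpuSecondsWithCopy(n, hw)
}
```

Comparing `gpuSecondsWithCopy` against `gpuSecondsResident` for the matrix sizes in a given workload indicates how much could be gained by keeping intermediate results in device memory between calls. cuBLAS itself operates on device pointers, so whether data is copied per call depends on the wrapper sitting on top of it.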
> >>>>
> >>>> So, a few questions:
> >>>> 1) Do these results with CUDA make sense?
> >>>> 2) If the problem is with copy overhead, are there any libraries
> >>>> that can force intermediate results to stay in graphics card
> >>>> memory, thus removing the overhead?
> >>>> 3) Are there any other options to speed up linear algebra in Spark?
> >>>>
> >>>> Thank you, Alexander
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >>>> For additional commands, e-mail: dev-help@spark.apache.org
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> 
> --
> Best regards,
> Sam
> 
> 
> 

