spark-dev mailing list archives

From jfcanny <>
Subject Re: Using CUDA within Spark / boosting linear algebra
Date Fri, 13 Mar 2015 03:50:24 GMT
If you're contemplating GPU acceleration in Spark, it's important to look
beyond BLAS. Dense BLAS probably accounts for only 10% of the cycles in the
datasets we've tested in BIDMach, and we've tried to make them
representative of industry machine learning workloads. Unless you're
crunching images or audio, the majority of data will be very sparse and
power-law distributed. You need a good sparse BLAS, and in practice it seems
like you need a sparse BLAS tailored for power-law data. We had to write our
own since the NVIDIA libraries didn't perform well on typical power-law data.
Intel MKL sparse BLAS also has issues, and we only use some of its routines.
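
To make the sparse BLAS point concrete, here is a minimal sketch of a sparse matrix-vector product over CSR storage, in plain Python. It is not BIDMat's kernel or any library's API; the function names (to_csr, csr_matvec, power_law_rows) and the Pareto exponent are assumptions chosen for illustration. The point is only that with power-law data the inner loop length varies enormously from row to row, which is what a one-thread-per-row GPU kernel handles badly.

```python
import random

def to_csr(rows):
    """Convert a list of {col: value} dicts into CSR arrays."""
    indptr, indices, data = [0], [], []
    for row in rows:
        for col in sorted(row):
            indices.append(col)
            data.append(row[col])
        indptr.append(len(indices))  # end offset of this row
    return indptr, indices, data

def csr_matvec(indptr, indices, data, x):
    """y = A @ x for a CSR matrix A. Work per row equals that row's nnz."""
    y = [0.0] * (len(indptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
        y[i] = acc
    return y

def power_law_rows(n_rows, n_cols, alpha=1.5, seed=0):
    """Hypothetical generator: row lengths drawn from a Pareto distribution."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        nnz = min(n_cols, int(rng.paretovariate(alpha)))  # always >= 1
        rows.append({c: 1.0 for c in rng.sample(range(n_cols), nnz)})
    return rows

# Small hand-checked case: rows [1 0 2], [0 0 0], [0 3 0] times x = [1, 2, 3].
indptr, indices, data = to_csr([{0: 1.0, 2: 2.0}, {}, {1: 3.0}])
print(csr_matvec(indptr, indices, data, [1, 2, 3]))  # → [7.0, 0.0, 6.0]

# Skew in per-row work for synthetic power-law data.
lengths = sorted(len(r) for r in power_law_rows(10000, 5000))
print("median nnz per row:", lengths[len(lengths) // 2], "max:", lengths[-1])
```

A few very long rows next to many very short ones is the load-balancing problem a power-law-aware kernel has to solve.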

You also need 2D reductions, scan operations, slicing, element-wise
transcendental functions and operators, many kinds of sort, random number
generators, etc., and some kind of memory management strategy. Some of this
was layered on top of Thrust in BIDMat, but most had to be written from
scratch. It's all been rooflined, typically to the memory throughput of
current GPUs (around 200 GB/s). 
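
As a rough illustration of what "rooflined to memory throughput" means, the sketch below applies the standard roofline bound: attainable rate is the smaller of peak compute and bandwidth times arithmetic intensity. The 200 GB/s figure is from the text above; the 4000 GFLOP/s peak and the function names are assumptions for the example, not measurements.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline bound: min(compute roof, memory roof)."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

def streaming_time_s(n_bytes, bandwidth_gbs):
    """Lower bound on the time to move n_bytes at the given bandwidth."""
    return n_bytes / (bandwidth_gbs * 1e9)

# Element-wise float32 op: read 4 bytes, write 4 bytes, 1 flop per element,
# so arithmetic intensity is 1/8 flop per byte.
PEAK_GFLOPS = 4000.0   # assumed peak, for illustration only
BANDWIDTH_GBS = 200.0  # figure quoted in the text

print(attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, 1 / 8))  # → 25.0
# One billion float32 elements, read once and written once (8e9 bytes).
print(streaming_time_s(8e9, BANDWIDTH_GBS))
```

At that intensity the kernel is far below the compute roof, so memory bandwidth is the only limit worth tuning against.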

When you have all this, you can write learning algorithms in the same
high-level primitives available in Breeze or NumPy/SciPy. It's literally the
same in BIDMat, since the generic matrix operations are implemented on both
CPU and GPU, so the same code runs on either platform. 

A lesser-known fact is that GPUs are around 10x faster for *all* those
operations, not just dense BLAS. It's mostly due to faster streaming memory
speeds, but some kernels (random number generation and transcendentals) are
more than an order of magnitude faster thanks to some specialized hardware for
power series on the GPU chip. 

When you have all this, there is no need to move data back and forth across
the PCI bus. The CPU only has to pull chunks of data off disk, unpack them,
and feed them to the available GPUs. Most models fit comfortably in GPU
memory these days (4-12 GB). With minibatch algorithms you can push TBs of
data through the GPU this way. 
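
The streaming pattern described above can be sketched in plain Python: the model stays resident while minibatches flow past it one at a time. This runs on the CPU and only shows the data flow; the generator stands in for "pull chunks off disk and unpack them", and the names, the synthetic linear data, and the learning rate are all assumptions for the example.

```python
import random

TRUE_W = [2.0, -1.0]  # weights used to synthesize the stand-in data

def minibatches(n_examples, batch_size, seed=0):
    """Yield one batch at a time; only the current batch is ever in memory."""
    rng = random.Random(seed)
    for start in range(0, n_examples, batch_size):
        batch = []
        for _ in range(min(batch_size, n_examples - start)):
            x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
            y = sum(w * xi for w, xi in zip(TRUE_W, x))
            batch.append((x, y))
        yield batch

def sgd_step(w, batch, lr):
    """Update w in place with the mean squared-error gradient of one batch."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    for j in range(len(w)):
        w[j] -= lr * grad[j] / len(batch)

def train(n_examples=20000, batch_size=100, lr=0.5):
    w = [0.0, 0.0]  # the model: small and resident for the whole run
    for batch in minibatches(n_examples, batch_size):
        sgd_step(w, batch, lr)
    return w

print(train())  # approaches TRUE_W
```

Memory use is bounded by the model plus one batch, regardless of how much data is streamed through.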
