spark-dev mailing list archives

From Reynold Xin <>
Subject Re: Using CUDA within Spark / boosting linear algebra
Date Fri, 13 Mar 2015 05:53:52 GMT
Thanks for chiming in, John. I missed your meetup last night - do you have
any writeups or slides about roofline design? In particular, I'm curious
about what optimizations are available for power-law dense * sparse. (I
don't have any background in optimizations.)

On Thu, Mar 12, 2015 at 8:50 PM, jfcanny <> wrote:

> If you're contemplating GPU acceleration in Spark, it's important to look
> beyond BLAS. Dense BLAS probably account for only 10% of the cycles in the
> datasets we've tested in BIDMach, and we've tried to make them
> representative of industry machine learning workloads. Unless you're
> crunching images or audio, the majority of data will be very sparse and
> power-law distributed. You need a good sparse BLAS, and in practice it
> seems like you need a sparse BLAS tailored for power-law data. We had to
> write our own since the NVIDIA libraries didn't perform well on typical
> power-law data. Intel MKL sparse BLAS also have issues and we only use
> some of them.
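
For readers following the dense * sparse question, here is a minimal sketch of what such a product computes when the sparse operand is stored in compressed sparse column (CSC) form. It is plain Python for illustration only and is not BIDMat's kernel; the names `dense_times_csc` and `nnz_per_column` are hypothetical. The point it shows is that the work per column is proportional to that column's nonzero count, so data with a few very heavy columns gives very uneven work, which is one plausible reason a kernel tuned for uniform sparsity could struggle on power-law data.

```python
def dense_times_csc(a, col_ptr, row_idx, values):
    """Return C = A * S, with A dense (list of rows) and S sparse in CSC form."""
    m = len(a)
    n = len(col_ptr) - 1
    c = [[0.0] * n for _ in range(m)]
    for j in range(n):
        # The inner work for column j is proportional to its nonzero count.
        for p in range(col_ptr[j], col_ptr[j + 1]):
            r, v = row_idx[p], values[p]
            for i in range(m):
                c[i][j] += a[i][r] * v
    return c


def nnz_per_column(col_ptr):
    """Nonzero count of each column, read directly from the CSC column pointers."""
    return [col_ptr[j + 1] - col_ptr[j] for j in range(len(col_ptr) - 1)]


# A is 2 x 3. S is 3 x 3 with one heavy column, one light column, one empty column.
A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
col_ptr = [0, 3, 4, 4]
row_idx = [0, 1, 2, 1]
values = [1.0, 1.0, 1.0, 2.0]

C = dense_times_csc(A, col_ptr, row_idx, values)
print(C)
print(nnz_per_column(col_ptr))
```

In this toy input the first column carries three of the four nonzeros, so it accounts for three quarters of the multiply-add work.
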
> You also need 2D reductions, scan operations, slicing, element-wise
> transcendental functions and operators, many kinds of sort, random number
> generators etc., and some kind of memory management strategy. Some of this
> was layered on top of Thrust in BIDMat, but most had to be written from
> scratch. It's all been rooflined, typically to memory throughput of current
> GPUs (around 200 GB/s).
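
As a rough illustration of what "rooflined to memory throughput" means, the roofline model bounds attainable throughput by the smaller of peak compute and memory bandwidth times arithmetic intensity (flops per byte moved). The sketch below uses the 200 GB/s figure from the message; the peak compute figure and the per-nonzero byte count (a 4-byte value plus a 4-byte index, 2 flops) are assumptions chosen for illustration, not measurements, and `roofline_gflops` is a hypothetical helper name.

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Upper bound on throughput (GFLOP/s) under the roofline model."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


BANDWIDTH_GBS = 200.0   # memory throughput quoted in the message
PEAK_GFLOPS = 4000.0    # assumed peak compute, for illustration only

# Assumed sparse kernel: 2 flops per nonzero, 8 bytes read per nonzero.
sparse_intensity = 2.0 / 8.0

bound = roofline_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, sparse_intensity)
print(bound)  # 50.0
```

Under these assumptions the kernel is memory-bound: it can reach at most 50 GFLOP/s however much compute is available, so the useful target is the bandwidth limit, not the peak flop rate.
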
> When you have all this you can write learning algorithms in the same
> high-level primitives available in Breeze or NumPy/SciPy. It's literally the
> same in BIDMat, since the generic matrix operations are implemented on both
> CPU and GPU, so the same code runs on either platform.
> A lesser-known fact is that GPUs are around 10x faster for *all* those
> operations, not just dense BLAS. It's mostly due to faster streaming memory
> speeds, but some kernels (random number generation and transcendentals) are
> more than an order of magnitude faster thanks to some specialized hardware
> for power series on the GPU chip.
> When you have all this there is no need to move data back and forth across
> the PCI bus. The CPU only has to pull chunks of data off disk, unpack them,
> and feed them to the available GPUs. Most models fit comfortably in GPU
> memory these days (4-12 GB). With minibatch algorithms you can push TBs of
> data through the GPU this way.
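
The streaming pattern described above can be sketched in a few lines. This is a plain-Python stand-in, not BIDMach code: the single `weight` plays the role of a model that stays resident on the device, and each minibatch plays the role of a chunk pulled off disk and fed to it. The names `minibatches` and `sgd_pass`, the toy data, and the learning rate are all assumptions made for the example.

```python
def minibatches(samples, batch_size):
    """Yield consecutive chunks, so only one chunk is handled at a time."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def sgd_pass(weight, batches, lr):
    """One pass of minibatch SGD for the 1-D least-squares model y = weight * x."""
    for batch in batches:
        # Mean gradient of 0.5 * (weight * x - y)^2 over the minibatch.
        grad = sum((weight * x - y) * x for x, y in batch) / len(batch)
        weight -= lr * grad
    return weight


# Toy data generated from y = 3x; a real pipeline would stream chunks from disk.
data = [(float(x), 3.0 * x) for x in range(1, 9)]

w = 0.0
for _ in range(200):
    w = sgd_pass(w, minibatches(data, 4), lr=0.01)
print(w)
```

Only the current minibatch and the small model state are live at any point, which is why the total data volume can be far larger than the memory holding the model.
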