spark-dev mailing list archives

From jfcanny <>
Subject Re: Using CUDA within Spark / boosting linear algebra
Date Fri, 13 Mar 2015 16:43:28 GMT
Hi Reynold,
I left Chester with a copy of the slides, so I assume they'll be posted 
on the SF ML or Big Data sites. We have a draft paper under review. I 
can ask the co-authors about arxiv'ing it.

We have a few heuristics for power-law data. One of them is to keep the 
feature set sorted by frequency. Power-law data has roughly the same 
mass in each power-of-two range of feature rank. By keeping the 
most frequent features together, you get a lot more value out of the 
caches on the device (even GPUs have them, albeit smaller ones). E.g. 
with 100 million features, 1/2 of the feature instances will be in the 
range 1,...,10,000. If they're consecutive they will all hit a fast 
cache. Another 1/4 will be in 10,001,...,1,000,000, hitting the next 
cache, etc.
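
A rough sketch of the idea in Python, to make the numbers concrete. It 
assumes an idealized Zipf distribution with exponent 1 (real data will 
differ), and the function names (mass_fraction, frequency_order, remap) 
are made up for illustration; none of this is BIDMat code.

```python
import math
from collections import Counter

EULER_GAMMA = 0.5772156649015329

def harmonic(n):
    """Approximate the n-th harmonic number H(n) as ln(n) + gamma."""
    return math.log(n) + EULER_GAMMA

def mass_fraction(top_k, num_features):
    """Share of feature instances carried by the top_k most frequent features,
    ASSUMING frequency proportional to 1/rank (Zipf, exponent 1)."""
    return harmonic(top_k) / harmonic(num_features)

def frequency_order(rows):
    """Map old feature id -> new id, where new id 0 is the most frequent feature."""
    counts = Counter(f for row in rows for f in row)
    # Descending count; ties broken by old id so the result is deterministic.
    ranked = sorted(counts, key=lambda f: (-counts[f], f))
    return {old: new for new, old in enumerate(ranked)}

def remap(rows, mapping):
    """Rewrite each row with the new ids, so hot features get small, adjacent ids."""
    return [sorted(mapping[f] for f in row) for row in rows]

# About half of the mass sits in the first 10,000 of 100 million features,
# and about three quarters in the first 1,000,000.
half = mass_fraction(10_000, 100_000_000)
three_quarters = mass_fraction(1_000_000, 100_000_000)

# Toy data: each row is a list of feature ids present in one example.
rows = [[7, 42], [42], [42, 7, 3], [42, 99]]
mapping = frequency_order(rows)
remapped = remap(rows, mapping)
```

After remapping, the most frequent feature has id 0, the next has id 1, 
and so on, so the hot part of any per-feature array is contiguous.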

Another is to subdivide sparse matrices using the vector of elements 
rather than rows or columns. Splitting power-law matrices by either rows 
or columns gives very uneven splits. That means we store sparse matrices 
in coordinate form rather than compressed row or column format.
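
To illustrate why the split matters, here is a small Python sketch on a 
toy matrix. The matrix (row r holds about 64/(r+1) nonzeros) and the 
helper names are assumptions for illustration only, not how BIDMat does 
it internally.

```python
def split_by_nonzeros(coo, parts):
    """Cut a list of (row, col, value) triples into `parts` contiguous chunks
    whose sizes differ by at most one element."""
    n = len(coo)
    bounds = [n * i // parts for i in range(parts + 1)]
    return [coo[bounds[i]:bounds[i + 1]] for i in range(parts)]

def split_by_rows(coo, num_rows, parts):
    """For comparison: give each chunk an equal range of consecutive rows."""
    chunks = [[] for _ in range(parts)]
    for r, c, v in coo:
        chunks[r * parts // num_rows].append((r, c, v))
    return chunks

# Toy power-law matrix in coordinate form: row r holds 64 // (r + 1) nonzeros.
num_rows = 8
coo = [(r, c, 1.0) for r in range(num_rows) for c in range(64 // (r + 1))]

even_sizes = [len(chunk) for chunk in split_by_nonzeros(coo, 4)]
row_sizes = [len(chunk) for chunk in split_by_rows(coo, num_rows, 4)]
```

On this toy input the element split gives four chunks of 43 nonzeros 
each, while the row split gives 96, 37, 22 and 17. Coordinate form makes 
the element split trivial because every nonzero carries its own row and 
column index.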

Other than that, rooflining gives you a goal that you should be able to 
reach. If you aren't at the limit, just knowing that gives you a target 
to aim at. You can try profiling the kernel to figure out why it's slower 
than it should be. There are a few common reasons (low occupancy, 
imbalanced thread blocks, thread divergence) that you can discover with 
the profiler. Then hopefully you can solve them.
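
For reference, the roofline bound itself is just a min() of two limits. 
A minimal Python sketch follows. The 200 GB/s figure is the memory 
throughput mentioned in my earlier message; the peak compute number, the 
0.25 flop/byte intensity and the measured rate are made-up placeholders, 
not measurements.

```python
def roofline_gflops(flops_per_byte, peak_gflops, bandwidth_gb_s):
    """Attainable GFLOP/s for a kernel: limited either by peak compute or by
    memory bandwidth times arithmetic intensity (flops per byte moved)."""
    return min(peak_gflops, flops_per_byte * bandwidth_gb_s)

def fraction_of_roofline(measured_gflops, flops_per_byte, peak_gflops, bandwidth_gb_s):
    """How close a measured kernel is to its roofline bound (1.0 = at the limit)."""
    return measured_gflops / roofline_gflops(flops_per_byte, peak_gflops, bandwidth_gb_s)

BANDWIDTH_GB_S = 200.0   # from the thread: memory throughput of current GPUs
PEAK_GFLOPS = 4000.0     # hypothetical peak compute, for illustration only

# A memory-bound sparse kernel, assumed to do 0.25 flops per byte moved.
bound = roofline_gflops(0.25, PEAK_GFLOPS, BANDWIDTH_GB_S)

# A hypothetical measurement of 20 GFLOP/s for that kernel.
achieved = fraction_of_roofline(20.0, 0.25, PEAK_GFLOPS, BANDWIDTH_GB_S)
```

With these placeholder numbers the bound is 50 GFLOP/s and the kernel 
reaches 40% of it, which is the signal to go look at the profiler.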


On 3/12/2015 10:56 PM, rxin [via Apache Spark Developers List] wrote:
> Thanks for chiming in, John. I missed your meetup last night - do you have
> any writeups or slides about roofline design? In particular, I'm curious
> about what optimizations are available for power-law dense * sparse? (I
> don't have any background in optimizations)
>
> On Thu, Mar 12, 2015 at 8:50 PM, jfcanny <[hidden email]> wrote:
> > If you're contemplating GPU acceleration in Spark, it's important to look
> > beyond BLAS. Dense BLAS probably accounts for only 10% of the cycles in the
> > datasets we've tested in BIDMach, and we've tried to make them
> > representative of industry machine learning workloads. Unless you're
> > crunching images or audio, the majority of data will be very sparse and
> > power-law distributed. You need a good sparse BLAS, and in practice it
> > seems like you need a sparse BLAS tailored for power-law data. We had to
> > write our own since the NVIDIA libraries didn't perform well on typical
> > power-law data. The Intel MKL sparse BLAS routines also have issues, and
> > we only use some of them.
> >
> > You also need 2D reductions, scan operations, slicing, element-wise
> > transcendental functions and operators, many kinds of sort, random number
> > generators etc., and some kind of memory management strategy. Some of this
> > was layered on top of Thrust in BIDMat, but most had to be written from
> > scratch. It's all been rooflined, typically to the memory throughput of
> > current GPUs (around 200 GB/s).
> >
> > When you have all this you can write learning algorithms in the same
> > high-level primitives available in Breeze or NumPy/SciPy. It's literally
> > the same in BIDMat, since the generic matrix operations are implemented on
> > both CPU and GPU, so the same code runs on either platform.
> >
> > A lesser-known fact is that GPUs are around 10x faster for *all* those
> > operations, not just dense BLAS. It's mostly due to faster streaming memory
> > speeds, but some kernels (random number generation and transcendentals) are
> > more than an order of magnitude faster thanks to some specialized hardware
> > for power series on the GPU chip.
> >
> > When you have all this there is no need to move data back and forth across
> > the PCI bus. The CPU only has to pull chunks of data off disk, unpack them,
> > and feed them to the available GPUs. Most models fit comfortably in GPU
> > memory these days (4-12 GB). With minibatch algorithms you can push TBs of
> > data through the GPU this way.