systemml-dev mailing list archives

From: fschue...@posteo.de
Subject: Re: Performance differences between SystemML LibMatrixMult and Breeze with native BLAS
Date: Thu, 01 Dec 2016 01:29:20 GMT
Hi Matthias,

thanks for the clarification as to why the current situation exists.

As I said, I didn't run any serious benchmarks here; these are simple 
comparisons I ran to get a general feeling of where we stand on speed. 
The numbers for Breeze without native BLAS are definitely slower than 
SystemML (about 3-4x), but that is not surprising and also not what I 
wanted to look at. Breeze's pure-Java path is known to be slow ;)

The problem I wanted to address here was actually in the context of DL, 
so dense-dense was my major concern. It's clear that the benefit or 
penalty of native operations heavily depends on other factors. Just out 
of curiosity I also tried a matrix-vector multiply, and there SystemML 
is actually 2x faster than native BLAS. But that is 1 ms vs. 2 ms, which 
might well be within one standard deviation (I didn't compute that, 
though).

Anyway - I didn't want to argue for a major change here. I was 
interested in a more systematic analysis of where we stand compared to 
(a) other low-level linear algebra libraries and (b) DL frameworks. 
Doing that properly would definitely require setting up a more 
"scientific" benchmark suite than my little test here.
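
Just to make the idea concrete, a rough sketch of such a harness (plain 
Scala with Breeze only, warm-up runs discarded and the median reported; 
sizes and repetition counts are arbitrary choices, not a proposal) could 
look like this:

import breeze.linalg.DenseMatrix

object MicroBench {
  def main(args: Array[String]): Unit = {
    val n      = 1000   // matrix size, arbitrary
    val warmup = 10     // warm-up runs discarded (JIT compilation)
    val reps   = 50

    val a = DenseMatrix.rand[Double](n, n)
    val b = DenseMatrix.rand[Double](n, n)

    var sink = 0.0  // consume results so the JIT cannot eliminate the work
    (1 to warmup).foreach(_ => sink += (a * b)(0, 0))

    val times = (1 to reps).map { _ =>
      val t0 = System.nanoTime()
      val c  = a * b
      val ms = (System.nanoTime() - t0) / 1e6
      sink += c(0, 0)
      ms
    }.sorted

    println(f"median ${times(reps / 2)}%.2f ms, min ${times.head}%.2f ms, " +
      f"max ${times.last}%.2f ms (sink=$sink%.1f)")
  }
}

The same loop could then wrap the LibMatrixMult and netlib-java calls so 
that all backends are measured under identical conditions.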

Felix

On 01.12.2016 01:00, Matthias Boehm wrote:
> ok, then let's sort this out one by one
> 
> 1) Benchmarks: There are a couple of things we should be aware of for
> these native/Java benchmarks. First, please set k to the number of
> logical cores on your machine and use a sufficiently large heap with
> Xms=Xmx and Xmn=0.1*Xmx. Second, exclude the initial warm-up runs for
> JIT compilation, as well as outliers where GC happened, from these
> measurements.
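> 
> As a rough sketch (assuming the multi-threaded
> LibMatrixMult.matrixMult(m1, m2, ret, k) overload,
> MatrixBlock.quickSetValue for populating dense blocks, and JVM flags
> along the lines of -Xms8g -Xmx8g -Xmn800m), a single measured call
> could look like this:
> 
> import org.apache.sysml.runtime.matrix.data.{LibMatrixMult, MatrixBlock}
> import scala.util.Random
> 
> // run with e.g. -Xms8g -Xmx8g -Xmn800m (Xms=Xmx, Xmn = 0.1*Xmx)
> val k = Runtime.getRuntime.availableProcessors()  // logical cores
> 
> val n  = 1000
> val m1 = new MatrixBlock(n, n, false)  // dense
> val m2 = new MatrixBlock(n, n, false)
> for (i <- 0 until n; j <- 0 until n) {
>   m1.quickSetValue(i, j, Random.nextDouble())
>   m2.quickSetValue(i, j, Random.nextDouble())
> }
> 
> val ret = new MatrixBlock(n, n, false)
> LibMatrixMult.matrixMult(m1, m2, ret, k)  // multi-threaded with k threads
> // repeat, drop the first runs (JIT warm-up) and GC outliers, then aggregate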
> 
> 2) Breeze Comparison: Please also get the Breeze numbers without
> native BLAS libraries, as another baseline on a comparable runtime
> platform.
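> 
> As a quick sanity check (assuming Breeze's default netlib-java
> backend), one can print which BLAS implementation was actually picked
> up before running the measurements:
> 
> import com.github.fommil.netlib.BLAS
> 
> // NativeSystemBLAS / NativeRefBLAS -> a native library was loaded,
> // F2jBLAS -> the pure-Java fallback is in use.
> println(BLAS.getInstance().getClass.getName)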
> 
> 3) Bigger Picture: Just to clarify the overall question here - of
> course native BLAS libraries are expected to be faster for square (or
> similar) dense matrix multiply, as current JDKs usually compile only
> scalar but no packed SIMD instructions for these operations. How much
> faster depends on the architecture. On older architectures with 128bit
> and 256bit vector units, it was not too problematic. But the trend
> toward wider vector units continues, and hence it is worth thinking
> about this if nothing happens on the JDK front. The reasons why we
> decided for platform independence in the past were as follows:
> 
> (a) Square dense matrix multiply is not a common operation (other
> than in DL). Much more common are memory-bandwidth-bound matrix-vector
> multiplications, and there copying the data out to a native library
> actually leads to a 3x slowdown.
> (b) In end-to-end algorithms, especially in large-scale scenarios, we
> often see other factors dominating performance.
> (c) Keeping the build and deployment simple, without a dependency on
> native libraries, was the logical conclusion given (a) and (b).
> (d) There are also workarounds: a user can always define an external
> function (we did this in the past with certain LAPACK functions) and
> call whatever library she wants from there.
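> 
> For illustration, such an external function could internally hand the
> dense data to native BLAS through netlib-java (just a sketch of the
> general idea, not SystemML's external function API):
> 
> import com.github.fommil.netlib.BLAS
> 
> // C = A * B for column-major double[] arrays: A is m x k, B is k x n.
> def nativeDgemm(a: Array[Double], b: Array[Double],
>                 m: Int, n: Int, k: Int): Array[Double] = {
>   val c = new Array[Double](m * n)
>   BLAS.getInstance().dgemm("N", "N", m, n, k, 1.0, a, m, b, k, 0.0, c, m)
>   c
> }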
> 
> 
> Regards,
> Matthias
> 
> On 12/1/2016 12:27 AM, fschueler@posteo.de wrote:
>> This is the printout from 50 iterations with timings decommented:
>> 
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 465.897145
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 389.913848
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 426.539142
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 391.878792
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 349.830464
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 284.751495
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 337.790165
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 363.655144
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 334.348717
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 745.822571
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 1257.83537
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 313.253455
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 268.226473
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 252.079117
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 254.162898
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 257.962804
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 279.462628
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.553724
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 269.316559
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 245.755306
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 266.528604
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.022494
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 269.964251
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 246.011221
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 309.174575
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 254.311429
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 262.97415
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 256.096419
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.975642
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 262.577342
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 287.840992
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.495411
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 253.541925
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 293.485217
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 266.114958
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 260.231448
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 260.012622
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 267.912608
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 264.265422
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 276.937746
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 261.649393
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 245.334056
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 258.506884
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 243.960491
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 251.801208
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 271.235477
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 275.290229
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 251.290325
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 265.851277
>> MM k=8 (false,1000,1000,1000000)x(false,1000,1000,1000000) in 240.902494
>> 
>> On 01.12.2016 00:08, Matthias Boehm wrote:
>>> Could you please make sure you're comparing the right thing? Even on
>>> old Sandy Bridge CPUs our matrix mult for 1kx1k usually takes
>>> 40-50ms. We also did the same experiments with larger matrices, and
>>> SystemML was about 2x faster compared to Breeze. Please decomment the
>>> timings in LibMatrixMult.matrixMult and double-check the timing, as
>>> well as that we're actually comparing dense matrix multiply.
>>> 
>>> Regards,
>>> Matthias
>>> 
>>> On 11/30/2016 11:54 PM, fschueler@posteo.de wrote:
>>>> Hi all,
>>>> 
>>>> I have run a very quick comparison between SystemML's LibMatrixMult
>>>> and Breeze matrix multiplication using native BLAS (OpenBLAS through
>>>> netlib-java). In my very small comparison I see a performance
>>>> difference for dense-dense matrices of size 1000 x 1000 (our default
>>>> blocksize), with Breeze being about 5-6 times faster here. The code
>>>> I used can be found here:
>>>> https://github.com/fschueler/incubator-systemml/blob/model_types/src/test/scala/org/apache/sysml/api/linalg/layout/local/SystemMLLocalBackendTest.scala
>>>>
>>>> Running this code with 50 iterations each gives me, for example,
>>>> average times of:
>>>> Breeze:     49.74 ms
>>>> SystemML:  363.44 ms
>>>> 
>>>> I don't want to say this is true for every operation, but those
>>>> results let us form the hypothesis that native BLAS operations can
>>>> lead to a significant speedup for certain operations, which is worth
>>>> testing with more advanced benchmarks.
>>>> 
>>>> Btw: I am definitely not saying we should use Breeze here. I am more
>>>> looking at native BLAS and LAPACK implementations in general (as
>>>> provided by OpenBLAS, MKL, etc.).
>>>> 
>>>> Let me know what you think!
>>>> Felix
>>>> 
>> 
