spark-user mailing list archives

From Phillip Henry <londonjava...@gmail.com>
Subject Re: Matrix multiplication and cluster / partition / blocks configuration
Date Tue, 08 Aug 2017 06:56:31 GMT
Hi, John.

I've had similar problems. IIRC, the driver was GCing madly. I don't know
why the driver was doing so much work, but I quickly implemented an
alternative approach. The code I wrote belongs to my client, but I have
since written something that should be equivalent. It can be found at:

https://github.com/PhillHenry/Algernon

It's not terribly complicated and you could roll your own if you prefer
(the rough idea can be found at
http://javaagile.blogspot.co.at/2016/11/an-alternative-approach-to-matrices-in.html).
In any case, I got good performance this way.

Phill


On Thu, May 11, 2017 at 10:12 PM, John Compitello <johnc@broadinstitute.org>
wrote:

> Hey all,
>
> I’ve found myself in a position where I need to do a relatively large
> matrix multiply (at least, compared to what I normally have to do). I’m
> looking to multiply a 100k by 500k dense matrix by its transpose to yield a
> 100k by 100k matrix. I’m trying to do this on Google Cloud, so I don’t have
> any real limits on cluster size or memory. However, I have no idea where to
> begin as far as number of cores / number of partitions / how big to make
> the block size for best performance. Is there a place where Spark users
> collect optimal configurations for methods relative to data input size?
> Does anyone have any suggestions? I’ve tried throwing 900 cores at a 100k
> by 100k matrix multiply with 1000 by 1000 sized blocks, and that seemed to
> hang forever and eventually fail.
>
> Thanks,
>
> John
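
For a sense of scale, here is a rough back-of-envelope sketch of the sizes mentioned in the question above. It is not taken from either message. It assumes 8-byte doubles, fully dense storage with no compression, and a straightforward blocked multiply in which every output block is a sum of block products. The function and variable names are made up for illustration, and how Spark actually schedules and shuffles the work may differ.

```python
# Back-of-envelope sizing for A (100k x 500k, dense) times its transpose,
# split into 1000 x 1000 blocks. Assumes 8-byte doubles and no compression.

BYTES_PER_DOUBLE = 8


def dense_bytes(rows, cols):
    """Bytes needed to hold a dense rows x cols matrix of doubles."""
    return rows * cols * BYTES_PER_DOUBLE


def blocks_along(dim, block_size):
    """Number of blocks needed to cover one dimension (ceiling division)."""
    return -(-dim // block_size)


rows, cols, block = 100_000, 500_000, 1_000

input_bytes = dense_bytes(rows, cols)     # the 100k x 500k input
output_bytes = dense_bytes(rows, rows)    # the 100k x 100k result
block_bytes = dense_bytes(block, block)   # one 1000 x 1000 block

grid_rows = blocks_along(rows, block)     # block rows in A
grid_cols = blocks_along(cols, block)     # block columns in A

input_blocks = grid_rows * grid_cols      # blocks in A
output_blocks = grid_rows * grid_rows     # blocks in A * A^T
# Each output block is a sum over grid_cols pairwise block products.
block_products = output_blocks * grid_cols

print(f"input:  {input_bytes / 1e9:.0f} GB in {input_blocks} blocks")
print(f"output: {output_bytes / 1e9:.0f} GB in {output_blocks} blocks")
print(f"block:  {block_bytes / 1e6:.0f} MB each")
print(f"pairwise block products: {block_products}")
```

Under these assumptions the input alone is about 400 GB, the result about 80 GB, and a naive blocked multiply involves five million block products, so the number of partitions and the amount of data shuffled are likely to matter at least as much as the raw core count.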
