spark-user mailing list archives

From surender kumar <skiit...@yahoo.co.uk.INVALID>
Subject Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working
Date Thu, 12 Apr 2018 10:25:41 GMT
Thanks Matteo, this should work!
-Surender 

    On Thursday, 12 April, 2018, 1:13:38 PM IST, Matteo Cossu <elcossu@gmail.com> wrote:
 
 
 I don't think it's trivial. The naive solution would be a cross join between users and
items, but that can be very expensive. I once ran into a similar problem; here is how I
solved it:
   - create a new RDD of (itemID, index), where the index is a unique integer between 0 and
the number of items
   - for every user, sample n items by randomly generating n distinct integers between 0 and
the number of items (e.g. with random.sample(), or with repeated random.randint() calls and
duplicates discarded), so you have a new RDD of (userID, [sample_items])
   - flatten all the lists in the previously created RDD and join them back with the
(itemID, index) RDD, using the index as the join attribute
You can do the same thing with DataFrames using UDFs.
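
The steps above could be sketched roughly as follows in PySpark. This is only an illustration under assumptions, not the code used in the thread: the helper names `sample_indices` and `sample_items_per_user` are hypothetical, `users_rdd` is assumed to hold plain userIDs, `items_rdd` is assumed to hold the item values, and n is assumed to be no larger than the number of items. The RDD methods used (`zipWithIndex`, `flatMap`, `join`, `groupByKey`, `mapValues`) are standard PySpark RDD operations.

```python
import random


def sample_indices(n_items, n, seed):
    """Return n distinct integers in [0, n_items), reproducibly for a given seed."""
    return random.Random(seed).sample(range(n_items), n)


def sample_items_per_user(users_rdd, items_rdd, n, seed=0):
    """Hypothetical helper: users_rdd holds userIDs, items_rdd holds item values."""
    # Step 1: build (index, item) pairs with a unique index in [0, number of items).
    indexed_items = items_rdd.zipWithIndex().map(lambda pair: (pair[1], pair[0]))
    n_items = indexed_items.count()

    # Steps 2 and 3: draw n distinct indices per user, flattened to (index, userID).
    # Seeding per user keeps the draw reproducible if a task is retried.
    picks = users_rdd.flatMap(
        lambda user: [(i, user) for i in sample_indices(n_items, n, f"{seed}-{user}")]
    )

    # Join on the index, then regroup into (userID, [items]).
    return (
        picks.join(indexed_items)
        .map(lambda kv: (kv[1][0], kv[1][1]))
        .groupByKey()
        .mapValues(list)
    )
```

Given a SparkContext `sc`, a call such as `sample_items_per_user(sc.parallelize(user_ids), sc.parallelize(item_values), n=10)` would yield an RDD of (userID, [items]). Only integer indices are generated per user, so the item list never has to be broadcast; the cost moves to the join, which shuffles n records per user.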
On 11 April 2018 at 23:01, surender kumar <skiitd80@yahoo.co.uk> wrote:

Right, this is what I did when I said I tried to persist it and create an RDD to sample
from. But how do I do this for each user? You have one RDD of users on one hand and an RDD
of items on the other. How do I go from here? Am I missing something trivial?

    On Thursday, 12 April, 2018, 2:10:51 AM IST, Matteo Cossu <elcossu@gmail.com> wrote:
 
 
 Why broadcast this list, then? You should use an RDD or DataFrame. For example, RDD has
a method sample() that returns a random sample from it.
On 11 April 2018 at 22:34, surender kumar <skiitd80@yahoo.co.uk.invalid> wrote:

I'm using PySpark. I have a list of 1 million items (all float values) and 1 million users.
For each user I want to randomly sample some items from the item list. Broadcasting the item
list results in an OutOfMemory error on the driver, even after raising driver memory up to
10G. I tried to persist this array on disk, but I can't figure out how to read it on the
workers.
Any suggestions would be appreciated.

  

  