spark-user mailing list archives

From "Muttineni, Vinay" <>
Subject Better way to use a large data set?
Date Fri, 20 Jun 2014 15:29:04 GMT
Hi All,
I have an 8-million-row, 500-column data set, which is derived by reading a text file and
applying filter and flatMap operations to weed out some anomalies.
Now, I have a process which has to run through all 500 columns, do a couple of map, reduce,
and forEach operations on the data set, and return some statistics as output. I have thought
of the following approaches.
Approach 1:

i)   Read the data set from the text file, do some operations, and get an RDD. Use toArray
or collect on this RDD and broadcast the result.

ii)  Do a flatMap on a range of numbers, with the size of the range equal to the number of
columns.

iii) In each flatMap operation, perform the required operations on the broadcast variable
to derive the stats, and return the array of stats.
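
To make the steps concrete, here is a rough plain-Python sketch of Approach 1. It is not
actual Spark code: a local list stands in for the broadcast variable, a list comprehension
stands in for the flatMap, and the names (rows, broadcast_rows, column_stats) and the
particular stats (min, max, mean) are made up for illustration.

```python
from statistics import mean

# Tiny stand-in for the filtered data set (rows x columns); the real one
# would be about 8 million rows by 500 columns.
rows = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
]

# Step i: in Spark this would be the collected RDD wrapped in a broadcast
# variable; here a plain list plays that role (assumption for illustration).
broadcast_rows = rows


def column_stats(col_index, data):
    """Return a one-element list of hypothetical stats for one column."""
    values = [row[col_index] for row in data]
    return [(col_index, min(values), max(values), mean(values))]


# Steps ii and iii: "flatMap" over the range of column indices, reading the
# broadcast data inside each call and flattening the returned lists.
num_columns = len(broadcast_rows[0])
stats = [s for c in range(num_columns) for s in column_stats(c, broadcast_rows)]
```

Note that in this sketch every column task scans the full data set, which is why the whole
data set has to fit in memory wherever the tasks run.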

Questions about this approach:

1) Is there a preference between toArray and collect?

2) Can I not directly broadcast an RDD instead of first collecting it and then broadcasting
the result? I tried this, but I got a serialization exception.

3) When I use sc.parallelize on the broadcast data set, would it be a problem if there
isn't enough space to store it in memory?

Approach 2:

Instead of reading the text file, doing some operations, and then broadcasting the result,
I was planning to do the read part within each of the 500 steps of the flatMap (assuming I
have 500 columns).

Is this better than Approach 1? In Approach 1, I'd have to read once and broadcast, whilst
here I'd have to read 500 times.

Approach 3:
Do a transpose of the dataset and then flatMap on the transposed matrix.
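
As a rough illustration of what I mean by Approach 3, here is a plain-Python sketch (again
not Spark code). The sample data, the column_stats helper, and the choice of stats are
assumptions made only for this example.

```python
# Tiny stand-in for the row-oriented data set.
rows = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
]

# Transpose: each element of `columns` now holds all values of one column.
columns = [list(col) for col in zip(*rows)]


def column_stats(values):
    """Return hypothetical stats (min, max, mean) for one column's values."""
    return (min(values), max(values), sum(values) / len(values))


# One task per column, each touching only that column's values.
stats = [column_stats(col) for col in columns]
```

Here each column task only needs its own column rather than the whole data set, but the
transpose itself has to move every value once.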

Could someone please point out the best approach from above, or if there's a better way to
solve this?
Thank you for the help!
