spark-user mailing list archives

From Krishna Sankar <ksanka...@gmail.com>
Subject Re: randomSplit instead of a huge map & reduce ?
Date Sat, 21 Feb 2015 08:29:47 GMT
   - Divide and conquer with reduceByKey (as Ashish mentioned, with each pair
   being the key) would work - this looks like a "MapReduce with combiners"
   problem, and reduceByKey does its combining on the map side (as does
   aggregateByKey; both are built on combineByKey). A minimal sketch is below.
   - Could we optimize this further by using combineByKey directly?
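
A minimal sketch of that reduceByKey path in Java (the countPairs name and the
assumption that the input is the JavaRDD<String[]> from the question below are
mine, not code from this thread):

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

import scala.Tuple2;

// Hypothetical helper: count how often each unordered word pair occurs.
static JavaPairRDD<Tuple2<String, String>, Integer> countPairs(JavaRDD<String[]> arrays) {
    // One ((w_i, w_j), 1) record for every unordered pair inside each array.
    JavaPairRDD<Tuple2<String, String>, Integer> ones = arrays.flatMapToPair(words -> {
        List<Tuple2<Tuple2<String, String>, Integer>> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            for (int j = i + 1; j < words.length; j++) {
                out.add(new Tuple2<>(new Tuple2<>(words[i], words[j]), 1));
            }
        }
        return out.iterator(); // Spark 2.x+ signature; on 1.x return the list itself
    });
    // reduceByKey combines on the map side, so roughly the ~20M distinct pairs,
    // not the ~230M raw records, are shuffled.
    return ones.reduceByKey((a, b) -> a + b);
}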

Cheers
<k/>

On Fri, Feb 20, 2015 at 6:39 PM, Ashish Rangole <arangole@gmail.com> wrote:

> Is there a check you can put in place to avoid creating pairs that aren't
> in your set of 20M pairs? Additionally, once you have your arrays converted
> to pairs, you can do aggregateByKey with each pair being the key.
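
A rough sketch of that aggregateByKey variant, assuming the same (wordPair, 1)
records as in the reduceByKey sketch above (names are illustrative only):

// 'ones' is the JavaPairRDD<Tuple2<String, String>, Integer> of (wordPair, 1) records.
JavaPairRDD<Tuple2<String, String>, Integer> counts = ones.aggregateByKey(
    0,                        // zero value for each key
    (acc, one) -> acc + one,  // fold one record's value into the per-partition accumulator
    (a, b) -> a + b);         // merge accumulators from different partitions
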
> On Feb 20, 2015 1:57 PM, "shlomib" <shlomi@summerhq.com> wrote:
>
>> Hi,
>>
>> I am new to Spark and I think I missed something very basic.
>>
>> I have the following use case (I use Java and run Spark locally on my
>> laptop):
>>
>>
>> I have a JavaRDD<String[]>
>>
>> - The RDD contains around 72,000 arrays of strings (String[])
>>
>> - Each array contains 80 words (on average).
>>
>>
>> What I want to do is to convert each array into a new array/list of pairs,
>> for example:
>>
>> Input: String[] words = ['a', 'b', 'c']
>>
>> Output: List[<String, String>] pairs = [('a', 'b'), ('a', 'c'), ('b', 'c')]
>>
>> and then I want to count the number of times each pair appeared, so my
>> final output should be something like:
>>
>> Output: List[<String, String, Integer>] result = [('a', 'b', 3),
>> ('a', 'c', 8), ('b', 'c', 10)]
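
One way to get that triple-shaped output (a sketch, assuming the 'counts' pair
RDD from the sketches above) is to flatten the nested tuples:

import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple3;

// 'counts' holds ((a, b), n) pairs; flatten each into (a, b, n).
JavaRDD<Tuple3<String, String, Integer>> triples =
    counts.map(t -> new Tuple3<>(t._1()._1(), t._1()._2(), t._2()));
// Collect only if the ~20M results really fit in driver memory;
// otherwise keep them distributed or write them out with saveAsTextFile.
List<Tuple3<String, String, Integer>> result = triples.collect();
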
>>
>>
>> The problem:
>>
>> Since each array contains around 80 words, each one produces around 3,200
>> pairs, so after “mapping” my entire RDD I get 3,200 * 72,000 =
>> *230,400,000* pairs to reduce, which requires way too much memory.
>>
>> (I know I have only around *20,000,000* unique pairs!)
>>
>> I already modified my code and used 'mapPartitions' instead of 'map'. It
>> definitely improved the performance, but I still feel I'm doing something
>> completely wrong.
>>
>>
>> I was wondering if this is the right 'Spark way' to solve this kind of
>> problem, or whether I should instead split my original RDD into smaller
>> parts (using randomSplit), iterate over each part, aggregate the results
>> into some result RDD (using 'union'), and move on to the next part.
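
For reference, a rough sketch of the split-and-union loop described above (the
number of splits, the weights, and the countPairs helper from the earlier
sketch are all assumptions):

// Split the input into, say, 8 roughly equal parts and aggregate part by part.
JavaRDD<String[]>[] parts = arrays.randomSplit(
    new double[] {0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125});

JavaPairRDD<Tuple2<String, String>, Integer> total = null;
for (JavaRDD<String[]> part : parts) {
    JavaPairRDD<Tuple2<String, String>, Integer> partial = countPairs(part);
    // Merge this part's counts into the running total.
    total = (total == null)
        ? partial
        : total.union(partial).reduceByKey((a, b) -> a + b);
}

The replies above suggest a single reduceByKey/aggregateByKey pass over the
whole RDD makes this loop unnecessary.
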
>>
>>
>> Can anyone please explain to me which solution is better?
>>
>>
>> Thank you very much,
>>
>> Shlomi.
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/randomSplit-instead-of-a-huge-map-reduce-tp21744.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
