spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yang Chen <y...@yang-cs.com>
Subject Re: Combining Many RDDs
Date Fri, 27 Mar 2015 15:53:15 GMT
Hi Kelvin,

Thank you. That works for me. I wrote my own joins that produced Scala
collections, instead of using rdd.join.

Regards,
Yang

On Thu, Mar 26, 2015 at 5:51 PM, Kelvin Chu <2dot7kelvin@gmail.com> wrote:

> Hi, I used union() before and yes it may be slow sometimes. I _guess_ your
> variable 'data' is a Scala collection and compute() returns an RDD. Right?
> If yes, I tried the approach below to operate on one RDD only during the
> whole computation (Yes, I also saw that too many RDD hurt performance).
>
> Change compute() to return Scala collection instead of RDD.
>
>     val result = sc.parallelize(data)        // Create and partition the
> 0.5M items in a single RDD.
>       .flatMap(compute(_))   // You still have only one RDD with each item
> joined with external data already
>
> Hope this help.
>
> Kelvin
>
> On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen <yang@yang-cs.com> wrote:
>
>> Hi Mark,
>>
>> That's true, but in neither way can I combine the RDDs, so I have to
>> avoid unions.
>>
>> Thanks,
>> Yang
>>
>> On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra <mark@clearstorydata.com>
>> wrote:
>>
>>> RDD#union is not the same thing as SparkContext#union
>>>
>>> On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen <yang@yang-cs.com> wrote:
>>>
>>>> Hi Noorul,
>>>>
>>>> Thank you for your suggestion. I tried that, but ran out of memory. I
>>>> did some search and found some suggestions
>>>> that we should try to avoid rdd.union(
>>>> http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
>>>> ).
>>>> I will try to come up with some other ways.
>>>>
>>>> Thank you,
>>>> Yang
>>>>
>>>> On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M <noorul@noorul.com>
>>>> wrote:
>>>>
>>>>> sparkx <yang@yang-cs.com> writes:
>>>>>
>>>>> > Hi,
>>>>> >
>>>>> > I have a Spark job and a dataset of 0.5 Million items. Each item
>>>>> performs
>>>>> > some sort of computation (joining a shared external dataset, if
that
>>>>> does
>>>>> > matter) and produces an RDD containing 20-500 result items. Now
I
>>>>> would like
>>>>> > to combine all these RDDs and perform a next job. What I have found
>>>>> out is
>>>>> > that the computation itself is quite fast, but combining these RDDs
>>>>> takes
>>>>> > much longer time.
>>>>> >
>>>>> >     val result = data        // 0.5M data items
>>>>> >       .map(compute(_))   // Produces an RDD - fast
>>>>> >       .reduce(_ ++ _)      // Combining RDDs - slow
>>>>> >
>>>>> > I have also tried to collect results from compute(_) and use a
>>>>> flatMap, but
>>>>> > that is also slow.
>>>>> >
>>>>> > Is there a way to efficiently do this? I'm thinking about writing
>>>>> this
>>>>> > result to HDFS and reading from disk for the next job, but am not
>>>>> sure if
>>>>> > that's a preferred way in Spark.
>>>>> >
>>>>>
>>>>> Are you looking for SparkContext.union() [1] ?
>>>>>
>>>>> This is not performing well with spark cassandra connector. I am not
>>>>> sure whether this will help you.
>>>>>
>>>>> Thanks and Regards
>>>>> Noorul
>>>>>
>>>>> [1]
>>>>> http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Yang Chen
>>>> Dept. of CISE, University of Florida
>>>> Mail: yang@yang-cs.com
>>>> Web: www.cise.ufl.edu/~yang
>>>>
>>>
>>>
>>
>>
>> --
>> Yang Chen
>> Dept. of CISE, University of Florida
>> Mail: yang@yang-cs.com
>> Web: www.cise.ufl.edu/~yang
>>
>
>


-- 
Yang Chen
Dept. of CISE, University of Florida
Mail: yang@yang-cs.com
Web: www.cise.ufl.edu/~yang

Mime
View raw message