spark-user mailing list archives

From: Yang Chen <y...@yang-cs.com>
Subject: Re: Combining Many RDDs
Date: Thu, 26 Mar 2015 21:37:45 GMT

Hi Mark,

That's true, but neither of them lets me combine the RDDs, so I will have
to avoid unions.

Thanks,
Yang

On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra <mark@clearstorydata.com>
wrote:

> RDD#union is not the same thing as SparkContext#union
>
> On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen <yang@yang-cs.com> wrote:
>
>> Hi Noorul,
>>
>> Thank you for your suggestion. I tried that, but ran out of memory. I
>> searched around and found suggestions that we should try to avoid
>> rdd.union (
>> http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
>> ).
>> I will try to come up with some other approach.
>>
>> Thank you,
>> Yang
>>
>> On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M <noorul@noorul.com>
>> wrote:
>>
>>> sparkx <yang@yang-cs.com> writes:
>>>
>>> > Hi,
>>> >
>>> > I have a Spark job and a dataset of 0.5 million items. Each item performs
>>> > some sort of computation (joining a shared external dataset, if that
>>> > matters) and produces an RDD containing 20-500 result items. Now I would
>>> > like to combine all these RDDs and run a next job on them. What I have
>>> > found is that the computation itself is quite fast, but combining these
>>> > RDDs takes much longer.
>>> >
>>> >     val result = data        // 0.5M data items
>>> >       .map(compute(_))   // Produces an RDD - fast
>>> >       .reduce(_ ++ _)      // Combining RDDs - slow
>>> >
>>> > I have also tried collecting the results from compute(_) and using a
>>> > flatMap, but that is also slow.
>>> >
>>> > Is there an efficient way to do this? I'm thinking about writing the
>>> > results to HDFS and reading them back from disk for the next job, but I am
>>> > not sure whether that is a preferred approach in Spark.
>>> >
>>>
>>> Are you looking for SparkContext.union() [1]?
>>>
>>> It does not perform well with the Spark Cassandra Connector. I am not
>>> sure whether it will help you.
>>>
>>> Thanks and Regards
>>> Noorul
>>>
>>> [1]
>>> http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext
>>>
>>
>>
>>
>> --
>> Yang Chen
>> Dept. of CISE, University of Florida
>> Mail: yang@yang-cs.com
>> Web: www.cise.ufl.edu/~yang
>>
>
>


-- 
Yang Chen
Dept. of CISE, University of Florida
Mail: yang@yang-cs.com
Web: www.cise.ufl.edu/~yang
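
The distinction Mark draws between RDD#union and SparkContext#union can be
sketched roughly as follows. This is an illustration only, not code from the
thread: `UnionSketch`, `run`, and the stand-in `compute` are hypothetical
names, 100 local items replace the 0.5M real ones, and a local Spark master
(`local[2]`) is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object UnionSketch {
  // Hypothetical stand-in for the thread's compute(_): one tiny RDD per item.
  def compute(sc: SparkContext, item: Int): RDD[Int] =
    sc.parallelize(Seq(item, item * 10), numSlices = 1)

  // Returns (chained count, flat count, chained partitions, flat partitions).
  def run(): (Long, Long, Int, Int) = {
    val conf = new SparkConf().setAppName("union-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumption: the items live in a local collection on the driver.
      val rdds: Seq[RDD[Int]] = (1 to 100).map(i => compute(sc, i))

      // RDD#union applied pairwise: every step wraps the previous result
      // in another union, so the lineage grows one level per input RDD.
      val chained: RDD[Int] = rdds.reduce((a, b) => a.union(b))

      // SparkContext#union: a single union over the whole sequence.
      val flat: RDD[Int] = sc.union(rdds)

      (chained.count(), flat.count(),
        chained.partitions.length, flat.partitions.length)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(run())
}
```

Both forms return the same elements and keep one partition per input
partition, so neither reduces the number of tiny partitions that half a
million small RDDs would produce. Calling `coalesce` on the combined RDD is
one option to experiment with, though whether it helps here is untested.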
