spark-user mailing list archives

From Noorul Islam K M <noo...@noorul.com>
Subject Re: Combining Many RDDs
Date Fri, 27 Mar 2015 03:32:04 GMT
Yang Chen <yang@yang-cs.com> writes:

> Hi Noorul,
>
> Thank you for your suggestion. I tried that, but ran out of memory. I did
> some searching and found suggestions that we should try to avoid
> rdd.union(); see
> http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
> I will try to come up with some other approach.
>

I think you are using rdd.union(), whereas I was referring to
SparkContext.union(). I am not sure how many RDDs you have, but I had
no memory issues when I used it to combine 2000 RDDs. That said, I did
run into other performance issues with the Spark Cassandra connector.
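
To make the distinction concrete, here is a minimal sketch. It assumes a
local SparkContext and uses a hypothetical makeParts helper as a stand-in
for the real per-item computation; none of these names come from the
original job. Chaining rdd.union (or ++) pairwise wraps the previous
result each time, so the lineage grows one level per RDD, whereas
SparkContext.union builds a single union over all inputs in one call.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object UnionSketch {
  // Hypothetical stand-in for compute(_): builds n small RDDs of two items each.
  def makeParts(sc: SparkContext, n: Int): Seq[RDD[Int]] =
    (1 to n).map(i => sc.parallelize(Seq(i, i * 10)))

  // Pairwise chaining: every ++ (rdd.union) nests the previous union.
  def combineChained(parts: Seq[RDD[Int]]): RDD[Int] =
    parts.reduce(_ ++ _)

  // Single call: one union whose parents are all the input RDDs.
  def combineFlat(sc: SparkContext, parts: Seq[RDD[Int]]): RDD[Int] =
    sc.union(parts)

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for trying this out on one machine.
    val conf = new SparkConf().setAppName("union-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val parts = makeParts(sc, 100)
      // Both approaches yield the same elements; only the lineage shape differs.
      assert(combineChained(parts).count() == combineFlat(sc, parts).count())
    } finally {
      sc.stop()
    }
  }
}
```

Both versions produce the same data, so any difference would show up in
driver-side overhead and lineage depth rather than in the result.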

Thanks and Regards
Noorul

>
> On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M <noorul@noorul.com> wrote:
>
>> sparkx <yang@yang-cs.com> writes:
>>
>> > Hi,
>> >
>> > I have a Spark job and a dataset of 0.5 million items. Each item performs
>> > some sort of computation (joining a shared external dataset, if that
>> > matters) and produces an RDD containing 20-500 result items. Now I would
>> > like to combine all these RDDs and run the next job. What I have found is
>> > that the computation itself is quite fast, but combining these RDDs takes
>> > much longer.
>> >
>> >     val result = data        // 0.5M data items
>> >       .map(compute(_))   // Produces an RDD - fast
>> >       .reduce(_ ++ _)      // Combining RDDs - slow
>> >
>> > I have also tried to collect results from compute(_) and use a flatMap,
>> > but that is also slow.
>> >
>> > Is there a way to efficiently do this? I'm thinking about writing this
>> > result to HDFS and reading from disk for the next job, but am not sure if
>> > that's a preferred way in Spark.
>> >
>>
>> Are you looking for SparkContext.union() [1] ?
>>
>> It is not performing well for me with the Spark Cassandra connector, so
>> I am not sure whether it will help you.
>>
>> Thanks and Regards
>> Noorul
>>
>> [1]
>> http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext
>>


