Hi, I have used union() before and yes, it can be slow sometimes. I _guess_ your variable 'data' is a Scala collection and compute() returns an RDD. Right? If so, I have tried the approach below to keep the whole computation in a single RDD (yes, I also found that too many RDDs hurt performance).

Change compute() to return a Scala collection instead of an RDD:

    val result = sc.parallelize(data)   // create and partition the 0.5M items as a single RDD
      .flatMap(compute(_))              // still only one RDD, with each item already joined with the external data
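
For compute() itself, it could look something like this (just a rough sketch with made-up toy types; the real version would join each item against your shared external dataset, e.g. through a broadcast variable):

    // toy stand-in for the shared external dataset, shipped to the executors once
    val external = sc.broadcast(Map(1 -> Seq("a", "b"), 2 -> Seq("c")))

    // compute() now returns a plain Scala collection instead of an RDD,
    // so the flatMap above keeps everything inside a single RDD
    def compute(item: Int): Seq[String] =
      external.value.getOrElse(item, Seq.empty)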

Hope this helps.

Kelvin

On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen <yang@yang-cs.com> wrote:
Hi Mark,

That's true, but neither way lets me combine the RDDs, so I have to avoid unions.

Thanks,
Yang

On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra <mark@clearstorydata.com> wrote:
RDD#union is not the same thing as SparkContext#union
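
i.e., roughly (an illustrative sketch; rdds and the toy data here are made up, just to show the two shapes):

    import org.apache.spark.rdd.RDD

    // stand-in for the many small RDDs produced by compute()
    val rdds: Seq[RDD[Int]] = (1 to 1000).map(i => sc.parallelize(Seq(i)))

    // RDD#union applied pairwise (what reduce(_ ++ _) does) builds a deep
    // chain of nested two-way unions, one level per element:
    val chained = rdds.reduce(_ union _)

    // SparkContext#union builds one union over all of them in a single step:
    val flat = sc.union(rdds)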

On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen <yang@yang-cs.com> wrote:
Hi Noorul,

Thank you for your suggestion. I tried that, but ran out of memory. I did some searching and found a few suggestions.
I will try to come up with some other approaches.

Thank you,
Yang

On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M <noorul@noorul.com> wrote:
sparkx <yang@yang-cs.com> writes:

> Hi,
>
> I have a Spark job and a dataset of 0.5 million items. For each item the job
> performs some computation (joining against a shared external dataset, if that
> matters) and produces an RDD containing 20-500 result items. I would then like
> to combine all these RDDs and run the next job on them. What I have found is
> that the computation itself is quite fast, but combining these RDDs takes
> much longer.
>
>     val result = data        // 0.5M data items
>       .map(compute(_))   // Produces an RDD - fast
>       .reduce(_ ++ _)      // Combining RDDs - slow
>
> I have also tried to collect results from compute(_) and use a flatMap, but
> that is also slow.
>
> Is there a way to do this efficiently? I'm thinking about writing the
> result to HDFS and reading it back from disk for the next job, but I am not
> sure whether that's a preferred approach in Spark.
>

Are you looking for SparkContext.union() [1]?

It has not performed well for me with the Spark Cassandra connector, though, so I am
not sure whether it will help you.
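
If it does help, applied to your code it would look roughly like this (an untested sketch, assuming 'data' is a local Scala Seq and compute() returns an RDD, as in your snippet):

    val rdds = data.map(compute(_))   // Scala collection of RDDs, one per item - fast
    val result = sc.union(rdds)       // one union over all of them instead of reduce(_ ++ _)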

Thanks and Regards
Noorul

[1] http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext



--
Yang Chen
Dept. of CISE, University of Florida
Mail: yang@yang-cs.com
Web: www.cise.ufl.edu/~yang




--
Yang Chen
Dept. of CISE, University of Florida
Mail: yang@yang-cs.com
Web: www.cise.ufl.edu/~yang