Hi Mark,

That's true, but neither way lets me combine the RDDs, so I have to avoid unions.

Thanks,
Yang

On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra <email@example.com> wrote:

RDD#union is not the same thing as SparkContext#union

On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen <firstname.lastname@example.org> wrote:

Hi Noorul,

Thank you for your suggestion. I tried that, but ran out of memory. I did some searching and found suggestions that we should try to avoid rdd.union (http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark). I will try to come up with some other way.

Thank you,
Yang

On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M <email@example.com> wrote:

sparkx <firstname.lastname@example.org> writes:
> I have a Spark job and a dataset of 0.5 million items. Each item performs
> some sort of computation (joining a shared external dataset, if that
> matters) and produces an RDD containing 20-500 result items. Now I would
> like to combine all these RDDs and perform the next job. What I have found
> is that the computation itself is quite fast, but combining these RDDs
> takes much longer.
> val result = data // 0.5M data items
> .map(compute(_)) // Produces an RDD - fast
> .reduce(_ ++ _) // Combining RDDs - slow
> I have also tried to collect results from compute(_) and use a flatMap, but
> that is also slow.
> Is there a way to do this efficiently? I'm thinking about writing the
> result to HDFS and reading it back from disk for the next job, but I am
> not sure whether that's the preferred way in Spark.
Are you looking for SparkContext.union()?

This does not perform well with the Spark Cassandra connector, though, so I
am not sure whether it will help you.
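
For what it's worth, here is a minimal sketch of the difference (untested;
it assumes `data` is a local Scala collection and compute(_) returns an RDD,
as in your snippet, and that `sc` is your SparkContext):

    // Pairwise RDD#union: each `++` wraps the previous result in a new
    // two-parent UnionRDD, so ~0.5M RDDs build a lineage ~0.5M levels deep.
    val slow = data.map(compute(_)).reduce(_ ++ _)

    // SparkContext#union: a single UnionRDD over all the RDDs at once,
    // with a flat, one-level lineage.
    val fast = sc.union(data.map(compute(_)))

Even with the flat union, scheduling that many small RDDs may still be
expensive, so this may only move the bottleneck.
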
Thanks and Regards