spark-user mailing list archives

From Noorul Islam K M <>
Subject Re: Combining Many RDDs
Date Thu, 26 Mar 2015 17:13:00 GMT
sparkx <> writes:

> Hi,
> I have a Spark job and a dataset of 0.5 million items. Each item performs
> some computation (joining a shared external dataset, if that matters) and
> produces an RDD containing 20-500 result items. I would then like to
> combine all of these RDDs and run a follow-up job. What I have found is
> that the computation itself is quite fast, but combining the RDDs takes
> much longer.
>     val result = data        // 0.5M data items
>       .map(compute(_))       // Produces an RDD - fast
>       .reduce(_ ++ _)        // Combining RDDs - slow
> I have also tried collecting the results from compute(_) and using a
> flatMap, but that is also slow.
> Is there a way to do this efficiently? I am considering writing the
> intermediate result to HDFS and reading it back from disk for the next
> job, but I am not sure whether that is the preferred approach in Spark.

Are you looking for SparkContext.union() [1]?

This does not perform well with the Spark Cassandra connector, though, so
I am not sure whether it will help in your case.
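The intuition behind the suggestion can be sketched with a toy model of RDD
lineage in plain Scala (no Spark required; `Lineage`, `Leaf`, and `Union`
are illustrative names, not Spark's API): pairwise `reduce(_ ++ _)` wraps
each union around the previous one, building a left-deep chain whose depth
grows with the number of RDDs, while a single n-ary union keeps the DAG flat.

```scala
// Toy model of RDD lineage depth (illustrative only, not Spark's API).
sealed trait Lineage { def depth: Int }
case class Leaf(id: Int) extends Lineage { val depth = 1 }
case class Union(parents: Seq[Lineage]) extends Lineage {
  val depth = 1 + parents.map(_.depth).max
}

object LineageDemo {
  val rdds: Seq[Lineage] = (1 to 1000).map(Leaf(_))

  // reduce-style pairwise union: each step nests the previous result,
  // so 1000 inputs yield a chain of depth 1000.
  val pairwise: Lineage = rdds.reduce((a, b) => Union(Seq(a, b)))

  // n-ary union (as SparkContext.union builds): one node over all
  // parents, so the depth stays at 2 regardless of input count.
  val nary: Lineage = Union(rdds)
}
```

In real Spark code the change would be roughly
`val result = sc.union(data.map(compute(_)))` in place of the
`.reduce(_ ++ _)`. Even so, half a million parent RDDs may still strain the
driver, so restructuring `compute` to work per element inside a single RDD
(e.g. via `flatMap`) may be the more scalable design if the computation
allows it.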

Thanks and Regards

