spark-user mailing list archives

From Sonal Goyal <sonalgoy...@gmail.com>
Subject Re: Rdd of Rdds
Date Wed, 22 Oct 2014 20:52:01 GMT
Another approach could be to create an artificial key for each RDD and
convert them to PairRDDs. So your first RDD becomes a
JavaPairRDD<Integer,String> rdd1 with values (1,"1"); (1,"2") and so on.

Your second RDD becomes rdd2 with values (2,"a"); (2,"b"); (2,"c").

You can union the two RDDs, then groupByKey, countByKey etc. and maybe
achieve what you are trying to do. Sorry, this is just a hypothesis, as I am
not entirely sure what you are trying to achieve. Ideally, I would think
hard about whether multiple RDDs are indeed needed, just as Sean pointed out.
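The keyed-union idea above might look like the following sketch (class name
and app name are made up for illustration; counts of 3 follow from the two
three-element lists in the original question):

```java
import java.util.Arrays;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class KeyedUnion {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("keyed-union").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Tag each logical "RDD" with an artificial key, then union them
        // into a single pair RDD instead of nesting RDDs.
        JavaPairRDD<Integer, String> rdd1 = sc
                .parallelize(Arrays.asList("1", "2", "3"))
                .mapToPair(s -> new Tuple2<>(1, s));
        JavaPairRDD<Integer, String> rdd2 = sc
                .parallelize(Arrays.asList("a", "b", "c"))
                .mapToPair(s -> new Tuple2<>(2, s));

        JavaPairRDD<Integer, String> all = rdd1.union(rdd2);

        // countByKey returns the per-key counts to the driver, which is
        // what the foreach over an RDD of RDDs was trying to compute.
        Map<Integer, Long> counts = all.countByKey();
        System.out.println(counts.get(1) + " " + counts.get(2)); // 3 3

        sc.stop();
    }
}
```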

Best Regards,
Sonal
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>



On Wed, Oct 22, 2014 at 8:35 PM, Sean Owen <sowen@cloudera.com> wrote:

> No, there's no such thing as an RDD of RDDs in Spark.
> Here though, why not just operate on an RDD of Lists? or a List of RDDs?
> Usually one of these two is the right approach whenever you feel
> inclined to operate on an RDD of RDDs.
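> A driver-side List of RDDs works because each action (e.g. count()) is
> launched from the driver, where the SparkContext lives; inside an
> executor task there is no SparkContext, which is why nested RDD
> operations fail. A minimal sketch of that suggestion (class name is
> hypothetical):
>
> ```java
> import java.util.ArrayList;
> import java.util.Arrays;
> import java.util.List;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.JavaSparkContext;
>
> public class ListOfRdds {
>     public static void main(String[] args) {
>         SparkConf conf = new SparkConf().setAppName("list-of-rdds").setMaster("local");
>         JavaSparkContext sc = new JavaSparkContext(conf);
>
>         // Keep the collection of RDDs on the driver, not inside another RDD.
>         List<JavaRDD<String>> rdds = new ArrayList<>();
>         rdds.add(sc.parallelize(Arrays.asList("1", "2", "3")));
>         rdds.add(sc.parallelize(Arrays.asList("a", "b", "c")));
>
>         // A plain loop on the driver launches one Spark job per RDD.
>         for (JavaRDD<String> rdd : rdds) {
>             System.out.println(rdd.count()); // 3, then 3
>         }
>
>         sc.stop();
>     }
> }
> ```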
>
> On Wed, Oct 22, 2014 at 3:58 PM, Tomer Benyamini <tomer.ben@gmail.com>
> wrote:
> > Hello,
> >
> > I would like to parallelize my work on multiple RDDs I have. I wanted
> > to know if spark can support a "foreach" on an RDD of RDDs. Here's a
> > java example:
> >
> >     public static void main(String[] args) {
> >         SparkConf sparkConf = new SparkConf().setAppName("testapp");
> >         sparkConf.setMaster("local");
> >         JavaSparkContext sc = new JavaSparkContext(sparkConf);
> >
> >         List<String> list = Arrays.asList(new String[] {"1", "2", "3"});
> >         JavaRDD<String> rdd = sc.parallelize(list);
> >
> >         List<String> list1 = Arrays.asList(new String[] {"a", "b", "c"});
> >         JavaRDD<String> rdd1 = sc.parallelize(list1);
> >
> >         List<JavaRDD<String>> rddList = new ArrayList<JavaRDD<String>>();
> >         rddList.add(rdd);
> >         rddList.add(rdd1);
> >
> >         JavaRDD<JavaRDD<String>> rddOfRdds = sc.parallelize(rddList);
> >         System.out.println(rddOfRdds.count());
> >
> >         rddOfRdds.foreach(new VoidFunction<JavaRDD<String>>() {
> >             @Override
> >             public void call(JavaRDD<String> t) throws Exception {
> >                 System.out.println(t.count());
> >             }
> >         });
> >     }
> >
> > From this code I'm getting a NullPointerException on the internal
> > count() call:
> >
> > Exception in thread "main" org.apache.spark.SparkException: Job
> > aborted due to stage failure: Task 1.0:0 failed 1 times, most recent
> > failure: Exception failure in TID 1 on host localhost:
> > java.lang.NullPointerException
> >
> >         org.apache.spark.rdd.RDD.count(RDD.scala:861)
> >
> >
> >         org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:365)
> >
> >         org.apache.spark.api.java.JavaRDD.count(JavaRDD.scala:29)
> >
> > Help will be appreciated.
> >
> > Thanks,
> > Tomer
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> > For additional commands, e-mail: user-help@spark.apache.org
> >
>
>
