spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>
Subject createDirectStream parallelism
Date Thu, 18 Aug 2016 14:22:45 GMT
We are using createDirectStream api  to receive messages from 48
partitioned topic. I am setting up  --num-executors 48  & --executor-cores
1 in spark-submit

All partitions were parallely received  and corresponding RDDs in
foreachRDD were executed in parallel. But when I join a transformed RDD
jsonDF (code below) with another RDD, i could see that they are not
executed  in parallell for each partitions. There were more shuffle read
and writes and no of executors executing were less than the no of
partitions. I mean, no of executors were not equal to no of partitions when
join is executing.

How could I make sure to execute join in all executors. Can anyone provide
help.

kstream.foreachRDD { rdd =>

val  jsonDF = sqlContext.read.json(rdd).toDF
.....
...
val metaDF = ssc.sparkContext.textfile("file").toDF

val join  = jsonDF.join(metaDF)

join.map (....).count

}

Mime
View raw message