spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Juan Rodríguez Hortalá <juan.rodriguez.hort...@gmail.com>
Subject Number of partitions in RDD for input DStreams
Date Wed, 12 Nov 2014 11:11:53 GMT
Hi list,

In an excelent blog post on Kafka and Spark Streaming integrartion (
http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/),
Michael Noll poses an assumption about the number of partitions of the RDDs
created by input DStreams. He says his hypothesis is that the number por
partitions per RDD is batchInterval / spark.streaming.blockInterval. I
guess this is based on the following extract from the Spark Streaming
Programming Guide at section"Level of Parallelism of Data Receiving"

"Another parameter that should be considered is the receiver’s blocking
interval. For most receivers, the received data is coalesced together into
large blocks of data before storing inside Spark’s memory. The number of
blocks in each batch determines the number of tasks that will be used to
process those the received data in a map-like transformation. This blocking
interval is determined by the configuration parameter
<https://spark.apache.org/docs/1.1.0/configuration.html>
spark.streaming.blockInterval and the default value is 200 milliseconds."

Could someone confirm whether that hypotesis is true or false? And if it is
false, is there any way to know the number of partitions per RDD for an
input DStream?

Thansk a lot for your help,

Greetings,

Juan Rodriguez

Mime
View raw message