spark-user mailing list archives

From Bill Jay <bill.jaypeter...@gmail.com>
Subject Re: spark streaming rate limiting from kafka
Date Tue, 22 Jul 2014 17:58:41 GMT
Hi Tobias,

I tried using 10 as numPartitions. The number of executors allocated still
equals the number of DStreams, so it seems the parameter does not spread the
data across many partitions. To do that, it seems we have to call
repartition. If numPartitions did distribute the data to multiple
executors/partitions, I would be able to save the running time incurred
by repartition.

Bill
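
For reference, the even assignment that the quoted reply below calls a
rebalance can be pictured with a small standalone sketch. It assumes a
contiguous, range-style split of partition ids across consumers; the class and
method names are hypothetical, and nothing here is taken from the Kafka or
Spark code. It only illustrates why consumers beyond the number of Kafka
partitions would receive nothing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not Kafka's or Spark's implementation.
final class EvenPartitionAssignment {

    /**
     * Splits partition ids 0..numPartitions-1 into contiguous ranges,
     * one range per consumer. Sizes differ by at most one; the first
     * (numPartitions % numConsumers) consumers get the extra partition.
     */
    static List<List<Integer>> assign(int numPartitions, int numConsumers) {
        if (numPartitions < 0 || numConsumers <= 0) {
            throw new IllegalArgumentException("need numPartitions >= 0 and numConsumers > 0");
        }
        int base = numPartitions / numConsumers;   // partitions every consumer gets
        int extra = numPartitions % numConsumers;  // leftover partitions to hand out
        List<List<Integer>> result = new ArrayList<>();
        int next = 0;                              // next unassigned partition id
        for (int c = 0; c < numConsumers; c++) {
            int count = base + (c < extra ? 1 : 0);
            List<Integer> owned = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                owned.add(next++);
            }
            result.add(owned);
        }
        return result;
    }
}
```

Under this assumed scheme, 10 partitions and 3 consumers give the first
consumer four partitions and the other two three each, while 2 partitions and
4 consumers leave two consumers with an empty list.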




On Mon, Jul 21, 2014 at 6:43 PM, Tobias Pfeiffer <tgp@preferred.jp> wrote:

> Bill,
>
> numPartitions means the number of Spark partitions that the data received
> from Kafka will be split into. It has nothing to do with Kafka partitions,
> as far as I know.
>
> If you create multiple Kafka consumers, it doesn't seem like you can
> specify which consumer will consume which Kafka partitions. Instead, Kafka
> (at least with the interface that is exposed by the Spark Streaming API)
> will do something called rebalance and assign Kafka partitions to consumers
> evenly; you can see this in the client logs.
>
> When using multiple Kafka consumers with auto.offset.reset = true, please
> expect to run into this one:
> https://issues.apache.org/jira/browse/SPARK-2383
>
> Tobias
>
>
> On Tue, Jul 22, 2014 at 3:40 AM, Bill Jay <bill.jaypeterson@gmail.com>
> wrote:
>
>> Hi Tathagata,
>>
>> I am currently creating multiple DStreams to consume from different topics.
>> How can I let each consumer consume from different partitions? I found the
>> following parameters in the Spark API:
>>
>> createStream[K, V, U <: Decoder[_], T <: Decoder[_]](jssc:
>> JavaStreamingContext
>> <https://spark.apache.org/docs/1.0.0/api/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.html>
>> , keyTypeClass: Class[K], valueTypeClass: Class[V], keyDecoderClass:
>> Class[U], valueDecoderClass: Class[T], kafkaParams: Map[String, String],
>> topics: Map[String, Integer], storageLevel: StorageLevel
>> <https://spark.apache.org/docs/1.0.0/api/scala/org/apache/spark/storage/StorageLevel.html>
>> ): JavaPairReceiverInputDStream
>> <https://spark.apache.org/docs/1.0.0/api/scala/org/apache/spark/streaming/api/java/JavaPairReceiverInputDStream.html>
>> [K, V]
>>
>> Create an input stream that pulls messages from a Kafka Broker.
>>
>>
>>
>>
>> The topics parameter is:
>> *Map of (topic_name -> numPartitions) to consume. Each partition is
>> consumed in its own thread*
>>
>> Does numPartitions mean the total number of partitions to consume from
>> topic_name or the index of the partition? How can we specify for each
>> createStream which partition of the Kafka topic to consume? If that is
>> possible, I will get a lot of parallelism from the source of the data. Thanks!
>>
>> Bill
>>
>> On Thu, Jul 17, 2014 at 6:21 PM, Tathagata Das <
>> tathagata.das1565@gmail.com> wrote:
>>
>>> You can create multiple Kafka streams to partition your topics across
>>> them, which will run multiple receivers on multiple executors. This is
>>> covered in the Spark streaming guide.
>>> <http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving>
>>>
>>> And for the purpose of this thread, to answer the original question, we now
>>> have the ability
>>> <https://issues.apache.org/jira/browse/SPARK-1854?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20priority%20DESC>
>>> to limit the receiving rate. It's in the master branch, and will be
>>> available in Spark 1.1. It basically sets a limit at the receiver level
>>> (so it applies to all sources) on the maximum number of records per second
>>> that will be received by the receiver.
>>>
>>> TD
>>>
>>>
>>> On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer <tgp@preferred.jp>
>>> wrote:
>>>
>>>> Bill,
>>>>
>>>> are you saying that, after repartition(400), you have 400 partitions on
>>>> one host and the other hosts receive none of the data?
>>>>
>>>> Tobias
>>>>
>>>>
>>>> On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay <bill.jaypeterson@gmail.com>
>>>> wrote:
>>>>
>>>>> I also have an issue consuming from Kafka. When I consume from Kafka,
>>>>> there is always a single executor working on this job. Even if I use
>>>>> repartition, it seems that there is still a single executor. Does anyone
>>>>> have an idea how to add parallelism to this job?
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 17, 2014 at 2:06 PM, Chen Song <chen.song.82@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Luis and Tobias.
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer <tgp@preferred.jp>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Wed, Jul 2, 2014 at 1:57 AM, Chen Song <chen.song.82@gmail.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> * Is there a way to control how far the Kafka DStream can read on a
>>>>>>>> topic-partition (via offset, for example)? Setting this to a small
>>>>>>>> number would force the DStream to read less data initially.
>>>>>>>>
>>>>>>>
>>>>>>> Please see the post at
>>>>>>>
>>>>>>> http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3CCAPH-c_M2ppurJx-n_TEhh0BVqe_6LA-RVgtRF1K-LWrMMe+1gQ@mail.gmail.com%3E
>>>>>>> Kafka's auto.offset.reset parameter may be what you are looking for.
>>>>>>>
>>>>>>> Tobias
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Chen Song
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
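
The receiver-level cap on records per second described earlier in the thread
can be illustrated with a standalone sketch. It is a simple fixed-window
counter with caller-supplied timestamps, chosen for illustration; that design
is an assumption, not Spark's implementation, and the class and method names
are hypothetical. In Spark 1.1 and later the limit itself is understood to be
configured through the spark.streaming.receiver.maxRate property.

```java
// Hypothetical illustration only: not Spark's rate limiter.
final class ReceiverRateLimiter {
    private static final long WINDOW_MILLIS = 1000L;

    private final long maxRecordsPerSecond;
    private long windowStartMillis;  // start of the current one-second window
    private long recordsInWindow;    // records admitted in the current window

    ReceiverRateLimiter(long maxRecordsPerSecond, long startMillis) {
        if (maxRecordsPerSecond <= 0) {
            throw new IllegalArgumentException("maxRecordsPerSecond must be positive");
        }
        this.maxRecordsPerSecond = maxRecordsPerSecond;
        this.windowStartMillis = startMillis;
        this.recordsInWindow = 0L;
    }

    /**
     * Returns 0 if a record arriving at nowMillis is admitted (and counts it),
     * otherwise the number of milliseconds until the current window ends.
     */
    long millisToWait(long nowMillis) {
        if (nowMillis - windowStartMillis >= WINDOW_MILLIS) {
            // A full second has passed: open a new window.
            windowStartMillis = nowMillis;
            recordsInWindow = 0L;
        }
        if (recordsInWindow < maxRecordsPerSecond) {
            recordsInWindow++;
            return 0L;
        }
        return windowStartMillis + WINDOW_MILLIS - nowMillis;
    }
}
```

A receiver loop built on this sketch would call millisToWait before storing
each record and sleep for the returned duration whenever it is non-zero.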
