spark-user mailing list archives

From Shushant Arora <shushantaror...@gmail.com>
Subject Re: spark streaming with kinesis
Date Mon, 21 Nov 2016 04:59:09 GMT
Hi

Thanks.
I have a question about the Spark Streaming Kinesis consumer. Say I have a batch
interval of 500 ms and the Kinesis stream is partitioned on userid (uniformly
distributed). Since IdleTimeBetweenReadsInMillis is set to 1000 ms, the
Spark receiver nodes will fetch data at 1-second intervals and store
it in the InputDStream.

1. When the worker executors fetch data from the receiver every
500 ms, is it guaranteed that all data for one userid will always go to one
partition, and that partition to one worker only?
2. If not, can I repartition the stream data before processing? If so, how?
JavaDStream has only one repartition method, which takes the number of
partitions and not a partitioner function, so it will repartition the
DStream data randomly.

Thanks
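
On question 2, a minimal sketch of what key-based (hash) partitioning does, in plain Java with no Spark dependency. The class and method names here (KeyPartitionSketch, partitionFor, partitionByUser) are hypothetical and used only for illustration. The rule shown, a non-negative modulo of the key's hashCode, is the one Spark's HashPartitioner applies. In Spark itself, one plausible route (an assumption, not something confirmed in this thread) is to key the stream with mapToPair and then call partitionBy(new HashPartitioner(n)) on each batch RDD inside transformToPair, since repartition(n) alone takes no partitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Spark or the Kinesis library.
public class KeyPartitionSketch {

    // Non-negative modulo of the key's hashCode, so the same key always
    // maps to the same partition index in [0, numPartitions).
    public static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // Groups (userId, payload) records by the partition their userId hashes to.
    public static Map<Integer, List<String[]>> partitionByUser(
            List<String[]> records, int numPartitions) {
        Map<Integer, List<String[]>> partitions = new TreeMap<>();
        for (String[] record : records) {
            int p = partitionFor(record[0], numPartitions);
            partitions.computeIfAbsent(p, k -> new ArrayList<>()).add(record);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Made-up sample records: {userId, event}.
        List<String[]> batch = Arrays.asList(
            new String[] {"user-1", "click"},
            new String[] {"user-2", "view"},
            new String[] {"user-1", "purchase"});

        // Both "user-1" records are printed under the same partition index.
        partitionByUser(batch, 4).forEach((p, recs) -> {
            for (String[] r : recs) {
                System.out.println("partition " + p + ": " + r[0] + " -> " + r[1]);
            }
        });
    }
}
```

Within a single batch this keeps all records for one userid in one partition. Which executor processes that partition is still decided by the scheduler for each batch, so this alone does not pin a userid to one worker across batches.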

On Tue, Nov 15, 2016 at 8:23 AM, Takeshi Yamamuro <linguin.m.s@gmail.com>
wrote:

> It seems it is not a good design to restart workers as often as every
> minute, because their initialization and shutdown take a long time, as you
> said (e.g., interconnection overheads with DynamoDB and graceful shutdown).
>
> Anyway, since this is a question about the AWS Kinesis library,
> you'd better ask the AWS folks in their forum or something.
>
> // maropu
>
>
> On Mon, Nov 14, 2016 at 11:20 PM, Shushant Arora <
> shushantarora09@gmail.com> wrote:
>
>> 1. No, I want to implement a low-level consumer on the Kinesis stream,
>> so I need to stop the worker once it has read the latest sequence number
>> sent by the driver.
>>
>> 2. What is the cost of frequently registering and deregistering a worker
>> node? Is it that when the worker's shutdown is called, it terminates the
>> run method, but the lease coordinator waits for 2 seconds before releasing
>> the lease? So I cannot deregister a worker in less than 2 seconds?
>>
>> Thanks!
>>
>>
>>
>> On Mon, Nov 14, 2016 at 7:36 PM, Takeshi Yamamuro <linguin.m.s@gmail.com>
>> wrote:
>>
>>> Is "aws kinesis get-shard-iterator --shard-iterator-type LATEST" not
>>> enough for your use case?
>>>
>>> On Mon, Nov 14, 2016 at 10:23 PM, Shushant Arora <
>>> shushantarora09@gmail.com> wrote:
>>>
>>>> Thanks!
>>>> Is there a way to get the latest sequence number of all shards of a
>>>> Kinesis stream?
>>>>
>>>>
>>>>
>>>> On Mon, Nov 14, 2016 at 5:43 PM, Takeshi Yamamuro <
>>>> linguin.m.s@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The time interval can be controlled by `IdleTimeBetweenReadsInMillis`
>>>>> in KinesisClientLibConfiguration; however,
>>>>> it is not configurable in the current implementation.
>>>>>
>>>>> The details can be found at:
>>>>> https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L152
>>>>>
>>>>> // maropu
>>>>>
>>>>>
>>>>> On Sun, Nov 13, 2016 at 12:08 AM, Shushant Arora <
>>>>> shushantarora09@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Is spark.streaming.blockInterval for the Kinesis input stream
>>>>>> hardcoded to 1 sec, or is it configurable? I mean the time interval at
>>>>>> which the receiver fetches data from Kinesis.
>>>>>>
>>>>>> If it is hardcoded, the streaming batch interval cannot be less than
>>>>>> spark.streaming.blockInterval, so this should be configurable. Also, is
>>>>>> there any minimum value for the streaming batch interval?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> ---
>>>>> Takeshi Yamamuro
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>
