spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Geoff Von Allmen <ge...@ibleducation.com>
Subject Re: select count * doesnt seem to respect update mode in Kafka Structured Streaming?
Date Tue, 20 Mar 2018 13:09:39 GMT
The following
<http://spark.apache.org/docs/latest/configuration.html#spark-streaming>
settings
may be what you’re looking for:

   - spark.streaming.backpressure.enabled
   - spark.streaming.backpressure.initialRate
   - spark.streaming.receiver.maxRate
   - spark.streaming.kafka.maxRatePerPartition

​

On Mon, Mar 19, 2018 at 5:27 PM, kant kodali <kanth909@gmail.com> wrote:

> Yes it indeed makes sense! Is there a way to get incremental counts when I
> start from 0 and go through 10M records? perhaps count for every micro
> batch or something?
>
> On Mon, Mar 19, 2018 at 1:57 PM, Geoff Von Allmen <geoff@ibleducation.com>
> wrote:
>
>> Trigger does not mean report the current solution every 'trigger
>> seconds'. It means it will attempt to fetch new data and process it no
>> faster than trigger seconds intervals.
>>
>> If you're reading from the beginning and you've got 10M entries in kafka,
>> it's likely pulling everything down then processing it completely and
>> giving you an initial output. From here on out, it will check kafka every 1
>> second for new data and process it, showing you only the updated rows. So
>> the initial read will give you the entire output since there is nothing to
>> be 'updating' from. If you add data to kafka now that the streaming job has
>> completed it's first batch (and leave it running), it will then show you
>> the new/updated rows since the last batch every 1 second (assuming it can
>> fetch + process in that time span).
>>
>> If the combined fetch + processing time is > the trigger time, you will
>> notice warnings that it is 'falling behind' (I forget the exact verbiage,
>> but something to the effect of the calculation took XX time and is falling
>> behind). In that case, it will immediately check kafka for new messages and
>> begin processing the next batch (if new messages exist).
>>
>> Hope that makes sense -
>>
>>
>> On Mon, Mar 19, 2018 at 13:36 kant kodali <kanth909@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I have 10 million records in my Kafka and I am just trying to
>>> spark.sql(select count(*) from kafka_view). I am reading from kafka and
>>> writing to kafka.
>>>
>>> My writeStream is set to "update" mode and trigger interval of one
>>> second (Trigger.ProcessingTime(1000)). I expect the counts to be
>>> printed every second but looks like it would print after going through all
>>> 10M. why?
>>>
>>> Also, it seems to take forever whereas Linux wc of 10M rows would take
>>> 30 seconds.
>>>
>>> Thanks!
>>>
>>
>

Mime
View raw message