spark-user mailing list archives

From Tathagata Das <tathagata.das1...@gmail.com>
Subject Re: Structured Streaming: multiple sinks
Date Fri, 25 Aug 2017 08:59:56 GMT
My apologies, Chris. Somehow I did not receive the OP's first email, and
hence the questions we were answering seemed cryptic to me. :/
I found the full thread on Nabble. I agree with your analysis of the OP's
question 1.


On Fri, Aug 25, 2017 at 12:48 AM, Chris Bowden <cbcwebdev@gmail.com> wrote:

> Tathagata, thanks for filling in context for other readers on 2a and 2b; I
> summarized too much in hindsight.
>
> Regarding the OP's first question, I was hinting that it is quite natural to
> chain processes via kafka. If you are already interested in writing processed
> data to kafka, why add complexity to a job by having it commit processed data
> to both kafka and s3, versus simply moving the processed data from kafka out
> to s3 as needed? Perhaps the OP's question got lost in context based on how I
> responded.
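>
> To make that concrete, a minimal sketch in Scala of the downstream leg of
> such a chain, i.e. a separate query that moves already-processed data from
> kafka out to s3. This assumes an existing SparkSession named `spark`; the
> topic name, broker address, and s3 paths are hypothetical:
>
>   // Downstream job: read the processed topic from kafka and land it on s3,
>   // decoupled from the job that produced the processed data.
>   val fromKafka = spark.readStream
>     .format("kafka")
>     .option("kafka.bootstrap.servers", "broker:9092")
>     .option("subscribe", "processed-topic")
>     .load()
>     .selectExpr("CAST(value AS STRING) AS value")
>
>   fromKafka.writeStream
>     .format("parquet")
>     .option("path", "s3a://my-bucket/processed/")
>     .option("checkpointLocation", "s3a://my-bucket/checkpoints/to-s3/")
>     .start()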
>
> 1) We are consuming from kafka using structured streaming and writing the
> processed data set to s3.
> We also want to write the processed data to kafka moving forward; is it
> possible to do it from the same streaming query? (Spark version 2.1.1)
>
> Streaming queries are currently bound to a single sink, so multiplexing the
> write across existing sinks from the *same* streaming query isn't possible
> AFAIK. Arguably you can reuse the "processed data" DAG by starting multiple
> sinks against it, though you will effectively process the data twice on
> different "schedules", since each sink will have its own instance of
> StreamExecution, TriggerExecutor, etc. If you *really* wanted to do one pass
> over the data and process the exact same block of data per micro-batch, you
> could implement it via foreach or a custom sink that writes to both kafka and
> s3, but I wouldn't recommend it. As stated above, it is quite natural to
> chain processes via kafka.
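>
> To make the "multiple sinks against the same DAG" option concrete, a minimal
> sketch in Scala. The `processed` DataFrame, topic names, broker address, and
> paths are hypothetical, and the built-in kafka sink used here requires Spark
> 2.2+, so on 2.1.1 the kafka leg would need a ForeachWriter instead:
>
>   // Reusing the same "processed data" DAG with two sinks. Each start()
>   // creates its own StreamExecution/TriggerExecutor, so the source is read
>   // and the transformation is recomputed independently per query.
>   val processed = spark.readStream
>     .format("kafka")
>     .option("kafka.bootstrap.servers", "broker:9092")
>     .option("subscribe", "input-topic")
>     .load()
>     .selectExpr("CAST(value AS STRING) AS value")   // stand-in transformation
>
>   val toS3 = processed.writeStream
>     .format("parquet")
>     .option("path", "s3a://my-bucket/out/")
>     .option("checkpointLocation", "s3a://my-bucket/checkpoints/s3/")
>     .start()
>
>   val toKafka = processed.writeStream
>     .format("kafka")
>     .option("kafka.bootstrap.servers", "broker:9092")
>     .option("topic", "output-topic")
>     .option("checkpointLocation", "s3a://my-bucket/checkpoints/kafka/")
>     .start()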
>
> On Thu, Aug 24, 2017 at 11:03 PM, Tathagata Das <
> tathagata.das1565@gmail.com> wrote:
>
>> Responses inline.
>>
>> On Thu, Aug 24, 2017 at 7:16 PM, cbowden <cbcwebdev@gmail.com> wrote:
>>
>>> 1. Would it not be more natural to write processed data to kafka and sink
>>> the processed data from kafka to s3?
>>>
>>
>> I am sorry, I don't fully understand this question. Could you please
>> elaborate further, as in, what is more natural than what?
>>
>>
>>> 2a. addBatch is the time Sink#addBatch took as measured by
>>> StreamExecution.
>>>
>>
>> Yes. This essentially includes the time taken to compute the output and
>> finish writing the output to the sink.
>> (To give some context for other readers: this person is referring to the
>> different time durations reported through StreamingQuery.lastProgress.)
>>
>>
>>> 2b. getBatch is the time Source#getBatch took as measured by
>>> StreamExecution.
>>>
>> Yes, it is the time taken by the source to prepare the DataFrame that has
>> the new data to be processed in the trigger.
>> Usually this is low, but it's not guaranteed to be, as some sources may
>> require complicated tracking and bookkeeping to prepare the DataFrame.
>>
>>
>>> 3. triggerExecution is effectively the end-to-end processing time for the
>>> micro-batch; note that all the other durations sum closely to
>>> triggerExecution, with a little slippage from bookkeeping activities in
>>> StreamExecution.
>>>
>>
>> Yes. Precisely.
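>>
>> For other readers, a minimal sketch in Scala (assuming a running
>> StreamingQuery named `query`) of where these durations surface:
>>
>>   // durationMs is a java.util.Map[String, java.lang.Long] of per-trigger
>>   // timings in milliseconds; getBatch and addBatch are the source/sink
>>   // times discussed above, and triggerExecution is the end-to-end time
>>   // for the micro-batch.
>>   val p = query.lastProgress
>>   println(p.durationMs.get("getBatch"))
>>   println(p.durationMs.get("addBatch"))
>>   println(p.durationMs.get("triggerExecution"))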
>>
>>
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-multiple-sinks-tp29056p29105.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>
>
