spark-user mailing list archives

From Chris Bowden <cbcweb...@gmail.com>
Subject Re: Structured Streaming: multiple sinks
Date Fri, 25 Aug 2017 07:48:22 GMT
Tathagata, thanks for filling in context for other readers on 2a and 2b; in
hindsight, I summarized too much.

Regarding the OP's first question, I was hinting that it is quite natural to
chain processes via kafka. If you are already interested in writing processed
data to kafka, why add complexity to a job by having it commit processed data
to both kafka and s3, rather than simply moving the processed data from kafka
out to s3 as needed? Perhaps the OP's question got lost based on how I
responded.
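
Concretely, the chaining I had in mind looks roughly like the sketch below.
This is only a sketch: the SparkSession (spark), broker address, topic names
and s3 paths are hypothetical, and the built-in kafka sink assumes Spark
2.2.0+.

    // Job 1: consume raw data, process it, and publish the result to kafka.
    val processed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "raw-events")
      .load()
      .selectExpr("CAST(value AS STRING) AS value") // plus whatever processing you need

    processed.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "processed-events")
      .option("checkpointLocation", "s3a://bucket/checkpoints/to-kafka")
      .start()

    // Job 2 (a separate query/job): move the processed topic out to s3 as needed.
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "processed-events")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("parquet")
      .option("path", "s3a://bucket/processed/")
      .option("checkpointLocation", "s3a://bucket/checkpoints/to-s3")
      .start()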

1) We are consuming from kafka using structured streaming and writing the
processed data set to s3.
We also want to write the processed data to kafka moving forward; is it
possible to do it from the same streaming query? (spark version 2.1.1)

Streaming queries are currently bound to a single sink, so multiplexing the
write with existing sinks via the *same* streaming query isn't possible
AFAIK. Arguably you can reuse the "processed data" DAG by starting multiple
sinks against it, though you will effectively process the data twice on
different "schedules", since each sink will have its own instance of
StreamExecution, TriggerExecutor, etc. If you *really* wanted to make a
single pass over the data and process the exact same block of data per
micro-batch, you could implement it via foreach or a custom sink that writes
to both kafka and s3, but I wouldn't recommend it. As stated above, it is
quite natural to chain processes via kafka.
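
To make the multiple-sinks option above concrete, here is a rough sketch of
starting two sinks against the same "processed data" DAG (same hypothetical
names and SparkSession as before). Note that each start() gets its own
StreamExecution and checkpoint, so the source is read and the data processed
once per sink:

    // The shared "processed data" DAG; nothing runs until a sink is started.
    val processed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "raw-events")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")

    // Sink 1: kafka, with its own StreamExecution, trigger and checkpoint.
    processed.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "processed-events")
      .option("checkpointLocation", "s3a://bucket/checkpoints/kafka-sink")
      .start()

    // Sink 2: s3, again with its own StreamExecution, so the DAG above is
    // planned and executed a second time on its own schedule.
    processed.writeStream
      .format("parquet")
      .option("path", "s3a://bucket/processed/")
      .option("checkpointLocation", "s3a://bucket/checkpoints/s3-sink")
      .start()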

On Thu, Aug 24, 2017 at 11:03 PM, Tathagata Das
<tathagata.das1565@gmail.com> wrote:

> Responses inline.
>
> On Thu, Aug 24, 2017 at 7:16 PM, cbowden <cbcwebdev@gmail.com> wrote:
>
>> 1. would it not be more natural to write processed data to kafka and sink
>> the processed data from kafka to s3?
>>
>
> I am sorry, I don't fully understand this question. Could you please
> elaborate further, as in, what is more natural than what?
>
>
>> 2a. addBatch is the time Sink#addBatch took as measured by
>> StreamExecution.
>>
>
> Yes. This essentially includes the time taken to compute the output and
> finish writing the output to the sink.
> (To give some context for other readers: this person is referring to the
> different time durations reported through StreamingQuery.lastProgress.)
>
>
>> 2b. getBatch is the time Source#getBatch took as measured by
>> StreamExecution.
>>
> Yes, it is the time taken by the source to prepare the DataFrame that has
> the new data to be processed in the trigger.
> Usually this is low, but it's not guaranteed to be, as some sources may
> require complicated tracking and bookkeeping to prepare the DataFrame.
>
>
>> 3. triggerExecution is effectively the end-to-end processing time for the
>> micro-batch; note that all other durations sum closely to triggerExecution,
>> with a little slippage from book-keeping activities in StreamExecution.
>>
>
> Yes. Precisely.
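
For other readers, here is a small sketch of how those durations can be read
off a running query. It assumes nothing beyond a StreamingQuery handle (for
example, one returned by start() above):

    import org.apache.spark.sql.streaming.StreamingQuery

    // Print the per-trigger timings (milliseconds) discussed above.
    def reportTimings(query: StreamingQuery): Unit = {
      val p = query.lastProgress // null until the first trigger completes
      if (p != null) {
        println(s"getBatch:         ${p.durationMs.get("getBatch")} ms")
        println(s"addBatch:         ${p.durationMs.get("addBatch")} ms")
        println(s"triggerExecution: ${p.durationMs.get("triggerExecution")} ms")
      }
    }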
