spark-user mailing list archives

From Tathagata Das <tathagata.das1...@gmail.com>
Subject Re: Structured Streaming: multiple sinks
Date Fri, 25 Aug 2017 06:03:36 GMT
Responses inline.

On Thu, Aug 24, 2017 at 7:16 PM, cbowden <cbcwebdev@gmail.com> wrote:

> 1. would it not be more natural to write processed to kafka and sink
> processed from kafka to s3?
>

I am sorry, I don't fully understand this question. Could you please
elaborate further? As in, what is more natural than what?
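
(For readers following the thread: a minimal sketch, with hypothetical
topic names, brokers, and paths, of the relay pattern the question seems
to describe: processed data written to Kafka by one query, and a second,
independent query copying that topic to S3.)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("multi-sink-sketch").getOrCreate()

// Query 1: compute the processed stream and write it to a Kafka topic.
// "raw-events", "processed-events", the broker, and the paths are made up.
val processed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "raw-events")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")  // stand-in for real logic

processed.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "processed-events")
  .option("checkpointLocation", "/tmp/ckpt-kafka")
  .start()

// Query 2: an independent query relays the processed topic to S3.
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "processed-events")
  .load()
  .writeStream
  .format("parquet")
  .option("path", "s3a://bucket/processed/")
  .option("checkpointLocation", "/tmp/ckpt-s3")
  .start()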


> 2a. addBatch is the time Sink#addBatch took as measured by StreamExecution.
>

Yes. This essentially includes the time taken to compute the output and
finish writing the output to the sink.
(To give some context for other readers: this person is referring to the
different time durations reported through StreamingQuery.lastProgress.)
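
(A minimal sketch of pulling these numbers out, assuming `query` is a
running StreamingQuery:)

val progress = query.lastProgress
if (progress != null) {
  // durationMs is a java.util.Map[String, java.lang.Long] with entries
  // such as "addBatch", "getBatch", and "triggerExecution".
  println(s"addBatch took ${progress.durationMs.get("addBatch")} ms")
}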


> 2b. getBatch is the time Source#getBatch took as measured by
> StreamExecution.
>
Yes, it is the time taken by the source to prepare the DataFrame that has
the new data to be processed in the trigger.
Usually this is low, but it's not guaranteed to be, as some sources may
require complicated tracking and bookkeeping to prepare the DataFrame.
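
(For context, this is the method being measured in the v1 Source API,
which is an internal API. A rough sketch of a hypothetical custom source,
with everything except getBatch elided:)

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.{Offset, Source}

// Hypothetical custom source; only getBatch is sketched.
abstract class MySketchSource extends Source {
  // Must return a DataFrame covering the offsets in (start, end]; any
  // tracking needed to assemble that range is what "getBatch" measures.
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = ???
}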


> 3. triggerExecution is effectively end-to-end processing time for the
> micro-batch. Note that all other durations sum closely to
> triggerExecution; there is a little slippage based on bookkeeping
> activities in StreamExecution.
>

Yes. Precisely.
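
(A small sketch of checking that slippage from lastProgress, again
assuming `query` is a running StreamingQuery:)

import scala.collection.JavaConverters._

val p = query.lastProgress
if (p != null) {
  val d = p.durationMs.asScala.toMap.mapValues(_.longValue)
  val total = d.getOrElse("triggerExecution", 0L)
  // Sum every reported duration except triggerExecution itself.
  val rest = (d - "triggerExecution").values.sum
  println(s"triggerExecution=$total ms, other durations=$rest ms, " +
    s"slippage=${total - rest} ms")
}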


