spark-user mailing list archives

From Richard Qiao <richardqiao2...@gmail.com>
Subject Re: Apache Spark - Structured Streaming Query Status - field descriptions
Date Sun, 11 Feb 2018 10:17:21 GMT
I can't find a good source of documentation, but the source code of “org.apache.spark.sql.execution.streaming.ProgressReporter”
is helpful for answering some of them.

For example:
  inputRowsPerSecond = numRecords / inputTimeSec,
  processedRowsPerSecond = numRecords / processingTimeSec
This explains why the two rows-per-second values differ.
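
To make this concrete, here is a rough, self-contained Scala sketch (my own approximation, not the actual ProgressReporter code) that reproduces the two rates from the progress JSON quoted below, assuming roughly 60 seconds elapsed since the previous trigger started:

  object RateSketch {
    def main(args: Array[String]): Unit = {
      val numRecords = 5780.0
      // Assumed: ~60 s since the previous trigger started (the trigger interval)
      val inputTimeSec = 60.0
      // ~9.898 s spent processing this batch ("triggerExecution" : 9898 ms)
      val processingTimeSec = 9.898
      println(numRecords / inputTimeSec)       // ~96.3  -> inputRowsPerSecond
      println(numRecords / processingTimeSec)  // ~584.0 -> processedRowsPerSecond
    }
  }

In other words, inputRowsPerSecond is measured against the wall-clock time between triggers, while processedRowsPerSecond is measured against the time actually spent processing the batch, so the latter is normally the larger of the two as long as the job keeps up.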

> On Feb 10, 2018, at 8:42 PM, M Singh <mans2singh@yahoo.com.INVALID> wrote:
> 
> Hi:
> 
> I am working with Spark 2.2.0 and am looking at the query status console output.
> 
> My application reads from Kafka, performs flatMapGroupsWithState, and then aggregates the elements for two group counts. The output is sent to the console sink. I see the following output (with my questions inline).
> 
> Please let me know where I can find a detailed description of the query status fields for Spark Structured Streaming?
> 
> 
> StreamExecution: Streaming query made progress: {
>   "id" : "8eff62a9-81a8-4142-b332-3e5ec63e06a2",
>   "runId" : "21778fbb-406c-4c65-bdef-d9d2c24698ce",
>   "name" : null,
>   "timestamp" : "2018-02-11T01:18:00.005Z",
>   "numInputRows" : 5780,
>   "inputRowsPerSecond" : 96.32851690748795,            
>   "processedRowsPerSecond" : 583.9563548191554,   // Why is the number of processedRowsPerSecond
greater than inputRowsPerSecond ? Does this include shuffling/grouping ?
>   "durationMs" : {
>     "addBatch" : 9765,                                                    // Is the time
taken to get send output to all console output streams ? 
>     "getBatch" : 3,                                                           // Is this
time taken to get the batch from Kafka ?
>     "getOffset" : 3,                                                           // Is
this time for getting offset from Kafka ?
>     "queryPlanning" : 89,                                                 // The value
of this field changes with different triggers but the query is not changing so why does this
change ?
>     "triggerExecution" : 9898,                                         // Is this total
time for this trigger ?
>     "walCommit" : 35                                                     // Is this for
checkpointing ?
>   },
>   "stateOperators" : [ {                                                   // What are
the two state operators ? I am assuming one is flatMapWthState (first one).
>     "numRowsTotal" : 8,
>     "numRowsUpdated" : 1
>   }, {
>     "numRowsTotal" : 6,                                                //Is this the
group by state operator ?  If so, I have two group by so why do I see only one ?
>     "numRowsUpdated" : 6
>   } ],
>   "sources" : [ {
>     "description" : "KafkaSource[Subscribe[xyz]]",
>     "startOffset" : {
>       "xyz" : {
>         "2" : 9183,
>         "1" : 9184,
>         "3" : 9184,
>         "0" : 9183
>       }
>     },
>     "endOffset" : {
>       "xyz" : {
>         "2" : 10628,
>         "1" : 10629,
>         "3" : 10629,
>         "0" : 10628
>       }
>     },
>     "numInputRows" : 5780,
>     "inputRowsPerSecond" : 96.32851690748795,
>     "processedRowsPerSecond" : 583.9563548191554
>   } ],
>   "sink" : {
>     "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@15fc109c"
>   }
> }
> 
> 
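
For reference, the same fields shown in the JSON above are also available programmatically on the query handle, via StreamingQuery.lastProgress / recentProgress, which return org.apache.spark.sql.streaming.StreamingQueryProgress objects. A minimal sketch, assuming a local run with the built-in "rate" source standing in for Kafka:

  import org.apache.spark.sql.SparkSession

  object ProgressSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .master("local[2]")
        .appName("progress-sketch")
        .getOrCreate()

      // Toy query: the "rate" source stands in for Kafka, console is the sink.
      val query = spark.readStream
        .format("rate")
        .load()
        .writeStream
        .format("console")
        .start()

      Thread.sleep(15000)            // let a few triggers complete

      val p = query.lastProgress     // StreamingQueryProgress; may be null very early on
      if (p != null) {
        println(p.inputRowsPerSecond)
        println(p.processedRowsPerSecond)
        println(p.durationMs)        // addBatch, getBatch, getOffset, queryPlanning, ...
        println(p.prettyJson)        // the full JSON, same shape as in the log above
      }
      query.stop()
      spark.stop()
    }
  }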

