spark-issues mailing list archives

From "Michael Armbrust (JIRA)" <>
Subject [jira] [Commented] (SPARK-17813) Maximum data per trigger
Date Fri, 14 Oct 2016 23:04:20 GMT


Michael Armbrust commented on SPARK-17813:

I think it's okay to ignore compacted topics, at least initially.  You would still respect
the "maximum" nature of the configuration, though you would waste some effort scheduling
tasks smaller than the max.

I would probably start simple and just have a global {{maxOffsetsPerTrigger}} that bounds
the total number of records in each batch and is distributed amongst the topic partitions.
Topic partitions that are skewed too small will not have enough offsets available, and we
can spill that unused allowance over to the ones that are skewed large.  We can always add
something more complicated in the future.
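The even split with spillover described above could look like the following sketch (Python here purely for illustration; the function name, the dict-based interface, and the integer arithmetic are all assumptions for this example, not Spark's actual API):

```python
def distribute_even_with_spill(available, max_total):
    """Split max_total evenly across partitions; a partition whose available
    offsets fall short of its even share releases the surplus to the others.

    available: dict mapping topic-partition name -> offsets available to read.
    max_total: global cap on records per batch (maxOffsetsPerTrigger).
    """
    limits = {}
    remaining = max_total
    # Visit the most-starved partitions first so their surplus spills over
    # to the partitions that are skewed large.
    ordered = sorted(available.items(), key=lambda kv: kv[1])
    for i, (tp, avail) in enumerate(ordered):
        share = remaining // (len(ordered) - i)  # even share of what is left
        take = min(avail, share)
        limits[tp] = take
        remaining -= take
    return limits
```

For example, with a cap of 300 over partitions offering 10, 1000, and 1000 offsets, the small partition takes its 10 and the other two split the remaining 290.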

An alternative proposal would be to spread out the max to each partition proportional to the
total number of offsets available when planning.
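The proportional alternative can be sketched the same way (again illustrative Python, not Spark code; names and the simple integer truncation are assumptions of this sketch):

```python
def distribute_proportionally(available, max_total):
    """Give each partition a share of max_total proportional to the number
    of offsets it has available at planning time.

    available: dict mapping topic-partition name -> offsets available to read.
    max_total: global cap on records per batch (maxOffsetsPerTrigger).
    """
    total_avail = sum(available.values())
    if total_avail <= max_total:
        return dict(available)  # everything fits; no limiting needed
    return {tp: max_total * avail // total_avail
            for tp, avail in available.items()}
```

With a cap of 100 over partitions offering 100 and 300 offsets, this yields limits of 25 and 75 respectively.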

Regarding [SPARK-17510], I would make this configuration an option on the DataStreamReader,
so you'd be able to configure it per stream instead of globally.  So, I think we are good.

> Maximum data per trigger
> ------------------------
>                 Key: SPARK-17813
>                 URL:
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Michael Armbrust
> At any given point in a streaming query execution, we process all available data.  This
> maximizes throughput at the cost of latency.  We should add something similar to the
> {{maxFilesPerTrigger}} option available for files.

This message was sent by Atlassian JIRA

