spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-23541) Allow Kafka source to read data with greater parallelism than the number of topic-partitions
Date Thu, 01 Mar 2018 01:32:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-23541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23541:
------------------------------------

    Assignee: Apache Spark  (was: Tathagata Das)

> Allow Kafka source to read data with greater parallelism than the number of topic-partitions
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23541
>                 URL: https://issues.apache.org/jira/browse/SPARK-23541
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Tathagata Das
>            Assignee: Apache Spark
>            Priority: Major
>
> Currently, when the Kafka source reads from Kafka, it generates as many tasks as there are
> partitions in the topic(s) being read. In some cases, it may be beneficial to read the data
> with greater parallelism, that is, with more partitions/tasks. This means that offset ranges
> must be divided into smaller ranges such that the number of records per partition
> ~= total records in batch / desired partitions. This would also balance out any data skew
> between topic-partitions.
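
The splitting rule in the description can be illustrated with a minimal sketch. This is not Spark's implementation; `OffsetRange` and `split_ranges` are hypothetical names introduced only for this example, and it assumes offsets within a topic-partition are contiguous so that range size equals record count. Each topic-partition's range is cut into a number of pieces proportional to its share of the batch, so every piece holds roughly total records / desired partitions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical offset range for one topic-partition: [start, end)."""
    topic: str
    partition: int
    start: int  # inclusive
    end: int    # exclusive

    @property
    def size(self) -> int:
        return self.end - self.start


def split_ranges(ranges: List[OffsetRange], desired_partitions: int) -> List[OffsetRange]:
    """Split ranges so each piece holds about total / desired_partitions records."""
    total = sum(r.size for r in ranges)
    # Nothing to gain if the batch is empty or no extra parallelism is requested.
    if total == 0 or desired_partitions <= len(ranges):
        return list(ranges)

    target = total / desired_partitions  # approximate records per task
    pieces: List[OffsetRange] = []
    for r in ranges:
        # Larger (skewed) topic-partitions get proportionally more pieces.
        parts = max(1, round(r.size / target))
        for i in range(parts):
            lo = r.start + (r.size * i) // parts
            hi = r.start + (r.size * (i + 1)) // parts
            if hi > lo:  # drop empty pieces
                pieces.append(OffsetRange(r.topic, r.partition, lo, hi))
    return pieces


# A skewed batch: partition 0 holds 900 records, partition 1 holds 100.
skewed = [OffsetRange("events", 0, 0, 900), OffsetRange("events", 1, 0, 100)]
for piece in split_ranges(skewed, desired_partitions=10):
    print(piece)
```

Because the piece count per topic-partition is rounded, the resulting number of tasks only approximates the requested value in general. In the skewed example it works out exactly: partition 0 is cut into nine pieces and partition 1 stays whole.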



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


