spark-issues mailing list archives

From "Cody Koeninger (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-15272) DirectKafkaInputDStream doesn't work with window operation
Date Wed, 12 Oct 2016 23:33:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-15272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15570221#comment-15570221 ]

Cody Koeninger edited comment on SPARK-15272 at 10/12/16 11:33 PM:
-------------------------------------------------------------------

Does the 0.10 consumer's handling of preferred locations address this for you? See
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#locationstrategies


was (Author: cody@koeninger.org):
Checking to see if the 0.10 consumer's handling of preferred locations http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#locationstrategies
addresses this.

> DirectKafkaInputDStream doesn't work with window operation
> ----------------------------------------------------------
>
>                 Key: SPARK-15272
>                 URL: https://issues.apache.org/jira/browse/SPARK-15272
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.5.2
>            Reporter: Lubomir Nerad
>
> Using Kafka direct {{DStream}} with simple window operation like:
> {code:java}
> kafkaDStream.window(Durations.milliseconds(10000),
>                     Durations.milliseconds(1000))
>             .print();
> {code}
> with a 1s batch duration either freezes after several seconds or lags terribly (depending
> on the cluster mode).
> This happens when the Kafka brokers are not part of the Spark cluster (they are on different
> nodes). The {{KafkaRDD}} still reports them as preferred locations. This doesn't seem to be
> a problem in non-window scenarios, but with a window it conflicts with the delay scheduling
> algorithm implemented in {{TaskSetManager}}. It either significantly delays (Yarn mode) or
> completely drains (Spark mode) the resource offers with {{TaskLocality.ANY}} that are needed
> to process tasks with these Kafka-broker-aligned preferred locations. When the delay
> scheduling algorithm is switched off ({{spark.locality.wait=0}}), the example works correctly.
> I think that the {{KafkaRDD}} either shouldn't report preferred locations when the brokers
> don't correspond to worker nodes, or should allow the reporting of preferred locations to be
> switched off. It would also be good if the delay scheduling algorithm didn't drain or delay
> offers when tasks have unmatched preferred locations.
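
The reporter's first proposal (only report a broker as a preferred location when an
executor actually runs on that host) can be illustrated with a minimal, self-contained
sketch. This is not Spark code: the class PreferredLocationFilter, the method
preferredLocations and the host names are hypothetical and exist only for illustration.
It assumes that an empty list is read as "no locality preference", so the scheduler has
nothing to wait for.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Spark: it only illustrates the proposal
// from the issue description.
public class PreferredLocationFilter {

    // Returns the broker host as the single preferred location when an executor
    // runs on that host. Otherwise returns an empty list, i.e. no preference.
    public static List<String> preferredLocations(String brokerHost,
                                                  Set<String> executorHosts) {
        if (executorHosts.contains(brokerHost)) {
            return Collections.singletonList(brokerHost);
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // Assumed host names; in the reported setup the brokers are on other nodes.
        Set<String> executorHosts =
                new HashSet<>(Arrays.asList("worker-1", "worker-2"));

        // Broker co-located with an executor: the preference is kept.
        System.out.println(preferredLocations("worker-1", executorHosts));
        // Broker outside the Spark cluster: no preference is reported.
        System.out.println(preferredLocations("kafka-1", executorHosts));
    }
}
```

With brokers on separate nodes, every partition would then carry no preference, which
is the situation the reporter reached indirectly by setting spark.locality.wait=0.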



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

