spark-user mailing list archives

From Hammad <ham...@flexilogix.com>
Subject Fwd: Spark Streaming - Multiple Spark Contexts (SparkSQL) Performance
Date Sun, 01 Oct 2017 17:55:27 GMT
Hello,

*Background:*

I have a Spark Streaming context:

SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("TransformerStreamPOC");
conf.set("spark.driver.allowMultipleContexts", "true");   // <== this
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(60));


that subscribes to certain Kafka *topics*:

JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
        );

When messages arrive in the queue, I process each batch as follows
(the code section below is repeated in the question):

stream.foreachRDD(rdd -> {
    // process here - the code for the two scenarios below is inserted here
});


*Question starts here:*

Since I need to apply Spark SQL to the events received from the queue, I
create a SparkSession in one of two ways:

*1) One SparkSession per partition (after setting
"spark.driver.allowMultipleContexts" to true), so all events in the
partition are handled by the same SparkSession:*

rdd.foreachPartition(partition -> {
    // one session for every event in this partition
    SparkSession sparkSession = SparkSession
            .builder()
            .appName("Java Spark SQL basic example")
            .config("spark.some.config.option", "some-value")
            .getOrCreate();

    while (partition.hasNext()) {
        ConsumerRecord<String, String> event = partition.next(); // advance, otherwise the loop never ends
        Dataset<Row> df = sparkSession.read().jdbc(MYSQL_CONNECTION_URL,
                "table_name", connectionProperties);
    }
});

*2) One SparkSession per event, so every event in every queue of every
stream gets its own SparkSession:*

rdd.foreachPartition(partition -> {
    while (partition.hasNext()) {
        ConsumerRecord<String, String> event = partition.next(); // advance, otherwise the loop never ends
        // a session is requested for every single event
        SparkSession sparkSession = SparkSession
                .builder()
                .appName("Java Spark SQL basic example")
                .config("spark.some.config.option", "some-value")
                .getOrCreate();

        Dataset<Row> df = sparkSession.read().jdbc(MYSQL_CONNECTION_URL,
                "table_name", connectionProperties);
    }
});
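
As a rough illustration of how the two scenarios differ, the sketch below
counts how many times the builder's getOrCreate() would be invoked for one
batch. It is plain Java with no Spark dependency; the class name
SessionCallCount and the figures (4 partitions, 250 events each) are
assumptions made up for the example. It counts calls only: getOrCreate()
may hand back a session that already exists in the same JVM, so the number
of sessions actually constructed can be lower. The JDBC read sits inside
the loop in both scenarios, so it runs once per event either way.

```java
// Hypothetical helper, not part of Spark: it only counts builder calls.
final class SessionCallCount {

    // Scenario 1: getOrCreate() is invoked once at the top of each partition.
    static long perPartition(int partitions, long eventsPerPartition) {
        return partitions;
    }

    // Scenario 2: getOrCreate() is invoked inside the loop, once per event.
    static long perEvent(int partitions, long eventsPerPartition) {
        return partitions * eventsPerPartition;
    }

    public static void main(String[] args) {
        int partitions = 4;            // assumed number of partitions in one batch
        long eventsPerPartition = 250; // assumed number of events per partition
        System.out.println("scenario 1 calls: " + perPartition(partitions, eventsPerPartition));
        System.out.println("scenario 2 calls: " + perEvent(partitions, eventsPerPartition));
    }
}
```

With these assumed figures, scenario 1 makes 4 calls per batch and
scenario 2 makes 1000.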


Is it good practice to create multiple contexts (let's say 10 or 100)?
How does the number of SparkContexts allowed relate to the number of
worker nodes?
What are the performance considerations for scenario 1 versus scenario 2?

I am asking because I suspect there is more to the performance impact of
the SparkContexts created by a streaming application than I currently
understand. I would really appreciate your help.

Hammad
