spark-user mailing list archives

From Amit Joshi <>
Subject [Spark-Kafka-Streaming] Verifying the approach for multiple queries
Date Sun, 09 Aug 2020 18:36:53 GMT

I have a scenario where a Kafka topic is written to with different types
of JSON records.
I have to regroup the records by type, fetch the schema for that type,
parse the records, and write them out as parquet.
I have tried Structured Streaming, but the dynamic schema is a constraint.
So I have used DStreams, though I know the approach I have taken may not
be good.
Could anyone please let me know whether this approach will scale, and its
possible pros and cons?
I am collecting the grouped records and then forming a dataframe again
for each grouped record.
createKeyValue -> This creates the key-value pair, with the schema carried in the key

stream.foreachRDD { (rdd, time) =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val result =,y) => x ++ y).collect()
  result.foreach(x => println(x._1)) { x =>
    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._
    import org.apache.spark.sql.functions._
    val df = x._2.toDF("value")
      .select(from_json($"value", x._1._2, Map.empty[String, String]).as("data"))
      //.withColumn("entity", lit("invoice"))
