spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Boris Litvak <boris.lit...@skf.com>
Subject RE: [Spark Structured Streaming] Processing the data path coming from kafka.
Date Mon, 18 Jan 2021 13:45:46 GMT
Hi Amit,

Why won’t you just map()/mapXXX() the kafkaDf with the mapping function that reads the paths?
Also, do you really have to read the json into an additional dataframe?

Thanks, Boris

From: Amit Joshi <mailtojoshiamit@gmail.com>
Sent: Monday, 18 January 2021 15:04
To: spark-user <user@spark.apache.org>
Subject: [Spark Structured Streaming] Processing the data path coming from kafka.

Hi ,

I have a use case where the file path of the json records stored in s3 are coming as a kafka
message in kafka. I have to process the data using spark structured streaming.

The design which I thought is as follows:
1. In kafka Spark structures streaming, read the message containing the data path.
2. Collect the message record in driver. (Messages are small in sizes)
3. Create the dataframe from the datalocation.


kafkaDf.select($"value".cast(StringType))
  .writeStream.foreachBatch((batchDf:DataFrame, batchId:Long) =>  {

//rough code

//collec to driver

val records = batchDf.collect()

//create dataframe and process
records foreach((rec: Row) =>{
  println("records:######################",rec.toString())
  val path = rec.getAs[String]("data_path")

  val dfToProcess =spark.read.json(path)

  ....

})

}

I would like to know the views, if this approach is fine? Specifically if there is some problem
with

with creating the dataframe after calling collect.

If there is any better approach, please let know the same.



Regards

Amit Joshi
Mime
View raw message