I am running a Spark job on a huge dataset. I have allocated 10 r5.16xlarge machines (each has 64 cores and 512 GB of RAM).
The source data is JSON and I need to do some JSON transformations, so I read the files as text and then convert them to a DataFrame:
val ds = spark.read.textFile()  // Dataset[String] of raw JSON lines
val updated_dataset = ds.withColumn(/* applying my transformations */).as[String]
val df = spark.read.json(updated_dataset)  // no explicit schema, so Spark infers it from the data
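To make the pipeline concrete, here is roughly what it looks like end to end. The input path and the regexp_replace call are illustrative placeholders, not my real location or transformation logic:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

val spark = SparkSession.builder().appName("json-transform").getOrCreate()
import spark.implicits._

// Read the raw JSON files as plain text (one JSON document per line).
val ds = spark.read.textFile("s3://my-bucket/input/")  // Dataset[String]

// String-level fix-ups applied to each line before JSON parsing (placeholder logic).
val updated_dataset = ds
  .withColumn("value", regexp_replace(col("value"), "\\\\N", "null"))
  .as[String]

// Parse the cleaned lines as JSON; Spark scans the data to infer the schema.
val df = spark.read.json(updated_dataset)
df.printSchema()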
The source data is large and deeply nested; printSchema shows a lot of nested structs.
In the Spark UI, the json stage runs first; after it completes, no further jobs show up in the UI and the application just hangs. All executors are dead and only the driver is active.