spark-user mailing list archives

From Silvio Fiorito <>
Subject Re: Reading parquet files in parallel on the cluster
Date Tue, 25 May 2021 20:10:14 GMT
Why not just read from Spark as normal? Do these files have different or incompatible schemas?

val df =“mergeSchema”, “true”).load(listOfPaths)

From: Eric Beabes <>
Date: Tuesday, May 25, 2021 at 1:24 PM
To: spark-user <>
Subject: Reading parquet files in parallel on the cluster

I have a use case in which I need to read Parquet files in parallel from 1000+ directories.
I am doing something like this:

    val df = list.toList.toDF()

    df.foreach(c => {
      val config = getConfigs()
      doSomething(spark, config)
    })

In the doSomething method, when I try to do this:

val df1 =

I get the NullPointerException given below. It seems '' only works on the driver,
not on the cluster. How can I do what I want to do? Please let me know. Thank you.

21/05/25 17:03:50 WARN TaskSetManager: Lost task 2.0 in stage 8.0 (TID 9,,
executor 11): java.lang.NullPointerException

        at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144)

        at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:142)

        at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:789)
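The NPE happens because the `spark` session captured in the `foreach` closure is null on executors: a SparkSession (and hence `` exists only on the driver. A hedged sketch of one driver-side alternative, assuming Scala 2.12 (where `.par` is built in; on 2.13 it needs the scala-parallel-collections module) and the hypothetical `listOfPaths`, `getConfigs`, and `doSomething` names from the post:

```scala
// Sketch only: SparkSession cannot be used inside executor-side closures,
// so iterate the paths on the driver instead of inside df.foreach.
// `listOfPaths`, `getConfigs`, and `doSomething` are the (hypothetical)
// names from the original post.
val paths: Seq[String] = listOfPaths

// .par submits the reads concurrently from the driver; each
// call itself runs as a distributed scan on the cluster.
paths.par.foreach { path =>
  val config = getConfigs()
  val df1 =  // executes on the driver
  doSomething(spark, config)
}
```

If the directories share a compatible schema, passing the whole list to a single `load(paths: _*)` call, as suggested in the reply above, is simpler and lets Spark parallelize the scan itself.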
