spark-user mailing list archives

From Silvio Fiorito <silvio.fior...@granturing.com>
Subject Re: Reading parquet files in parallel on the cluster
Date Tue, 25 May 2021 20:10:14 GMT
Why not just read from Spark as normal? Do these files have different or incompatible schemas?

val df = spark.read.option("mergeSchema", "true").load(listOfPaths: _*)
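
For example (a minimal sketch, assuming listOfPaths is a Seq[String] holding your 1000+ directory paths; the name and the example paths are placeholders):

// Sketch only: listOfPaths and the example paths are hypothetical.
val listOfPaths: Seq[String] = Seq(
  "s3://my-bucket/data/dir1",
  "s3://my-bucket/data/dir2"
)

// One driver-side read; Spark distributes the actual file scans
// across the executors, so no explicit per-path loop is needed.
val df = spark.read
  .option("mergeSchema", "true")
  .parquet(listOfPaths: _*)   // varargs overload of DataFrameReader.parquet

Spark plans that as a single job over all the directories, which is usually all "reading in parallel" requires.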

From: Eric Beabes <mailinglists19@gmail.com>
Date: Tuesday, May 25, 2021 at 1:24 PM
To: spark-user <user@spark.apache.org>
Subject: Reading parquet files in parallel on the cluster

I have a use case in which I need to read Parquet files in parallel from 1000+ directories.
I am doing something like this:


val df = list.toList.toDF()

df.foreach(c => {
  val config = getConfigs()
  doSomething(spark, config)
})

In the doSomething method, when I try to do this:

val df1 = spark.read.parquet(pathToRead).collect()

I get the NullPointerException shown below. It seems 'spark.read' only works on the driver,
not on the cluster. How can I do what I want to do? Please let me know. Thank you.



21/05/25 17:03:50 WARN TaskSetManager: Lost task 2.0 in stage 8.0 (TID 9, ip-10-0-5-3.us-west-2.compute.internal, executor 11): java.lang.NullPointerException
        at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144)
        at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:142)
        at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:789)
        at org.apache.spark.sql.SparkSession.read(SparkSession.scala:656)
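
The trace shows why: SparkSession.sessionState is lazily initialized on the driver, and the closure passed to df.foreach runs on executors, where the deserialized SparkSession has no usable state, hence the NullPointerException. If the per-directory reads really must stay separate, one common workaround (a sketch under assumptions, not the only approach) is to keep every spark.read on the driver and overlap the job submissions with a Scala parallel collection:

// Sketch only: paths and processEach are hypothetical stand-ins.
// .par is built into Scala 2.12; on Scala 2.13 it needs the
// scala-parallel-collections module and
// import scala.collection.parallel.CollectionConverters._
val paths: Seq[String] = listOfPaths   // the 1000+ directories

paths.par.foreach { path =>
  // Runs in a driver-side thread pool, so the SparkSession is valid here.
  val perDirDf = spark.read.parquet(path)
  processEach(perDirDf)                // hypothetical per-path logic
}

Each iteration submits its own Spark job; the parallel collection only overlaps the submissions, while the file reads themselves still execute on the executors.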

