spark-user mailing list archives

From R Nair <>
Subject Reading CSV and Transforming to Parquet Issue
Date Thu, 02 Sep 2021 16:55:29 GMT

This is very surprising and I am sure I must be doing something wrong. The
issue is that the following code takes 8 hours. It reads a CSV file, takes
the phone number column, extracts the first four digits, and then
partitions on those four digits (phoneseries) and writes to Parquet.
Any clue why? The CSV file is only about one million rows. Thanks in
advance. The Spark version is 3.0.1.

val df1 = spark.read.format("csv").option("header", "true").load(...)  // CSV path elided in the archived message
// The above CSV contains a column named phonenumber which has a very wide
// range of values; the total number of rows in the CSV is just below one million.
// Also, the test was done for just one file.

import org.apache.spark.sql.functions.{col, substring}

val df = df1.withColumn("phoneseries", substring(col("phonenumber"), 1, 4))  // transformation truncated in the archive; reconstructed from the description above
df.printSchema()  // Schema is printed correctly
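The write step described at the top of the message (partitioning on the derived phoneseries column and writing Parquet) did not survive in the archived text; a minimal sketch of what it presumably looked like, with a hypothetical output path:

```scala
// Sketch reconstructed from the prose description above; the output path is hypothetical.
df.write
  .partitionBy("phoneseries")   // one Parquet directory per four-digit prefix
  .parquet("/path/to/output")
```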


