spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From vladio <>
Subject [DataFrame] partitionBy issues
Date Tue, 23 Jun 2015 18:26:11 GMT

I'm running into a strange memory scaling issue when using the partitionBy
feature of DataFrameWriter.

I've generated a table (a CSV file) with 3 columns (A, B and C) and 32*32
different entries, with size on disk of about 20kb. There are 32 distinct
values for column A and 32 distinct values for column B and all these are
combined together (column C will contain a random number for each row - it
doesn't matter) producing a 32*32 elements data set. I've imported this into
Spark and I ran a partitionBy("A", "B") in order to test its performance.
This should create a nested directory structure with 32 folders, each of
them containing another 32 folders. It uses about 10Gb of RAM and it's
running slow. If I increase the number of entries in the table from 32*32 to
128*128, I get Java Heap Space Out Of Memory no matter what value I use for
Heap Space variabile. Is this a known bug?

Scala code:
var df ="com.databricks.spark.csv").option("header",
df.write.partitionBy("A", "B").mode("overwrite").parquet("table.parquet”)

How I ran the Spark shell:
bin/spark-shell --driver-memory 16g --master local[8] --packages

Attached you'll find table.csv which I used.  table.csv

Thank you,

View this message in context:
Sent from the Apache Spark Developers List mailing list archive at

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message