spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel Stojanov <m...@danielstojanov.com>
Subject [Pyspark] - Spark uses all available memory; unrelated to size of dataframe
Date Wed, 08 Apr 2020 12:58:16 GMT
My setup: using Pyspark; Mongodb to retrieve and store final results;
Spark is in standalone cluster mode, on a single desktop. Spark v.2.4.4.
Openjdk 8.

My spark application (using pyspark) uses all available system memory.
This seems to be unrelated to the data being processed. I tested with
32GB and 64GB of RAM. In both cases memory use was about 1-2GB less than
the total system memory.

From searching, other people facing memory issues have large datasets,
but the amount of data I am dealing with is much smaller.

I have a dataframe that gets processed/aggregated into a smaller
dataframe. It begins with a dataframe of about 1 million rows, ~100MB as
a zipped csv file, then aggregates the results into a smaller dataframe
of about 20 columns, 200 rows. The size of this smaller dataframe is
15KB (from Spark's UI page, under storage).

I am using regular pyspark commands (including aggregating over several
window functions) on dataframes. After a series of computations the end
result is a 15KB dataframe. Then a series of Python UDFs and joins to
calculate further results from this 15KB dataframe. This produces a
similarly small dataframe. There is a particularly big spike in memory
usage during these steps, even though the data processed is trivially small.

Using the Spark UI page’s display of storage the RDDs shown use trivial
(each RDD shown only KB in size) amounts of memory.

This is the output of the “executors” tab from the Spark UI, under
Storage Memory: 1.7 MB / 15.5 GB.

The amount of memory that htop shows the JVMs are using is unrelated to
the number of executors * RAM per executor setting in spark-submit.


Why is Spark’s memory usage unrelated to any settings in spark-submit
(standalone cluster mode)?

What is the memory being used for? I can confirm from the Spark UI page
that almost none of that is dataframe data?

If I no longer have a reference to a dataframe in Python does that mean
the corresponding dataframe has been garbage collected from Spark?

Do dataframes in pyspark need to be explicitly garbage collected in some
way?



Mime
View raw message