spark-user mailing list archives

From grp <gpete...@villanova.edu>
Subject Re: [EXTERNAL] Re: Conflicting PySpark Storage Level Defaults?
Date Mon, 16 Sep 2019 22:21:39 GMT
Running a simple test - here is the Stack Overflow code snippet, using .count() as the action.  You can see the differences between the storage levels.

print(spark.version)
2.4.3

# id 3 => using default storage level for df (MEMORY_AND_DISK);
# unsure why storage level is not serialized since I am using PySpark
df = spark.range(10)
print(type(df))
df.cache().count()
print(df.storageLevel)

# id 15 => using default storage level for rdd (MEMORY_ONLY);
# makes sense why it is serialized
rdd = df.rdd
print(type(rdd))
rdd.cache().collect()

# id 19 => manually persisting with MEMORY_AND_DISK, which makes the storage level serialized
df2 = spark.range(100)
from pyspark import StorageLevel
print(type(df2))
df2.persist(StorageLevel.MEMORY_AND_DISK).count()
print(df2.storageLevel)

<class 'pyspark.sql.dataframe.DataFrame'>
Disk Memory Deserialized 1x Replicated
<class 'pyspark.rdd.RDD'>
<class 'pyspark.sql.dataframe.DataFrame'>
Disk Memory Serialized 1x Replicated
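
For reference, here is a minimal sketch of what I think is going on (based on my reading of pyspark/storagelevel.py and dataframe.py in 2.4.x, so treat the JVM-side details as an assumption): PySpark declares StorageLevel.MEMORY_AND_DISK with deserialized=False, since Python rows are pickled anyway, while df.cache() delegates straight to the JVM Dataset.cache(), whose MEMORY_AND_DISK constant is deserialized.  That would line up with the two labels above.

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Python-side constant: note deserialized=False
lvl = StorageLevel.MEMORY_AND_DISK
print(repr(lvl))         # StorageLevel(True, True, False, False, 1)
print(lvl.deserialized)  # False => rendered as "Serialized" in the Storage tab

df = spark.range(10)
df.persist(lvl).count()  # Python-side level converted and passed to the JVM
print(df.storageLevel)   # Disk Memory Serialized 1x Replicated

df2 = spark.range(10)
df2.cache().count()      # delegates to the JVM Dataset.cache()
print(df2.storageLevel)  # Disk Memory Deserialized 1x Replicated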

> On Sep 16, 2019, at 2:02 AM, Jörn Franke <jornfranke@gmail.com> wrote:
> 
> I don’t know your full source code, but you may be missing an action so that it is indeed persisted.
> 
>> On Sep 16, 2019, at 2:07 AM, grp <gpeterne@villanova.edu> wrote:
>> 
>> Hi There Spark Users,
>> 
>> Curious what is going on here.  Not sure if this is a bug or if I am missing something.  Extra eyes are much appreciated.
>> 
>> The Spark UI (Python API 2.4.3) by default reports persisted data-frames as de-serialized MEMORY_AND_DISK; however, I always thought they were serialized for Python by default according to the official documentation.
>> However, when explicitly setting the storage level to the default … ex => df.persist(StorageLevel.MEMORY_AND_DISK) … the Spark UI shows the expected serialized data-frame under the Storage tab, but not when just calling … df.cache().
>> 
>> Do we have to explicitly set … StorageLevel.MEMORY_AND_DISK … to get the serialized benefit in Python (which I thought was automatic)?  Or is the Spark UI incorrect?
>> 
>> SO post with specific example/details => https://stackoverflow.com/questions/56926337/conflicting-pyspark-storage-level-defaults
>> 
>> Thank you for your time and research!
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> 
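
PS regarding the note above about a possibly missing action: in my snippet .count() is the action, but for completeness, persist()/cache() are lazy and nothing shows in the Storage tab until an action runs.  A quick sketch (assuming a local PySpark 2.4.x session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(10).cache()  # lazy: only marks the plan for caching
print(df.is_cached)           # True, but nothing is materialized yet
df.count()                    # the first action actually populates the cache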

