spark-user mailing list archives

From Mich Talebzadeh <>
Subject Re: repartition in Spark
Date Mon, 09 Nov 2020 23:14:07 GMT
As a general answer: in a distributed environment like Spark, making sure
that data is distributed evenly among all nodes (assuming every node is the
same or similar) can help performance.

repartition thus controls how the data is distributed among the nodes.
However, it is not that straightforward. Your mileage will vary, because
changing the distribution incurs the cost of physically moving data between
the cluster nodes (a so-called shuffle).

So there is a cost associated with repartition, because it creates a shuffle.
You can see this in the physical plan, either by calling df.explain() or by
looking at the Spark GUI.
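
A minimal PySpark sketch of that check, assuming a local session and a
made-up DataFrame built with spark.range (the function name, app name and row
count are illustrative only, not from the original question). The shuffle
should appear in the printed plan as an Exchange step.

```python
def show_repartition_plan(n=4):
    """Print the physical plan of a repartitioned DataFrame and return its partition count."""
    # Imported inside the function so that defining it does not require a running Spark install.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")               # assumption: local mode, for illustration only
        .appName("repartition-demo")      # hypothetical application name
        .getOrCreate()
    )
    df = spark.range(0, 1_000_000)        # hypothetical data standing in for the real query
    repartitioned = df.repartition(n)

    # Prints the physical plan; the Exchange node is the shuffle added by repartition.
    repartitioned.explain()

    # Number of partitions of the resulting DataFrame.
    num_partitions = repartitioned.rdd.getNumPartitions()
    spark.stop()
    return num_partitions
```

Calling show_repartition_plan(4) should print the plan and return the number
of partitions after the repartition. The partition count belongs to the
DataFrame as a whole, not to each node separately.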

In its simplest form, repartition(n) spreads the rows evenly over n
partitions in round-robin fashion, regardless of their content, and I think
that is the most common form. Whether it helps depends on the volume of
data. For smaller volumes I don't think it really matters. For large
volumes, repartition may be an option if the data being joined is skewed.
Either way, you need to know the volume of data before deciding on a
partitioning.
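
To make the skew point concrete, here is a toy pure-Python model (not Spark
itself) that compares round-robin placement, as used by repartition(n), with
placement by a hash of a key, as used by repartition(n, col). The data, the
function names and the use of Python's built-in hash as a stand-in for
Spark's hash function are all assumptions made for illustration.

```python
def round_robin_sizes(rows, n):
    """Place row i in partition i % n, ignoring the row's content."""
    sizes = [0] * n
    for i, _ in enumerate(rows):
        sizes[i % n] += 1
    return sizes


def key_hash_sizes(rows, n, key=lambda row: row[0]):
    """Place each row by a hash of its key, so equal keys land in the same partition."""
    sizes = [0] * n
    for row in rows:
        # Python's hash() is only a stand-in for Spark's own hash function.
        sizes[hash(key(row)) % n] += 1
    return sizes


# Hypothetical skewed data: 9,000 of 10,000 rows share the key 1.
rows = [(1, "order")] * 9000 + [(k, "order") for k in range(2, 1002)]

rr = round_robin_sizes(rows, 4)
by_key = key_hash_sizes(rows, 4)

print("round robin:", rr)       # partitions of equal size
print("by key:     ", by_key)   # one partition holds almost all rows
```

In this model the round-robin partitions come out equal, while partitioning
by the skewed key leaves one partition with most of the rows. That is the
situation to look for before choosing how to repartition large, skewed data.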


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

On Mon, 9 Nov 2020 at 16:57,
<> wrote:

> Hi,
> Just need some advice.
>    1. When we have multiple Spark nodes running code, under what
>    conditions does a repartition make sense?
>    2. Can we repartition and cache the result --> df = spark.sql("select
>    from ...").repartition(4).cache
>    3. If we choose repartition(4), will that repartition apply to
>    all nodes running the code, and how can one see that?
> Thanks