spark-user mailing list archives

From Adrian Stern <adr...@vidora.com>
Subject bucket joins on multiple data frames.
Date Wed, 08 Sep 2021 19:45:00 GMT
Sorry if this has been answered, but I had a question about bucketed joins
that I can't seem to find the answer to online.


   - I have a bunch of pyspark data frames (let's call them df1, df2,
   ...df10). I need to join them all together using the same key.
      - joined = df1.join(df2, "key", "full")
      - joined = joined.join(df3, "key", "full")
      - joined = joined.join(df4, "key", "full")
      - ...
   - I saw that bucketed joins can help in this situation, but when I try
   it, I only get a bucketed join on the first join. After that I have to
   re-create a bucketed table from the joined result after each join,
   otherwise I don't get a bucketed join. Re-creating the joined table
   only slows the join down, and I don't see any performance gain.
   - Doesn't work: (pseudo code)
      - df1.write-bucketed() ; t1 = spark.table("df1")
      - df2.write-bucketed() ; t2 = spark.table("df2")
      - df3.write-bucketed() ; t3 = spark.table("df3")
      - joined = t1.join(t2, "key", "full")
      - joined = joined.join(t3, "key", "full")
   - Works but is slow: (pseudo code)
      - df1.write-bucketed() ; t1 = spark.table("df1")
      - df2.write-bucketed() ; t2 = spark.table("df2")
      - df3.write-bucketed() ; t3 = spark.table("df3")
      - joined = t1.join(t2, "key", "full")
      - joined.write-bucketed() ; joined = spark.table("joined")
      - joined = joined.join(t3, "key", "full")
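
For anyone who wants to reproduce this, the "doesn't work" case might look
roughly like the sketch below in real PySpark. The helper names
(write_bucketed, join_all, demo), the bucket count of 16, and the choice to
disable broadcast joins are all assumptions made for illustration, not
something taken from my actual job. The sketch uses
DataFrameWriter.bucketBy/sortBy/saveAsTable and prints the physical plan so
that any remaining Exchange (shuffle) nodes are visible.

```python
from functools import reduce


def write_bucketed(df, table_name, key="key", num_buckets=16):
    """Persist df as a table bucketed and sorted on the join key.

    num_buckets=16 is an arbitrary placeholder; all tables should use the
    same bucket count so their buckets line up for the join.
    """
    (df.write
       .bucketBy(num_buckets, key)
       .sortBy(key)
       .mode("overwrite")
       .saveAsTable(table_name))


def join_all(tables, key="key", how="full"):
    """Join a list of DataFrames left to right on the same key."""
    return reduce(lambda left, right: left.join(right, key, how), tables)


def demo(spark, frames):
    """frames: dict mapping a table name to a DataFrame with a `key` column."""
    # Small test tables may otherwise be broadcast, which hides the
    # bucketed (sort-merge) plan this sketch is meant to inspect.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    for name, df in frames.items():
        write_bucketed(df, name)
    tables = [spark.table(name) for name in frames]
    joined = join_all(tables)
    # Each Exchange node in the printed plan is a shuffle.
    joined.explain()
    return joined
```

Calling demo(spark, {"df1": df1, "df2": df2, "df3": df3}) with an existing
SparkSession should print a plan in which it is possible to check which of
the joins still has an Exchange in front of it.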


I'm wondering if there is a way to get performance gains here, either by
using bucketing or some other way.
Also, I'm curious: if this isn't what bucketed joins are for, what are they
actually for?

Thanks
Adrian
