spark-user mailing list archives

From "Subhajit Purkayastha" <spurk...@p3si.net>
Subject Spark DataFrame Join _ performance issues
Date Mon, 19 Sep 2016 19:28:51 GMT
I am running Spark 1.5.2 in a VirtualBox VM with 10 GB of memory
allocated to it.

I have a fact table extract with 10,000 rows:

var glbalance_df_select = glbalance_df
  .select("LEDGER_ID", "CODE_COMBINATION_ID", "CURRENCY_CODE",
          "PERIOD_TYPE", "TEMPLATE_ID",
          "PERIOD_NAME", "ACTUAL_FLAG", "BUDGET_VERSION_ID",
          "TRANSLATED_FLAG", "PERIOD_NET_DR", "PERIOD_NET_CR",
          "BEGIN_BALANCE_DR", "BEGIN_BALANCE_CR")
  .filter(
    not(glbalance_df("CURRENCY_CODE") === "STAT") and
    (glbalance_df("TEMPLATE_ID").isNull || glbalance_df("TEMPLATE_ID") === "None") and
    (glbalance_df("TRANSLATED_FLAG") === "Y" ||
      glbalance_df("TRANSLATED_FLAG").isNull ||
      glbalance_df("TRANSLATED_FLAG") === "None") and
    (glbalance_df("ACTUAL_FLAG") === "A" or glbalance_df("ACTUAL_FLAG") === "B")
  )

I join the fact table to the first dimension (100 rows):

var glbalance_ledger_df = glbalance_df_select.join(ledger_df_select,
    glbalance_df_select("LEDGER_ID") <=> ledger_df_select("LEDGER_ID"),
    "inner")
  .drop(ledger_df_select("LEDGER_ID"))

When I save the DataFrame "glbalance_ledger_df" to a text file, it saves the
data in 1 minute.

The second dimension DataFrame:

tableName = "w_gl_period_d"
var period_df_select = msc.table(s"$dbName.$tableName")
period_df_select = period_df_select
  .select("PERIOD_NAME", "PERIOD_TYPE", "PERIOD_SET_NAME", "START_DATE", "END_DATE")
  .cache()

Now I join the second dimension DataFrame to the result of the first join and
save the data. This takes 2 hours.

var glbalance_ledger_period_df = glbalance_ledger_df.join(period_df_select,
    glbalance_ledger_df("PERIOD_SET_NAME") <=> period_df_select("PERIOD_SET_NAME")
      && glbalance_ledger_df("PERIOD_NAME") <=> period_df_select("PERIOD_NAME")
      && glbalance_ledger_df("PERIOD_TYPE") <=> period_df_select("PERIOD_TYPE"),
    "inner")

How do I improve the performance of the join?

Thx,
Subhajit

