spark-user mailing list archives

From anu <anamika.guo...@gmail.com>
Subject Optimizing SQL Query
Date Fri, 06 Mar 2015 12:07:25 GMT
I have the following query. Could you give me some pointers on how to start
optimizing it with Spark SQL?


sqlContext.sql("""
  SELECT dw.DAY_OF_WEEK, dw.HOUR, avg(dw.SDP_USAGE) AS AVG_SDP_USAGE
  FROM (
      SELECT sdp.WID, DAY_OF_WEEK, HOUR, SUM(INTERVAL_VALUE) AS SDP_USAGE
      FROM (
          SELECT *
          FROM date_d dd JOIN interval_f intf
            ON intf.DATE_WID = dd.WID
          WHERE intf.DATE_WID >= 20141116
            AND intf.DATE_WID <= 20141125
            AND CAST(INTERVAL_END_TIME AS STRING) >= '2014-11-16 00:00:00.000'
            AND CAST(INTERVAL_END_TIME AS STRING) <= '2014-11-26 00:00:00.000'
            AND MEAS_WID = 3
      ) test JOIN sdp_d sdp
        ON test.SDP_WID = sdp.WID
      WHERE sdp.UDC_ID = 'SP-1931201848'
      GROUP BY sdp.WID, DAY_OF_WEEK, HOUR, sdp.UDC_ID
  ) dw
  GROUP BY dw.DAY_OF_WEEK, dw.HOUR
""")
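
As a first step I was thinking of looking at the query plan to see which join
strategies Spark SQL picks for the dimension tables. A rough sketch of what I
mean (not the query above verbatim; it assumes Spark 1.3's DataFrame API and
that date_d, interval_f and sdp_d are registered as temporary tables on this
same sqlContext):

// Sketch only: simplified fact/dimension join, just to inspect the plan.
val dw = sqlContext.sql("""
  SELECT DAY_OF_WEEK, HOUR, SUM(INTERVAL_VALUE) AS SDP_USAGE
  FROM date_d dd JOIN interval_f intf ON intf.DATE_WID = dd.WID
  WHERE intf.DATE_WID >= 20141116 AND intf.DATE_WID <= 20141125
  GROUP BY DAY_OF_WEEK, HOUR
""")

// Prints the logical and physical plans; the interesting part is whether the
// joins against the small dimension tables appear as broadcast joins or as
// shuffled joins.
dw.explain(true)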



Currently the query takes about 15 minutes to execute. The interval_f table
holds approximately 170 GB of raw data, date_d about 170 MB, and sdp_d about
490 MB.
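
One direction I was considering (illustrative only, the threshold value below
is just an example) is caching the two dimension tables and raising the
broadcast-join threshold, so that they are shipped to the executors instead of
the 170 GB fact table being shuffled:

// Sketch under the same assumptions as above (tables registered as temp
// tables on this sqlContext).

// Keep the small dimension tables in memory across repeated queries.
sqlContext.cacheTable("date_d")
sqlContext.cacheTable("sdp_d")

// Raise spark.sql.autoBroadcastJoinThreshold (default is about 10 MB) so that
// tables up to ~512 MB, i.e. date_d and sdp_d, can be broadcast to every
// executor rather than shuffled against interval_f.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
  (512L * 1024 * 1024).toString)

The idea being that date_d and sdp_d are small enough to fit on every executor,
which avoids shuffling interval_f for those joins. Is this the right direction,
or are there better places to start?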


