spark-issues mailing list archives

From "Reynold Xin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-16827) Query with Join produces excessive amount of shuffle data
Date Tue, 04 Oct 2016 19:05:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546330#comment-15546330 ]

Reynold Xin commented on SPARK-16827:
-------------------------------------

We should probably separate the on-disk spill from the shuffle size. Would you have time to
work on that?
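
The proposal above is to report the two quantities as separate metrics rather than folding temporary spill into the shuffle-data figure. As a toy sketch of that accounting (plain Python, not Spark's actual metrics code; the class and field names here are hypothetical), it might look like:

```python
# Toy illustration (not Spark source): track bytes written for shuffle
# output and bytes spilled to disk as two independent counters, so a
# large sort/aggregation spill cannot inflate the reported shuffle size.

class TaskMetrics:
    """Hypothetical per-task counters, kept separate by write reason."""

    def __init__(self):
        self.shuffle_bytes_written = 0  # data handed to the next stage
        self.disk_bytes_spilled = 0     # temporary spill during execution

    def record_shuffle_write(self, n_bytes):
        self.shuffle_bytes_written += n_bytes

    def record_spill(self, n_bytes):
        self.disk_bytes_spilled += n_bytes


m = TaskMetrics()
m.record_shuffle_write(32 * 1024)   # the 32KB the stage actually emits
m.record_spill(400 * 1024 ** 3)     # spill reported separately, not as shuffle data

print(m.shuffle_bytes_written)  # 32768
print(m.disk_bytes_spilled)     # 429496729600
```

With counters kept separate like this, a report such as the one below would show Stage 2's shuffle output staying small even when spill is large.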


> Query with Join produces excessive amount of shuffle data
> ---------------------------------------------------------
>
>                 Key: SPARK-16827
>                 URL: https://issues.apache.org/jira/browse/SPARK-16827
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Sital Kedia
>              Labels: performance
>
> One of our Hive jobs, which looks like this -
> {code}
>  SELECT userid
>    FROM table1 a
>    JOIN table2 b
>      ON a.ds = '2016-07-15'
>     AND b.ds = '2016-07-15'
>     AND a.source_id = b.id
> {code}
> After upgrading to Spark 2.0, the job is significantly slower. Digging into it, we
> found that one of the stages produces an excessive amount of shuffle data. Note that
> this is a regression from Spark 1.6: Stage 2 of the job, which used to produce 32KB of
> shuffle data with 1.6, now produces more than 400GB with Spark 2.0. We also tried
> turning off whole-stage code generation, but that did not help.
> PS - Even if the intermediate shuffle data size is huge, the job still produces correct
> output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

