spark-user mailing list archives

From JaeBoo Jung <itsjb.j...@samsung.com>
Subject Re: spark 1.2 three times slower than spark 1.1, why?
Date Wed, 21 Jan 2015 09:01:21 GMT
<HTML><HEAD><TITLE>Samsung Enterprise Portal mySingle</TITLE>
<META content=IE=5 http-equiv=X-UA-Compatible>
<META content="text/html; charset=utf-8" http-equiv=Content-Type>
<STYLE id=mysingle_style type=text/css>P {
	MARGIN-BOTTOM: 5px; FONT-SIZE: 9pt; FONT-FAMILY: Arial, arial; MARGIN-TOP: 5px
}
TD {
	MARGIN-BOTTOM: 5px; FONT-SIZE: 9pt; FONT-FAMILY: Arial, arial; MARGIN-TOP: 5px
}
LI {
	MARGIN-BOTTOM: 5px; FONT-SIZE: 9pt; FONT-FAMILY: Arial, arial; MARGIN-TOP: 5px
}
BODY {
	FONT-FAMILY: Arial, arial; MARGIN: 10px; LINE-HEIGHT: 1.4
}
</STYLE>

<META name=GENERATOR content=ActiveSquare></HEAD>
<BODY>
<P>I recently ran into a similar issue, but unfortunately I could not find out why it happened.</P>
<P>Here is the JIRA ticket from my previous post: <A href="https://issues.apache.org/jira/browse/SPARK-5081">https://issues.apache.org/jira/browse/SPARK-5081</A>.</P>
<P>Please compare the shuffle I/O of the two versions in the Spark web UI, because your slowdown may be related to my case.</P>
<P>&nbsp;</P>
<P>Thanks</P>
<P>Kevin</P>
<P>&nbsp;</P>
<P>------- <B>Original Message</B> -------</P>
<P><B>Sender</B> : Fengyun RAO&lt;raofengyun@gmail.com&gt;</P>
<P><B>Date</B> : 2015-01-21 17:41 (GMT+09:00)</P>
<P><B>Title</B> : Re: spark 1.2 three times slower than spark 1.1, why?</P>
<P>&nbsp;</P>
<DIV dir=ltr>
<DIV class=markdown-here-wrapper>
<P style="MARGIN: 1.2em 0px">Maybe you mean a different <CODE style="FONT-SIZE: 0.85em;
BORDER-TOP: rgb(234,234,234) 1px solid; FONT-FAMILY: Consolas,Inconsolata,Courier,monospace;
BORDER-RIGHT: rgb(234,234,234) 1px solid; WHITE-SPACE: pre-wrap; BORDER-BOTTOM: rgb(234,234,234)
1px solid; PADDING-BOTTOM: 0px; PADDING-TOP: 0px; PADDING-LEFT: 0.3em; MARGIN: 0px 0.15em;
BORDER-LEFT: rgb(234,234,234) 1px solid; DISPLAY: inline; PADDING-RIGHT: 0.3em; BACKGROUND-COLOR:
rgb(248,248,248); border-radius: 3px">spark-submit</CODE> script?</P>
<P style="MARGIN: 1.2em 0px">We also use the same <CODE style="FONT-SIZE: 0.85em;
BORDER-TOP: rgb(234,234,234) 1px solid; FONT-FAMILY: Consolas,Inconsolata,Courier,monospace;
BORDER-RIGHT: rgb(234,234,234) 1px solid; WHITE-SPACE: pre-wrap; BORDER-BOTTOM: rgb(234,234,234)
1px solid; PADDING-BOTTOM: 0px; PADDING-TOP: 0px; PADDING-LEFT: 0.3em; MARGIN: 0px 0.15em;
BORDER-LEFT: rgb(234,234,234) 1px solid; DISPLAY: inline; PADDING-RIGHT: 0.3em; BACKGROUND-COLOR:
rgb(248,248,248); border-radius: 3px">spark-submit</CODE> script, and thus the same memory,
cores, and other configuration.</P>
</DIV></DIV>
<DIV class=gmail_extra><BR>
<DIV class=gmail_quote>2015-01-21 15:45 GMT+08:00 Sean Owen <SPAN dir=ltr>&lt;<A
href="mailto:sowen@cloudera.com" target=_blank>sowen@cloudera.com</A>&gt;</SPAN>:
<BR>
<BLOCKQUOTE class=gmail_quote style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT:
rgb(204,204,204) 1px solid">
<P dir=ltr>I don't know of any reason to think the singleton pattern doesn't work or
works differently. I wonder if, for example, task scheduling is different in 1.2 and you have
more partitions across more workers and so are loading more copies more slowly into your singletons.
</P>
<DIV class=HOEnZb>
<DIV class=h5>
<DIV class=gmail_quote>On Jan 21, 2015 7:13 AM, "Fengyun RAO" &lt;<A href="mailto:raofengyun@gmail.com"
target=_blank>raofengyun@gmail.com</A>&gt; wrote: <BR type="attribution">
<BLOCKQUOTE class=gmail_quote style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT:
rgb(204,204,204) 1px solid">
<DIV dir=ltr>
<DIV>
<P style="MARGIN: 1.2em 0px">The <CODE style="FONT-SIZE: 0.85em; BORDER-TOP: rgb(234,234,234)
1px solid; FONT-FAMILY: Consolas,Inconsolata,Courier,monospace; BORDER-RIGHT: rgb(234,234,234)
1px solid; WHITE-SPACE: pre-wrap; BORDER-BOTTOM: rgb(234,234,234) 1px solid; PADDING-BOTTOM:
0px; PADDING-TOP: 0px; PADDING-LEFT: 0.3em; MARGIN: 0px 0.15em; BORDER-LEFT: rgb(234,234,234)
1px solid; DISPLAY: inline; PADDING-RIGHT: 0.3em; BACKGROUND-COLOR: rgb(248,248,248); border-radius:
3px">LogParser</CODE> instance is not serializable, and thus cannot be broadcast.
</P>
<P style="MARGIN: 1.2em 0px">What’s worse, it contains an LRU cache that is essential
to performance, and we would like to share it among all the tasks on the same node.</P>
<P style="MARGIN: 1.2em 0px">If that is the case, what’s the recommended way to share
a variable among all the tasks within the same executor?</P>
</DIV></DIV>
<DIV class=gmail_extra><BR>
<DIV class=gmail_quote>2015-01-21 15:04 GMT+08:00 Davies Liu <SPAN dir=ltr>&lt;<A
href="mailto:davies@databricks.com" target=_blank>davies@databricks.com</A>&gt;</SPAN>:
<BR>
<BLOCKQUOTE class=gmail_quote style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT:
rgb(204,204,204) 1px solid">Maybe some change related to closure serialization means LogParser
is <BR>no longer a singleton, so it is initialized for every task. <BR><BR>Could
you change it to a Broadcast? <BR>
<DIV>
<DIV><BR>On Tue, Jan 20, 2015 at 10:39 PM, Fengyun RAO &lt;<A href="mailto:raofengyun@gmail.com"
target=_blank>raofengyun@gmail.com</A>&gt; wrote: <BR>&gt; Currently
we are migrating from Spark 1.1 to Spark 1.2, but found the <BR>&gt; program 3x
slower, with nothing else changed. <BR>&gt; Note: our program in Spark 1.1 has successfully
processed a whole year of data <BR>&gt; and is quite stable. <BR>&gt; <BR>&gt;
the main script is as below <BR>&gt; <BR>&gt; sc.textFile(inputPath) <BR>&gt;
.flatMap(line =&gt; LogParser.parseLine(line)) <BR>&gt; .groupByKey(new HashPartitioner(numPartitions))
<BR>&gt; .mapPartitionsWithIndex(...) <BR>&gt; .foreach(_ =&gt; {})
<BR>&gt; <BR>&gt; where LogParser is a singleton which may take some time
to initialize and <BR>&gt; is shared across the executor. <BR>&gt; <BR>&gt;
the flatMap stage is 3x slower. <BR>&gt; <BR>&gt; We tried to change spark.shuffle.manager
back to hash, and <BR>&gt; spark.shuffle.blockTransferService back to nio, but it didn’t
help. <BR>&gt; <BR>&gt; Could somebody explain possible causes, or what
we should test or change to <BR>&gt; find out? <BR></DIV></DIV></BLOCKQUOTE></DIV><BR></DIV></BLOCKQUOTE></DIV></DIV></DIV></BLOCKQUOTE></DIV></DIV>
<P>&nbsp;</P></BODY></HTML>