spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Spark shell and StackOverFlowError
Date Mon, 31 Aug 2015 07:40:43 GMT
Yeah, I see that now. I think it fails immediately because the map
operation tries to clean the closure and verify up front that it can be
serialized.
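
For reference, this is roughly what RDD.map does in Spark 1.x
(paraphrased and simplified from the source, not verbatim):

def map[U: ClassTag](f: T => U): RDD[U] = {
  // clean() runs the ClosureCleaner and, by default, also attempts to
  // serialize the cleaned closure, so a non-serializable closure fails
  // fast at transformation time rather than at the first action.
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[T, U](this, (context, pid, iter) => iter.map(cleanF))
}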

I'm not quite sure what is going on, but I think it's some strange
interaction between how you're building up the list, what its resulting
in-memory representation happens to be, and how the closure cleaner
works, which can't be perfect. The shell also introduces an extra layer
of issues.
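
To sketch that extra layer (illustrative class names only; not the
shell's actual generated code):

// The shell compiles each input line into a wrapper object, and later
// wrappers hold references to earlier ones. A closure that uses `a`
// reaches it through that chain, so serializing the closure serializes
// the wrappers and whatever else they hold (e.g. your list).
class Line1 { val a = 10 }
class Line2(val prev: Line1) { val lst = List(("10", "10", 0.0)) }
class Line3(val prev: Line2) {
  // f captures Line3's `this`, which drags in Line2 and lst as well.
  val f = (i: Int) => if (prev.prev.a == 10) 1 else 0
}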

For example, the slightly more canonical approaches work fine:

import scala.collection.mutable.MutableList
val lst = MutableList[(String,String,Double)]()
(0 to 10000).foreach(i => lst += (("10", "10", i.toDouble)))

or just

val lst = (0 to 10000).map(i => ("10", "10", i.toDouble))

If you just need this to work, maybe those are better alternatives anyway.
You can also check whether it works without the shell, as I suspect
that's a factor.
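
If you really do need in-place mutation, an array-backed buffer is
another option worth trying (a sketch, assuming the linked-list
representation is indeed what blows the stack):

import scala.collection.mutable.ArrayBuffer

// ArrayBuffer stores elements in a flat array, so its default Java
// serialization graph stays shallow, unlike a node-per-element list.
val buf = ArrayBuffer[(String, String, Double)]()
(0 to 10000).foreach(i => buf += (("10", "10", i.toDouble)))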

It's not an error in Spark per se. The trace is saying that some
object's default Java serialization graph is very deep, so the code you
wrote plus whatever the closure cleaner pulls in ends up capturing a
huge linked list and serializing it the direct, node-by-node way.
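
You can reproduce that failure mode with no Spark involved at all (a
minimal sketch; the Node class is just for illustration):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Default Java serialization recurses once per linked node, so a long
// enough chain exhausts the stack on its own.
case class Node(value: Int, next: Node)

var chain: Node = null
(0 until 100000).foreach(i => chain = Node(i, chain))

val oos = new ObjectOutputStream(new ByteArrayOutputStream())
oos.writeObject(chain) // throws java.lang.StackOverflowError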

If you have an idea about exactly why it's happening, you can open a
JIRA, but arguably it's something that would be nice to have just work
rather than a problem in Spark per se. Or have a look at other issues
related to the closure cleaner and the shell; you may find this matches
known behavior.


On Sun, Aug 30, 2015 at 8:08 PM, Ashish Shrowty
<ashish.shrowty@gmail.com> wrote:
> Sean .. does the code below work for you in the Spark shell? Ted got the
> same error -
>
> import scala.collection.mutable.MutableList
> val a=10
> val lst = MutableList[(String,String,Double)]()
> Range(0,10000).foreach(i=>lst+=(("10","10",i:Double)))
> sc.makeRDD(lst).map(i=> if(a==10) 1 else 0)
>
> -Ashish
>
>
> On Sun, Aug 30, 2015 at 2:52 PM Sean Owen <sowen@cloudera.com> wrote:
>>
>> I'm not sure how to reproduce it; this code does not produce an error
>> in master.
>>
>> On Sun, Aug 30, 2015 at 7:26 PM, Ashish Shrowty
>> <ashish.shrowty@gmail.com> wrote:
>> > Do you think I should create a JIRA?
>> >
>> >
>> > On Sun, Aug 30, 2015 at 12:56 PM Ted Yu <yuzhihong@gmail.com> wrote:
>> >>
>> >> I got StackOverFlowError as well :-(
>> >>
>> >> On Sun, Aug 30, 2015 at 9:47 AM, Ashish Shrowty
>> >> <ashish.shrowty@gmail.com>
>> >> wrote:
>> >>>
>> >>> Yep .. I tried that too earlier. Doesn't make a difference. Are you
>> >>> able to replicate on your side?
>> >>>
>> >>>
>> >>> On Sun, Aug 30, 2015 at 12:08 PM Ted Yu <yuzhihong@gmail.com> wrote:
>> >>>>
>> >>>> I see.
>> >>>>
>> >>>> What about using the following in place of variable a?
>> >>>>
>> >>>>
>> >>>> http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
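>> >>>>
>> >>>> For instance, something like this (a sketch; the name aBroadcast
>> >>>> is just illustrative):
>> >>>>
>> >>>> // Ship the value via a broadcast instead of capturing the shell
>> >>>> // variable directly in the closure.
>> >>>> val aBroadcast = sc.broadcast(10)
>> >>>> sc.makeRDD(lst).map(i => if (aBroadcast.value == 10) 1 else 0)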
>> >>>>
>> >>>> Cheers
>> >>>>
>> >>>> On Sun, Aug 30, 2015 at 8:54 AM, Ashish Shrowty
>> >>>> <ashish.shrowty@gmail.com> wrote:
>> >>>>>
>> >>>>> @Sean - Agree that there is no action, but I still get the
>> >>>>> StackOverflowError; it's very weird.
>> >>>>>
>> >>>>> @Ted - Variable a is just an int - val a = 10 ... The error happens
>> >>>>> when I try to pass a variable into the closure. The example you have
>> >>>>> above works fine since there is no variable being passed into the
>> >>>>> closure from the shell.
>> >>>>>
>> >>>>> -Ashish
>> >>>>>
>> >>>>> On Sun, Aug 30, 2015 at 9:55 AM Ted Yu <yuzhihong@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Using the Spark shell:
>> >>>>>>
>> >>>>>> scala> import scala.collection.mutable.MutableList
>> >>>>>> import scala.collection.mutable.MutableList
>> >>>>>>
>> >>>>>> scala> val lst = MutableList[(String,String,Double)]()
>> >>>>>> lst: scala.collection.mutable.MutableList[(String, String, Double)] =
>> >>>>>> MutableList()
>> >>>>>>
>> >>>>>> scala> Range(0,10000).foreach(i=>lst+=(("10","10",i:Double)))
>> >>>>>>
>> >>>>>> scala> val rdd=sc.makeRDD(lst).map(i=> if(a==10) 1 else 0)
>> >>>>>> <console>:27: error: not found: value a
>> >>>>>>        val rdd=sc.makeRDD(lst).map(i=> if(a==10) 1 else 0)
>> >>>>>>                                           ^
>> >>>>>>
>> >>>>>> scala> val rdd=sc.makeRDD(lst).map(i=> if(i._1==10) 1 else 0)
>> >>>>>> rdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at
>> >>>>>> <console>:27
>> >>>>>>
>> >>>>>> scala> rdd.count()
>> >>>>>> ...
>> >>>>>> 15/08/30 06:53:40 INFO DAGScheduler: Job 0 finished: count at
>> >>>>>> <console>:30, took 0.478350 s
>> >>>>>> res1: Long = 10000
>> >>>>>>
>> >>>>>> Ashish:
>> >>>>>> Please refine your example to mimic more closely what your code
>> >>>>>> actually did.
>> >>>>>>
>> >>>>>> Thanks
>> >>>>>>
>> >>>>>> On Sun, Aug 30, 2015 at 12:24 AM, Sean Owen <sowen@cloudera.com>
>> >>>>>> wrote:
>> >>>>>>>
>> >>>>>>> That can't cause any error, since there is no action in your first
>> >>>>>>> snippet. Even calling count on the result doesn't cause an error.
>> >>>>>>> You must be executing something different.
>> >>>>>>>
>> >>>>>>> On Sun, Aug 30, 2015 at 4:21 AM, ashrowty
>> >>>>>>> <ashish.shrowty@gmail.com>
>> >>>>>>> wrote:
>> >>>>>>> > I am running the Spark shell (1.2.1) in local mode and I have a
>> >>>>>>> > simple RDD[(String,String,Double)] with about 10,000 objects in
>> >>>>>>> > it. I get a StackOverFlowError each time I try to run the
>> >>>>>>> > following code (the code itself is just representative of other
>> >>>>>>> > logic where I need to pass in a variable). I tried broadcasting
>> >>>>>>> > the variable too, but no luck .. missing something basic here -
>> >>>>>>> >
>> >>>>>>> > val rdd = sc.makeRDD(List(<Data read from file>))
>> >>>>>>> > val a=10
>> >>>>>>> > rdd.map(r => if (a==10) 1 else 0)
>> >>>>>>> > This throws -
>> >>>>>>> >
>> >>>>>>> > java.lang.StackOverflowError
>> >>>>>>> >     at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:318)
>> >>>>>>> >     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1133)
>> >>>>>>> >     at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>> >>>>>>> >     at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>> >>>>>>> >     at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>> >>>>>>> >     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>> >>>>>>> >     at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>> >>>>>>> >     at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>> >>>>>>> >     at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>> >>>>>>> > ...
>> >>>>>>> > ...
>> >>>>>>> >
>> >>>>>>> > More experiments .. this works -
>> >>>>>>> >
>> >>>>>>> > val lst = Range(0,10000).map(i=>("10","10",i:Double)).toList
>> >>>>>>> > sc.makeRDD(lst).map(i=> if(a==10) 1 else 0)
>> >>>>>>> >
>> >>>>>>> > But the below doesn't, and throws the StackOverflowError -
>> >>>>>>> >
>> >>>>>>> > val lst = MutableList[(String,String,Double)]()
>> >>>>>>> > Range(0,10000).foreach(i=>lst+=(("10","10",i:Double)))
>> >>>>>>> > sc.makeRDD(lst).map(i=> if(a==10) 1 else 0)
>> >>>>>>> >
>> >>>>>>> > Any help appreciated!
>> >>>>>>> >
>> >>>>>>> > Thanks,
>> >>>>>>> > Ashish
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>> > --
>> >>>>>>> > View this message in context:
>> >>>>>>> >
>> >>>>>>> > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-shell-and-StackOverFlowError-tp24508.html
>> >>>>>>> > Sent from the Apache Spark User List mailing list archive at
>> >>>>>>> > Nabble.com.
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>
>> >>>>
>> >>
>> >

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

