spark-dev mailing list archives

From Jerry Lam <chiling...@gmail.com>
Subject Re: Broadcast Variable Life Cycle
Date Tue, 30 Aug 2016 15:43:59 GMT
Hi Sean,

Thank you for the response. The only problem is that actively managing
broadcast variables requires returning them to the caller whenever the
function that creates them does not contain an action. That is, in many
cases the scope that uses the broadcast variables cannot destroy them.
For example:

==============
def performTransformation(rdd: RDD[Int]) = {
   val sharedMap = sc.broadcast(map)
   rdd.map{id =>
      val localMap = sharedMap.value
      (id, localMap(id))
   }
}

def main = {
    ....
    performTransformation(rdd).toDF("id", "i").write.parquet("dummy_example")
}
==============

In the example above, we cannot destroy sharedMap before write.parquet is
executed, because the RDD is evaluated lazily. We will get an exception if
I call sharedMap.destroy like this:

==============
def performTransformation(rdd: RDD[Int]) = {
   val sharedMap = sc.broadcast(map)
   val result = rdd.map{id =>
      val localMap = sharedMap.value
      (id, localMap(id))
   }
   // result has not been evaluated yet, so this destroys sharedMap too early
   sharedMap.destroy
   result
}
==============
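
For illustration, here is a minimal sketch of the workaround mentioned at the
top of this message: the function returns the broadcast handle together with
the lazy RDD, and the caller destroys it only after the action has run. The
object name, the extra parameters, the sample lookup map, the local
SparkSession setup and the overwrite mode are all assumptions made for the
sketch, and it assumes Spark 2.x, where Broadcast.destroy() is public.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object BroadcastLifecycleSketch {

  // Hypothetical variant of performTransformation: it returns the lazy RDD
  // together with the broadcast handle so that the caller controls cleanup.
  def performTransformation(
      spark: SparkSession,
      rdd: RDD[Int],
      lookup: Map[Int, String]
  ): (RDD[(Int, String)], Broadcast[Map[Int, String]]) = {
    val sharedMap = spark.sparkContext.broadcast(lookup)
    val result = rdd.map { id =>
      val localMap = sharedMap.value
      (id, localMap(id))
    }
    (result, sharedMap)
  }

  def main(args: Array[String]): Unit = {
    // Local session, assumed here only to keep the sketch self-contained.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-lifecycle-sketch")
      .getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))
    val lookup = Map(1 -> "a", 2 -> "b", 3 -> "c") // sample data (assumed)

    val (result, sharedMap) = performTransformation(spark, rdd, lookup)
    try {
      // The action runs here, so the broadcast must still be alive.
      result.toDF("id", "i").write.mode("overwrite").parquet("dummy_example")
    } finally {
      // Safe only now: no further job will read sharedMap.
      sharedMap.destroy()
    }
    spark.stop()
  }
}
```

This works, but it leaks the broadcast handle into the signature of every
function that builds a lazy transformation, which is exactly the part I would
like to avoid.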

Am I missing something? Is there a better way to do this without returning
the broadcast variables to the main function?

Best Regards,

Jerry



On Mon, Aug 29, 2016 at 12:11 PM, Sean Owen <sowen@cloudera.com> wrote:

> Yes you want to actively unpersist() or destroy() broadcast variables
> when they're no longer needed. They can eventually be removed when the
> reference on the driver is garbage collected, but you usually would
> not want to rely on that.
>
> On Mon, Aug 29, 2016 at 4:30 PM, Jerry Lam <chilinglam@gmail.com> wrote:
> > Hello spark developers,
> >
> > Can anyone shed some light on the life cycle of broadcast variables?
> > Basically, I have a broadcast variable defined in a loop, and in each
> > iteration I provide a different value.
> > // For example:
> > for (i <- 1 to 10) {
> >     val bc = sc.broadcast(i)
> >     sc.parallelize(Seq(1,2,3)).map{id => val i = bc.value; (id, i)}
> >       .toDF("id", "i").write.parquet("/dummy_output")
> > }
> >
> > Do I need to actively manage the broadcast variable in this case? I know
> > this example is not real, but please imagine that the broadcast variable
> > holds an array of 1M Longs.
> >
> > Regards,
> >
> > Jerry
> >
> >
> >
> > On Sun, Aug 21, 2016 at 1:07 PM, Jerry Lam <chilinglam@gmail.com> wrote:
> >>
> >> Hello spark developers,
> >>
> >> Can someone explain the lifecycle of a broadcast variable? When will a
> >> broadcast variable be garbage-collected on the driver side and on the
> >> executor side? Does a Spark application need to actively manage its
> >> broadcast variables to ensure that it does not run out of memory?
> >>
> >> Best Regards,
> >>
> >> Jerry
> >
> >
>
