spark-issues mailing list archives

From "Sean Owen (JIRA)" <>
Subject [jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
Date Sun, 16 Oct 2016 12:51:20 GMT


Sean Owen commented on SPARK-650:

Yeah that's a decent use case, because latency is an issue (streaming) and you potentially
have time to set up before latency matters. 

You can still use this approach, because empty RDDs still arrive when no data has, and an
empty RDD can still be repartitioned. Here's a way to do something once per partition of
the first RDD, if that RDD has no data, which ought to amount to at least once per executor:

var first = true
lines.foreachRDD { rdd =>
  if (first) {
    if (rdd.isEmpty) {
      // Spread the empty partitions across the cluster and run the init in each one
      rdd.repartition(sc.defaultParallelism).foreachPartition(_ => Thing.initOnce())
    }
    first = false
  }
  // ... normal processing of rdd ...
}

"Ought", because there isn't actually a guarantee that it will put the empty partitions on
different executors. In practice it seems to, when I just tried it.
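Since several of those repartitioned partitions can land on the same executor, it's worth making the init idempotent per JVM. Here's a minimal sketch of such a guard; the `Thing` object and `initOnce` are just the hypothetical names from the snippet above, not a Spark API:

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical guard object: even if several of the repartitioned empty
// partitions run on the same executor JVM, the setup body executes at most once.
object Thing {
  private val initialized = new AtomicBoolean(false)
  @volatile private var setupCount = 0

  def initOnce(): Unit = {
    // compareAndSet flips false -> true exactly once per JVM, so only the
    // first caller runs the setup body; later calls are cheap no-ops.
    if (initialized.compareAndSet(false, true)) {
      setupCount += 1 // stand-in for real setup, e.g. configuring a reporting library
    }
  }

  def timesSetUp: Int = setupCount
}
```

With that in place, calling Thing.initOnce() from every partition is safe: the setup runs at most once per executor no matter how many partitions that executor receives.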

That's a partial solution, but it's an optimization anyway, and maybe it helps you right now.
I am still not sure it means this needs a whole mechanism, if this is the only type of use
case. Maybe there are others.

> Add a "setup hook" API for running initialization code on each executor
> -----------------------------------------------------------------------
>                 Key: SPARK-650
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Matei Zaharia
>            Priority: Minor
> Would be useful to configure things like reporting libraries

This message was sent by Atlassian JIRA

