spark-dev mailing list archives

From Edward Mitchell <edee...@gmail.com>
Subject Re: Async RDD saves
Date Fri, 07 Aug 2020 17:53:41 GMT
I will agree that the side effects of using Futures in driver code tend to
be tricky to track down.

If you forget to clear the job description and job group information, then
the local properties on the SparkContext remain intact:
SparkContext#submitJob makes sure to pass down the localProperties, so
later jobs submitted from the same thread carry the stale values.

This has led to us doing this hack:

[image: image.png]
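
Independently of whatever the image showed, the cleanup described above can
be sketched as follows. This is a minimal illustration, not Spark's own
code: `JobGroupScope`, `withJobGroup` and `saveAsTextFileInFuture` are
hypothetical names, and the sketch assumes that clearing the properties
(rather than restoring earlier values) is acceptable for the calling thread.
It relies only on `SparkContext#setJobGroup`, `SparkContext#clearJobGroup`
and `RDD#saveAsTextFile`, and needs Spark on the classpath.

```scala
import scala.concurrent.{ExecutionContext, Future}

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper object; none of these names exist in Spark itself.
object JobGroupScope {

  // Sets the job group and description on the current thread, runs `body`,
  // and always clears them again, even if `body` throws.
  def withJobGroup[T](sc: SparkContext, groupId: String, description: String)(body: => T): T = {
    sc.setJobGroup(groupId, description)
    try body
    finally sc.clearJobGroup() // clears group id, description and interruptOnCancel
  }

  // Runs a blocking save inside a Future. The properties are set on the
  // pool thread that submits the job, and cleared before that thread is
  // handed back to the pool for reuse.
  def saveAsTextFileInFuture[T](rdd: RDD[T], path: String)(
      implicit ec: ExecutionContext): Future[Unit] =
    Future {
      withJobGroup(rdd.sparkContext, "async-save", s"Saving to $path") {
        rdd.saveAsTextFile(path)
      }
    }
}
```

The try/finally matters mostly with thread pools: a pool thread that keeps
its local properties will apply them to whatever unrelated job runs on it
next. If the calling thread may already have a job group set, saving the
previous values and restoring them would be the safer variant.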

This can also cause problems with Spark Streaming, where the Streaming UI
can get messed up when the various streaming-related properties are
cleared or reused.

On Fri, Aug 7, 2020 at 10:38 AM Sean Owen <srowen@gmail.com> wrote:

> Why do you need to do it, and can you just use a future in your driver
> code?
>
> On Fri, Aug 7, 2020 at 9:01 AM Antonin Delpeuch (lists)
> <lists@antonin.delpeuch.eu> wrote:
> >
> > Hi all,
> >
> > Following my request on the user mailing list [1], there does not seem
> > to be any simple way to save RDDs to the file system in an asynchronous
> > way. I am looking into implementing this, so I am first checking whether
> > there is consensus around the idea.
> >
> > The goal would be to add methods such as `saveAsTextFileAsync` and
> > `saveAsObjectFileAsync` to the RDD API.
> >
> > I am thinking about doing this by:
> >
> > - refactoring SparkHadoopWriter to allow for submitting jobs
> > asynchronously (with `submitJob` rather than `runJob`)
> >
> > - add a `saveAsHadoopFileAsync` method in `PairRDDFunctions`,
> > counterpart to the existing `saveAsHadoopFile`
> >
> > - add a `saveAsTextFileAsync` (and other formats) in `AsyncRDDActions`.
> >
> > Because SparkHadoopWriter is private, it is complicated to reimplement
> > this functionality outside of Spark as a user, so I think this would be
> > an API worth offering. It should be possible to implement this without
> > too much code duplication hopefully.
> >
> > Cheers,
> >
> > Antonin
> >
> > [1]: http://apache-spark-user-list.1001560.n3.nabble.com/Async-API-to-save-RDDs-td38320.html
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> >
