spark-dev mailing list archives

From "Antonin Delpeuch (lists)" <li...@antonin.delpeuch.eu>
Subject Re: Async RDD saves
Date Sat, 08 Aug 2020 10:03:05 GMT
Hi both,

Thanks for your replies!

Sean, your proposal to use a driver-side future wrapping the blocking
call sounds a lot easier indeed.

But I want to ensure that canceling the future in the driver code kills
the corresponding tasks on all executors. If I wrap the driver-side call
in a standard Scala or Java future it will not be cancelable, will it? I
think I would need to interrupt the thread that executes the future somehow.
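
On the cancellation question above: a plain Scala `Future` has no cancel method, but a Java `Future` obtained from an `ExecutorService` does, and `cancel(true)` interrupts the thread that is running the task. Below is a minimal sketch using only the JDK, where `Thread.sleep` stands in for the blocking save call. The class and method names are made up for illustration. Whether an interrupt alone stops the tasks on the executors is not settled in this thread; Spark's `SparkContext.setJobGroup` and `SparkContext.cancelJobGroup` are the calls for cancelling jobs, and the catch block marks where such a call could go.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical class name; nothing here is part of Spark.
class CancellableDriverCall {

    /**
     * Submits a blocking task (a stand-in for a blocking save call), cancels it,
     * and reports whether the worker thread observed the interrupt.
     */
    static boolean cancelInterruptsWorker() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(1);
        AtomicBoolean interrupted = new AtomicBoolean(false);
        try {
            Future<?> future = pool.submit(() -> {
                started.countDown();
                try {
                    // Stand-in for the blocking driver-side call.
                    Thread.sleep(60_000);
                } catch (InterruptedException e) {
                    // Assumption: this is where driver code could also ask Spark
                    // to cancel the job group it registered for this save.
                    interrupted.set(true);
                } finally {
                    finished.countDown();
                }
            });
            // Wait until the task is actually running before cancelling it.
            started.await();
            // mayInterruptIfRunning = true interrupts the thread running the task.
            future.cancel(true);
            finished.await(5, TimeUnit.SECONDS);
            return future.isCancelled() && interrupted.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("worker interrupted: " + cancelInterruptsWorker());
    }
}
```

Running `main` prints whether the worker saw the interrupt. A real driver would replace the sleep with the save call, and would probably pair the interrupt with a job-group cancel, since interrupting a driver thread may not by itself stop tasks already running on executors.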

As you can see I am far from an expert on this topic, sorry if I
misunderstood your proposal.

Cheers,
Antonin


On 07/08/2020 19:53, Edward Mitchell wrote:
> I will agree that the side effects of using Futures in driver code tend
> to be tricky to track down.
> 
> If you forget to clear the job description and job group information,
> then the local properties on the SparkContext remain intact, and
> SparkContext#submitJob makes sure to pass those localProperties down.
> 
> This has led to us doing this hack:
> 
> [image: image.png]
> 
> This can also cause problems with Spark Streaming, where the Streaming UI
> can get messed up when the various streaming-related properties are
> cleared or re-used.
> 
> On Fri, Aug 7, 2020 at 10:38 AM Sean Owen <srowen@gmail.com> wrote:
> 
>     Why do you need to do it, and can you just use a future in your
>     driver code?
> 
>     On Fri, Aug 7, 2020 at 9:01 AM Antonin Delpeuch (lists)
>     <lists@antonin.delpeuch.eu> wrote:
>     >
>     > Hi all,
>     >
>     > Following my request on the user mailing list [1], there does not seem
>     > to be any simple way to save RDDs to the file system in an asynchronous
>     > way. I am looking into implementing this, so I am first checking whether
>     > there is consensus around the idea.
>     >
>     > The goal would be to add methods such as `saveAsTextFileAsync` and
>     > `saveAsObjectFileAsync` to the RDD API.
>     >
>     > I am thinking about doing this by:
>     >
>     > - refactoring SparkHadoopWriter to allow for submitting jobs
>     > asynchronously (with `submitJob` rather than `runJob`)
>     >
>     > - adding a `saveAsHadoopFileAsync` method to `PairRDDFunctions`, as a
>     > counterpart to the existing `saveAsHadoopFile`
>     >
>     > - adding a `saveAsTextFileAsync` method (and equivalents for other
>     > formats) to `AsyncRDDActions`.
>     >
>     > Because SparkHadoopWriter is private, it is complicated to reimplement
>     > this functionality outside of Spark as a user, so I think this would be
>     > an API worth offering. Hopefully it can be implemented without too
>     > much code duplication.
>     >
>     > Cheers,
>     >
>     > Antonin
>     >
>     > [1]: http://apache-spark-user-list.1001560.n3.nabble.com/Async-API-to-save-RDDs-td38320.html
>     >
>     >
>     >
>     > ---------------------------------------------------------------------
>     > To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>     >



