spark-user mailing list archives

From: Roger Marin <ro...@rogersmarin.com>
Subject: Re: How can we connect RDD from previous job to next job
Date: Mon, 29 Aug 2016 04:39:06 GMT

Hi Sachin,

Have a look at the spark-jobserver project. It lets you share RDDs and
DataFrames between Spark jobs running in the same context. The catch is
that you have to implement your Spark job as a spark-jobserver job.

https://github.com/spark-jobserver/spark-jobserver/blob/master/README.md
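
If a shared long-running context is not an option, the file-based route
from the question below (options 2 and 3) boils down to one loop: load the
previous result if it exists, merge it with the new batch, and store the
merged result for the next run. The following is only a rough sketch in
plain Python, with a JSON file standing in for Spark and Parquet. The names
run_job and load_previous and the counting logic are made up for
illustration; they are not part of Spark or spark-jobserver.

```python
import json
import os
import tempfile
from collections import Counter


def load_previous(path):
    """Return the result saved by the previous run, or an empty Counter on the first run."""
    if not os.path.exists(path):
        return Counter()
    with open(path) as f:
        return Counter(json.load(f))


def run_job(source_records, state_path):
    """One 'job': process the new input, merge it with the previous result, store the merge."""
    # Stand-in for the real RDD processing: count occurrences of each record.
    new_counts = Counter(source_records)

    # Merge with whatever the previous run left behind.
    merged = load_previous(state_path) + new_counts

    # Write to a temporary file first, then rename, so a failed run
    # does not leave a half-written result for the next job to read.
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(merged, f)
    os.replace(tmp_path, state_path)
    return merged


if __name__ == "__main__":
    state = os.path.join(tempfile.mkdtemp(), "result.json")
    print(run_job(["a", "b", "a"], state))  # first run: nothing to merge with
    print(run_job(["b", "c"], state))       # second run: merges with the first result
```

In Spark 1.6 the load and store steps would be sqlContext.read.parquet(path)
and df.write.parquet(path) respectively. Writing each run to a fresh output
path, rather than overwriting the one just read, is probably the safer choice.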

Regards,
Roger

On 29 Aug 2016 14:30, "Sachin Mittal" <sjmittal@gmail.com> wrote:

> Hi,
> I would like some thoughts, inputs, or a starting point for achieving the
> following scenario.
> I submit a job using spark-submit with a certain set of parameters.
>
> It reads data from a source, does some processing on RDDs and generates
> some output and completes.
>
> Then I submit the same job again with the next set of parameters.
> It should also read data from a source and do the same processing, and at
> the same time read the result generated by the previous job, merge the two,
> and store the results again.
>
> This process goes on and on.
>
> So I need to store the RDD, or its output, from the previous job in some
> storage to make it available to the next job.
>
> What are my options?
> 1. Use checkpoint
> Can I checkpoint the RDD at the final stage and then load the same RDD
> again by specifying the checkpoint path in the next job? Is checkpoint
> right for this kind of situation?
>
> 2. Save the output of the previous job to a JSON file and then create a
> DataFrame from it in the next job.
> Have I got this right? Is this option better than option 1?
>
> 3. I have heard a lot about Parquet files. However, I don't know how they
> integrate with Spark.
> Can I use them here as intermediate storage?
> Is this available in Spark 1.6?
>
> Any other thoughts or ideas?
>
> Thanks
> Sachin
