spark-user mailing list archives

From Sachin Mittal <sjmit...@gmail.com>
Subject Re: How can we connect RDD from previous job to next job
Date Mon, 29 Aug 2016 08:27:51 GMT
I understand the approach.
Does Spark 1.6 support the Parquet format, that is, saving to and loading
from Parquet files?

Also, my understanding of checkpointing is that the RDD's location on the
filesystem is not removed when the job is over, so I could read that RDD in
the next job.
Is that one of the use cases of checkpointing? In other words, can my
current problem be solved with a checkpoint?

Also, which option would be better: storing the output of the RDD in
persistent storage, or storing the RDD itself using a checkpoint?

Thanks
Sachin




On Mon, Aug 29, 2016 at 1:39 PM, Sean Owen <sowen@cloudera.com> wrote:

> You just save the data in the RDD in whatever form you want to
> whatever persistent storage you want, and then re-read it from another
> job. This could be Parquet format on HDFS for example. Parquet is just
> a common file format. There is no need to keep the job running just to
> keep an RDD alive.
>
> On Mon, Aug 29, 2016 at 5:30 AM, Sachin Mittal <sjmittal@gmail.com> wrote:
> > Hi,
> > I would need some thoughts or inputs or any starting point to achieve
> > following scenario.
> > I submit a job using spark-submit with a certain set of parameters.
> >
> > It reads data from a source, does some processing on RDDs and generates
> > some output and completes.
> >
> > Then I submit same job again with next set of parameters.
> > It should also read data from a source do same processing and at the same
> > time read data from the result generated by previous job and merge the
> > two and again store the results.
> >
> > This process goes on and on.
> >
> > So I need to store RDD or output of RDD into some storage of previous
> > job to make it available to next job.
> >
> > What are my options.
> > 1. Use checkpoint
> > Can I use checkpoint on the final stage of RDD and then load the same RDD
> > again by specifying checkpoint path in next job. Is checkpoint right for
> > this kind of situation.
> >
> > 2. Save output of previous job into some json file and then create a data
> > frame of that in next job.
> > Have I got this right, is this option better than option 1.
> >
> > 3. I have heard a lot about Parquet files. However I don't know how it
> > integrates with spark.
> > Can I use that here as intermediate storage.
> > Is this available in spark 1.6?
> >
> > Any other thoughts or idea.
> >
> > Thanks
> > Sachin
> >
> >
> >
> >
>
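
For illustration, the save-and-re-read approach described in the thread could
look roughly like the sketch below. It is only a sketch under assumptions: it
uses the PySpark DataFrame API as it exists in Spark 1.6 (SQLContext,
read.parquet, write.parquet, unionAll), and the function names, the "run-N"
directory layout, and the per-line counting step are hypothetical stand-ins
for the real job.

```python
def run_output_path(base_dir, run_id):
    """Directory where the job with the given run id stores its result.

    The "run-N" layout is a hypothetical convention, not something Spark requires.
    """
    return "%s/run-%d" % (base_dir.rstrip("/"), run_id)


def run_job(sc, source_path, base_dir, run_id):
    """Process new input, merge it with the previous run's result, save as Parquet."""
    # Imported here so the path helper above can be used without Spark installed.
    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)

    # Hypothetical processing step: count occurrences of each input line.
    counts = (sc.textFile(source_path)
                .map(lambda line: (line, 1))
                .reduceByKey(lambda a, b: a + b))
    merged = sqlContext.createDataFrame(counts, ["key", "count"])

    if run_id > 0:
        # Re-read what the previous job wrote and merge the two results.
        previous = sqlContext.read.parquet(run_output_path(base_dir, run_id - 1))
        merged = (previous.unionAll(merged)
                          .groupBy("key")
                          .sum("count")
                          .withColumnRenamed("sum(count)", "count"))

    # Write to a fresh directory so the job never overwrites the path it reads.
    merged.write.parquet(run_output_path(base_dir, run_id))
```

Each spark-submit invocation would call run_job with the next run id, for
example passed in as a job parameter. Writing every run to its own directory
is a design choice made here to avoid reading from and overwriting the same
path within one job.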
