spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: How can we connect RDD from previous job to next job
Date Mon, 29 Aug 2016 08:43:12 GMT
If you mean to persist data in an RDD, then you should do just that --
persist the RDD to durable storage so it can be read later by any
other app. Checkpointing is not a way to store RDDs, but a specific
way to recover the same application in some cases. Parquet has been
supported for a long while, yes. It's the most common binary format.
You could also literally store the serialized form of your objects.
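
The save-then-re-read pattern described above can be sketched without Spark at all. What follows is only an illustration under stated assumptions: the helper names `load_previous` and `run_job`, the JSON result file, and the counting step standing in for "some processing" are all hypothetical and do not come from this thread. In a real Spark job the load and save steps would read and write the dataset at a durable location, for example a Parquet directory on HDFS.

```python
import json
import os
import tempfile
from collections import Counter


def load_previous(path):
    """Return the previous run's stored result, or an empty Counter on the first run."""
    if not os.path.exists(path):
        return Counter()
    with open(path) as f:
        return Counter(json.load(f))


def run_job(new_records, result_path):
    """One 'job': process new input, merge it with the stored result, store it again."""
    current = Counter(new_records)  # hypothetical stand-in for the real processing
    merged = load_previous(result_path) + current

    # Write to a temporary file first, then swap it in, so the file that was
    # just read is never overwritten while it is still the only copy.
    tmp_path = result_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(merged, f)
    os.replace(tmp_path, result_path)
    return merged


# Two independent "submissions" that share nothing except the stored result.
result_path = os.path.join(tempfile.mkdtemp(), "result.json")
first = run_job(["a", "b", "a"], result_path)
second = run_job(["b", "c"], result_path)
```

Each call to `run_job` plays the role of a separate submission: nothing stays alive between runs, and the only link between them is the file at `result_path`. Writing to a temporary path before replacing the old result is a design choice of this sketch, not something the thread prescribes.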

On Mon, Aug 29, 2016 at 9:27 AM, Sachin Mittal <sjmittal@gmail.com> wrote:
> I understood the approach.
> Does Spark 1.6 support the Parquet format, I mean saving to and loading
> from a Parquet file?
>
> Also, if I use checkpoint, what I understand is that the RDD's location on
> the filesystem is not removed when the job is over, so I can read that RDD
> in the next job.
> Is that one of the use cases of checkpoint? Basically, can my current
> problem be solved using checkpoint?
>
> Also, which option would be better: store the output of the RDD to
> persistent storage, or store the new RDD of that output itself using
> checkpoint?
>
> Thanks
> Sachin
>
>
>
>
> On Mon, Aug 29, 2016 at 1:39 PM, Sean Owen <sowen@cloudera.com> wrote:
>>
>> You just save the data in the RDD in whatever form you want to
>> whatever persistent storage you want, and then re-read it from another
>> job. This could be Parquet format on HDFS for example. Parquet is just
>> a common file format. There is no need to keep the job running just to
>> keep an RDD alive.
>>
>> On Mon, Aug 29, 2016 at 5:30 AM, Sachin Mittal <sjmittal@gmail.com> wrote:
>> > Hi,
>> > I need some thoughts, input, or a starting point for achieving the
>> > following scenario.
>> > I submit a job using spark-submit with a certain set of parameters.
>> >
>> > It reads data from a source, does some processing on RDDs, generates
>> > some output, and completes.
>> >
>> > Then I submit the same job again with the next set of parameters.
>> > It should again read data from a source and do the same processing, and
>> > at the same time read the result generated by the previous job, merge
>> > the two, and store the results again.
>> >
>> > This process goes on and on.
>> >
>> > So I need to store the RDD, or the output of the RDD, from the previous
>> > job in some storage to make it available to the next job.
>> >
>> > What are my options?
>> > 1. Use checkpoint
>> > Can I checkpoint the final-stage RDD and then load the same RDD again
>> > by specifying the checkpoint path in the next job? Is checkpoint right
>> > for this kind of situation?
>> >
>> > 2. Save the output of the previous job to a JSON file and then create a
>> > DataFrame from it in the next job.
>> > Have I got this right? Is this option better than option 1?
>> >
>> > 3. I have heard a lot about Parquet files. However, I don't know how
>> > they integrate with Spark.
>> > Can I use Parquet here as intermediate storage?
>> > Is it available in Spark 1.6?
>> >
>> > Any other thoughts or ideas?
>> >
>> > Thanks
>> > Sachin
>> >
>> >
>> >
>> >
>
>
