spark-user mailing list archives

From RK Aduri <rkad...@collectivei.com>
Subject Re: Are RDD's ever persisted to disk?
Date Tue, 23 Aug 2016 23:37:59 GMT
Could you share your complete analysis, i.e. a snapshot of what you think the code is doing?
Maybe that would help us understand what exactly you are trying to convey.


> On Aug 23, 2016, at 4:21 PM, kant kodali <kanth909@gmail.com> wrote:
> 
> 
> https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L229
> 
> 
> 
> 
> On Tue, Aug 23, 2016 4:17 PM, kant kodali <kanth909@gmail.com> wrote:
> @RK you may want to look more deeply if you are curious. The code starts from here:
> 
> 
> https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L254
> 
> 
> and it goes here, where it is trying to save the Python code object (which is bytecode):
> 
> 
> https://github.com/apache/spark/blob/master/python/pyspark/cloudpickle.py#L241
> 
> 
> 
> 
> On Tue, Aug 23, 2016 2:39 PM, RK Aduri <rkaduri@collectivei.com> wrote:
> I just had a glance. AFAIK, that has nothing to do with RDDs. It's a pickler used to
serialize and deserialize the Python code.
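To make that concrete: the idea behind cloudpickle can be shown with the standard library alone. This is a minimal sketch with no Spark or cloudpickle involved (`add_one` is just an illustrative function): a function's logic lives in its `__code__` object, and that bytecode can be serialized to bytes and restored.

```python
import marshal
import types

# A function's behavior is carried by its code object (compiled bytecode).
def add_one(x):
    return x + 1

blob = marshal.dumps(add_one.__code__)     # bytes representing the bytecode
restored_code = marshal.loads(blob)
restored = types.FunctionType(restored_code, globals(), "add_one")

assert restored(41) == 42                  # the function round-trips
assert isinstance(blob, bytes)             # bytes of *code*, not of data
```

This is serializing the computation itself, which is distinct from serializing an RDD's computed data.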
> 
>> On Aug 23, 2016, at 2:23 PM, kant kodali <kanth909@gmail.com> wrote:
>> 
>> @Sean 
>> 
>> well this makes sense but I wonder what the following source code is doing?
>> 
>> 
>> https://github.com/apache/spark/blob/master/python/pyspark/cloudpickle.py#L241
>> 
>> 
>> This code looks like it is trying to store some bytecode somewhere (whether in memory
>> or on disk). But why even go down this path of creating code objects so they can be
>> executed later, when what we are trying to persist is the result of computing the RDD?
>> 
>> 
>> 
>> 
>> 
>> On Tue, Aug 23, 2016 1:42 PM, Sean Owen <sowen@cloudera.com> wrote:
>> We're probably mixing up some semantics here. An RDD is indeed, really, just some
>> bookkeeping that records how a certain result is computed. It is not the data itself.
>> 
>> However, we often talk about "persisting an RDD", which means "persisting the result
>> of computing the RDD", in which case that persisted representation can be used instead
>> of recomputing it.
>> 
>> The result of computing an RDD is really some objects in memory. It's possible to
>> persist the RDD in memory by just storing these objects in memory as cached
>> partitions. This involves no serialization.
>> 
>> Data can be persisted to disk, but this involves serializing objects to bytes (not
>> bytecode). It's also possible to store a serialized representation in memory because
>> it may be more compact.
>> 
>> This is not the same as saving/writing an RDD to persistent storage as text or JSON or
>> whatever.
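Sean's distinction, the RDD as bookkeeping versus the computed result, can be sketched as a toy model. This is purely illustrative stdlib Python, not Spark's API; the `ToyRDD` class and its methods are invented for the example:

```python
import pickle

class ToyRDD:
    """Toy model of an RDD: a recipe (lineage), not the data itself."""
    def __init__(self, source, transform):
        self.source = source        # how to re-obtain the input
        self.transform = transform  # what to do with each element
        self._cache = None          # filled only after persist()

    def compute(self):
        return [self.transform(x) for x in self.source()]

    def persist(self):
        # "Persisting the RDD" = persisting the *result* of computing it.
        self._cache = self.compute()
        return self

    def collect(self):
        return self._cache if self._cache is not None else self.compute()

rdd = ToyRDD(lambda: range(3), lambda x: x * 10).persist()
assert rdd.collect() == [0, 10, 20]

# Persisting to disk serializes the computed *objects* to bytes -- not bytecode.
blob = pickle.dumps(rdd.collect())
assert pickle.loads(blob) == [0, 10, 20]
```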
>> 
>> On Tue, Aug 23, 2016 at 9:28 PM, kant kodali <kanth909@gmail.com> wrote:
>> 
>> > @srikanth are you sure? The whole point of RDDs is to store transformations but not
>> > the data, as the Spark paper points out, though I do lack the practical experience
>> > to confirm. When I looked at the Spark source code (specifically the checkpoint
>> > code) a while ago, it was clearly storing some JVM byte code to disk, which I
>> > thought were the transformations.
>> 
>> > On Tue, Aug 23, 2016 1:11 PM, srikanth.jella@gmail.com wrote:
>> >
>> >> An RDD contains data, not JVM byte code, i.e. data which has been read from the
>> >> source and had transformations applied. This is the ideal case to persist RDDs. As
>> >> Nirav mentioned, this data will be serialized before persisting to disk.
>> >>
>> >> Thanks,
>> >> Sreekanth Jella
>> 
>> >> From: kant kodali
>> >> Sent: Tuesday, August 23, 2016 3:59 PM
>> >> To: Nirav
>> >> Cc: RK Aduri; srikanth.jella@gmail.com; user@spark.apache.org
>> >> Subject: Re: Are RDD's ever persisted to disk?
>> 
>> >> Storing an RDD to disk is nothing but storing JVM byte code to disk (in the case of
>> >> Java or Scala). Am I correct?
>> 
>> >> On Tue, Aug 23, 2016 12:55 PM, Nirav <niravcp@gmail.com> wrote:
>> >>
>> >> You can store it either in serialized form (byte array) or in a string format like
>> >> TSV or CSV. There are different RDD save APIs for that.
>> >>
>> >> Sent from my iPhone
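The two options Nirav describes, an opaque serialized byte array versus a readable text format, can be illustrated with stdlib Python alone; the variable names are made up for the example:

```python
import csv
import io
import pickle

rows = [(1, "a"), (2, "b")]

# Serialized form: opaque bytes, analogous to what persist-to-disk writes.
blob = pickle.dumps(rows)
assert isinstance(blob, bytes)
assert pickle.loads(blob) == rows

# Text format: human-readable, analogous to what saveAsTextFile-style APIs produce.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
assert buf.getvalue().splitlines() == ["1,a", "2,b"]
```

The serialized form round-trips the original objects exactly; the text form trades that fidelity for readability and interoperability.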
>> 
>> >> On Aug 23, 2016, at 12:26 PM, kant kodali <kanth909@gmail.com> wrote:
>> >>
>> >> OK, now that I understand an RDD can be stored to disk, my last question on this
>> >> topic would be this:
>> >>
>> >> Storing an RDD to disk is nothing but storing JVM byte code to disk (in the case of
>> >> Java or Scala). Am I correct?
>> 
>> >> On Tue, Aug 23, 2016 12:19 PM, RK Aduri <rkaduri@collectivei.com> wrote:
>> >>
>> >> On another note, if you have a streaming app, you checkpoint the RDDs so that they
>> >> can be accessed in case of a failure. And yes, RDDs are persisted to DISK. You can
>> >> access Spark's UI and see them listed under the Storage tab.
>> >>
>> >> If RDDs are persisted in memory, you avoid any disk I/O, so lookups will be cheap.
>> >> RDDs are reconstructed based on a graph (the DAG, available in the Spark UI).
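The spirit of checkpointing, writing a materialized result to stable storage so it can be re-read instead of recomputed after a failure, can be sketched with plain Python. This is not Spark's checkpoint implementation, just an analogy with illustrative names:

```python
import os
import pickle
import tempfile

data = [x * x for x in range(4)]        # stand-in for a computed RDD partition

# "Checkpointing", in spirit: write the materialized objects to stable storage.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    pickle.dump(data, f)

# After a failure, the result is re-read from disk instead of being recomputed
# by replaying the whole lineage.
with open(path, "rb") as f:
    restored = pickle.load(f)
os.remove(path)

assert restored == [0, 1, 4, 9]
```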
>> 
>> >> On Aug 23, 2016, at 12:10 PM, <srikanth.jella@gmail.com> wrote:
>> >>
>> >> RAM or virtual memory is finite, so data size needs to be considered before
>> >> persisting. Please see the documentation below on when to choose the persistence
>> >> level.
>> >>
>> >> http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
>> >>
>> >> Thanks,
>> >> Sreekanth Jella
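One reason the linked guide offers serialized storage levels: a serialized blob is usually much more compact than the same data held as live objects. A rough stdlib illustration follows; the size accounting here is a crude approximation for demonstration, not how Spark measures memory:

```python
import pickle
import sys

data = list(range(1000))

# Approximate footprint of the data as live Python objects:
# the list's own buffer plus the per-integer object overhead.
object_overhead = sys.getsizeof(data) + sum(sys.getsizeof(x) for x in data)

# Footprint of the same data as one packed byte blob.
serialized = len(pickle.dumps(data))

# The serialized form is considerably smaller, which is the motivation
# for serialized in-memory storage levels.
assert serialized < object_overhead
```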
>> 
>> >> From: kant kodali
>> >> Sent: Tuesday, August 23, 2016 2:42 PM
>> >> To: srikanth.jella@gmail.com
>> >> Cc: user@spark.apache.org
>> >> Subject: Re: Are RDD's ever persisted to disk?
>> 
>> >> So when do we ever need to persist an RDD on disk, given that we don't need to
>> >> worry about RAM (memory), as virtual memory will just push pages to disk when
>> >> memory becomes scarce?
>> 
>> >> On Tue, Aug 23, 2016 11:23 AM, srikanth.jella@gmail.com wrote:
>> >>
>> >> Hi Kant Kodali,
>> >>
>> >> Based on the input parameter to the persist() method, the RDD will either be cached
>> >> in memory or persisted to disk. In case of failures, Spark will reconstruct the RDD
>> >> on a different executor based on the DAG. That is how failures are handled. Spark
>> >> Core does not replicate RDDs, as they can be reconstructed from the source (let's
>> >> say HDFS, Hive or S3, etc.) but not from memory (which is lost already).
>> >>
>> >> Thanks,
>> >> Sreekanth Jella
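The recovery behavior Sreekanth describes, re-reading from the source and replaying the lineage rather than replicating memory, can be mimicked in a few lines of plain Python. The names here are illustrative, not Spark API:

```python
# Count how many times the source is actually read.
calls = {"n": 0}

def read_source():
    calls["n"] += 1
    return [1, 2, 3]          # stand-in for HDFS/Hive/S3 input

# The "lineage": a recipe for producing the result from the source.
lineage = lambda: [x * 2 for x in read_source()]

cache = lineage()             # first computation, held in memory
assert cache == [2, 4, 6]

cache = None                  # executor dies: the in-memory copy is lost
recovered = lineage()         # replay the lineage on another executor
assert recovered == [2, 4, 6]
assert calls["n"] == 2        # the source was read again, not the lost memory
```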
>> 
>> >> From: kant kodali
>> >> Sent: Tuesday, August 23, 2016 2:12 PM
>> >> To: user@spark.apache.org
>> >> Subject: Are RDD's ever persisted to disk?
>> 
>> >> I am new to Spark, and I keep hearing that RDDs can be persisted to memory or disk
>> >> after each checkpoint. I wonder why RDDs are persisted in memory: in case of a node
>> >> failure, how would you access that memory to reconstruct the RDD? Persisting to
>> >> disk makes sense, because it is like persisting to a network file system (in the
>> >> case of HDFS) where each block has multiple copies across nodes, so if a node goes
>> >> down the RDD can still be reconstructed by reading the required blocks from other
>> >> nodes and recomputing. But my biggest question is: are RDDs ever persisted to disk?
>> 
>> >> Collective[i] dramatically improves sales and marketing performance using
>> >> technology, applications and a revolutionary network designed to provide next
>> >> generation analytics and decision-support directly to business users. Our goal is
>> >> to maximize human potential and minimize mistakes. In most cases, the results are
>> >> astounding. We cannot, however, stop emails from sometimes being sent to the wrong
>> >> person. If you are not the intended recipient, please notify us by replying to this
>> >> email's sender and deleting it (and any attachments) permanently from your system.
>> >> If you are, please respect the confidentiality of this communication's contents.
>> 
