spark-user mailing list archives

From Mark Hamstra <m...@clearstorydata.com>
Subject Re: Consistency between RDD's and Native File System
Date Fri, 17 Jan 2014 05:09:43 GMT
I don't agree entirely, Christopher.  Without persisting or checkpointing
RDDs, re-evaluation of the lineage will pick up source changes.  I'm not
saying that working this way is a good idea (in fact, it's generally not),
but you can do things like this:

1) Create file silliness.txt containing:

one line
two line
red line
blue line

2) Fire up spark-shell and do this:

scala> val lines = sc.textFile("silliness.txt")
scala> println(lines.collect.mkString(", "))
.
.
.
one line, two line, red line, blue line

3) Edit silliness.txt so that it is now:

and now
for something
completely
different

4) Continue on with spark-shell:

scala> println(lines.collect.mkString(", "))
.
.
.
and now, for something, completely, different
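
By contrast, if you persist the RDD and force it to be materialized before the
file is edited, later actions are served from the cached partitions instead of
re-evaluating the lineage. A minimal sketch of that variant (assuming the cached
partitions fit in memory and are not evicted):

scala> val cached = sc.textFile("silliness.txt").cache()
scala> cached.count()                        // first action materializes and caches the partitions
scala> // ...now edit silliness.txt on disk...
scala> println(cached.collect.mkString(", "))
one line, two line, red line, blue line

If a cached partition does get evicted, recomputation falls back to the (now
modified) file, so this still isn't a consistency guarantee.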


On Thu, Jan 16, 2014 at 7:53 PM, Christopher Nguyen <ctn@adatao.com> wrote:

> Sai, from your question, I infer that you are interpreting RDDs as somehow
> an in-memory/cached copy of the underlying data source, and so expecting
> some synchronization model between the two.
>
> That is not how the RDD model works. RDDs are first-class, stand-alone
> (distributed, immutable) datasets. Once created, an RDD exists on its own
> and isn't expected to somehow automatically realize that some underlying
> source has changed. (There is the concept of lineage or provenance for
> recomputation of RDDs, but that's orthogonal to this interpretation so I
> won't muddy the issue here).
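>
> To make that concrete: a transformation never modifies an existing RDD; it
> returns a new one, and the original still refers to the same immutable
> dataset. A rough sketch (the file name here is made up for illustration):
>
> scala> val lines  = sc.textFile("results.txt")          // RDD[String]
> scala> val passed = lines.filter(_.endsWith("Pass"))    // a new RDD; `lines` itself is unchanged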
>
> If you're looking for a mutable data table model, we will soon be
> releasing to open source something called Distributed DataFrame (DDF, based
> on R's data.frame) on top of RDDs that allows you to, among other useful
> things, load a dataset, perform transformations on it, and save it back,
> all the while holding on to a single DDF reference.
>
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao <http://adatao.com>
> linkedin.com/in/ctnguyen
>
>
>
> On Thu, Jan 16, 2014 at 7:33 PM, Sai Prasanna <ansaiprasanna@gmail.com> wrote:
>
>> Thanks Patrick, but I think I didn't put my question clearly...
>>
>> The question is: say in the native file system or HDFS, I have data
>> describing students who passed, failed, or for whom results are withheld
>> for some reason.
>> *Time T1:*
>> x - Pass
>> y - Fail
>> z - With-held.
>>
>> *Time T2:*
>> So I create an RDD1 reflecting this data and run a query to find how many
>> candidates have passed.
>> RESULT = 1. RDD1 is cached, or stored in the file system, depending on the
>> availability of space.
>>
>> *Time T3:*
>> In the native file system, the results for z are now out and z is declared
>> passed, so HDFS will need to be modified:
>> x - Pass
>> y - Fail
>> z - Pass.
>> *Time T4:*
>> Say now I take the RDD1 that is in the file system, or its cached copy, and
>> run the same query. I get RESULT = 1, but ideally RESULT should be 2.
>>
>> So I was asking: is there a way Spark hints that RDD1 is no longer
>> consistent with the file system, or is it up to the programmer to recreate
>> RDD1 if the block from which the RDD was created was changed at a later
>> point in time?
>> [T1 < T2 < T3 < T4]
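>>
>> In spark-shell terms, the scenario looks roughly like this (the file name
>> and record format here are made up for illustration):
>>
>> scala> val results = sc.textFile("results.txt").cache()      // T2: "x,Pass", "y,Fail", "z,With-held"
>> scala> results.filter(_.endsWith("Pass")).count()            // RESULT = 1
>> scala> // T3: the file is updated on HDFS so that z is now "z,Pass"
>> scala> results.filter(_.endsWith("Pass")).count()            // T4: still 1 if served from cache, ideally 2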
>>
>> Thanks in advance...
>>
>>
>> On Fri, Jan 17, 2014 at 1:42 AM, Patrick Wendell <pwendell@gmail.com> wrote:
>>
>>> RDDs are immutable, so there isn't really such a thing as modifying a
>>> block in-place inside of an RDD. As a result, this particular
>>> consistency issue doesn't come up in Spark.
>>>
>>> - Patrick
>>>
>>> On Thu, Jan 16, 2014 at 1:42 AM, SaiPrasanna <sai.annamalai@siemens.com>
>>> wrote:
>>> > Hello, I am a novice to Spark.
>>> >
>>> > Say that we have created an RDD1 from the native file system/HDFS and
>>> > done some transformations and actions that resulted in an RDD2. Let's
>>> > assume RDD1 and RDD2 are persisted and cached in memory. If the block
>>> > from which RDD1 was created was modified at time T1, and RDD1/RDD2 is
>>> > accessed later at T2 > T1, is there a way Spark ensures consistency, or
>>> > is it up to the programmer to make it explicit?
>>> >
>>> >
>>> >
>>> > --
>>> > View this message in context:
>>> > http://apache-spark-user-list.1001560.n3.nabble.com/Consistency-between-RDD-s-and-Native-File-System-tp583.html
>>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>
>>
>>
>> --
>> *Sai Prasanna. AN*
>> *II M.Tech (CS), SSSIHL*
>>
>>
>> * Entire water in the ocean can never sink a ship, Unless it gets inside.
>> All the pressures of life can never hurt you, Unless you let them in.*
>>
>
>
