spark-user mailing list archives

From: Christopher Nguyen
Subject: Re: Incremental Updates to an RDD
Date: Mon, 09 Dec 2013 19:03:02 GMT
Kyle, many of your design goals are ones we also want. Indeed it's
interesting you separate "resilient" from RDD, as I've suggested there
should be ways to boost performance if you're willing to give up some or
all of the "R" guarantees.

We haven't started looking into this yet due to other priorities. If
someone with similar design goals wants to get started that'd be great.

To be sure, a semi-shortcut to what you want may be found by looking at
Tachyon. It's fairly early days for Tachyon so I don't know what its actual
behavior would be under transactional loads.

On Dec 9, 2013 10:47 AM, "Kyle Ellrott" <> wrote:

> I'd like to use Spark as an analytical stack; the only difference is that
> I would like to find the best way to connect it to a dataset that I'm actively
> working on. Perhaps saying 'updates to an RDD' is a bit of a loaded term; I
> don't need the 'resilient', just a distributed data set.
> Right now, the best way I can think of doing that is working with the data
> in a distributed system, like HBase, then when I want to do my analytics, I
> use the HadoopInputFormat readers to transfer the data from the HBase
> system to Spark and then do my analytics. Of course, then I have the
> overhead of serialization/deserialization and network transfer before I can
> even start my calculations. If I already held the dataset in the Spark
> processes, then I could start calculations immediately.
> So is there a 'better' way to manage a distributed data set, which
> would then serve as an input to Spark RDDs?
> Kyle
> On Fri, Dec 6, 2013 at 10:13 PM, Christopher Nguyen <> wrote:
>> Kyle, the fundamental contract of a Spark RDD is that it is immutable.
>> This follows the paradigm where data is (functionally) transformed into
>> other data, rather than mutated. This allows these systems to make certain
>> assumptions and guarantees that otherwise they wouldn't be able to.
>> Now we've been able to get mutative behavior with RDDs---for fun,
>> almost---but that's implementation dependent and may break at any time.
>> It turns out the immutable, transform-based model is quite appropriate for
>> the analytic stack, where you typically apply the same transform/operator
>> to all data. You're
>> finding that transactional systems are the exact opposite, where you
>> typically apply a different operation to individual pieces of the data.
>> Incidentally this is also the dichotomy between column- and row-based
>> storage being optimal for each respective pattern.
>> Spark is intended for the analytic stack. To use Spark as the persistence
>> layer of a transaction system is going to be very awkward. I know there are
>> some vendors who position their in-memory databases as good for both OLTP
>> and OLAP use cases, but when you talk to them in depth they will readily
>> admit that it's really optimal for one and not the other.
>> If you want to make a project out of making a special Spark RDD that
>> supports this behavior, it might be interesting. But there will be no
>> simple shortcuts to get there from here.
>> --
>> Christopher T. Nguyen
>> Co-founder & CEO, Adatao <>
>> On Fri, Dec 6, 2013 at 10:56 PM, Kyle Ellrott <> wrote:
>>> I'm trying to figure out if I can use an RDD to backend an interactive
>>> server. One of the requirements would be to have incremental updates to
>>> elements in the RDD, i.e., transforms that change/add/delete a single element
>>> in the RDD.
>>> It seems pretty drastic to do a full RDD filter to remove a single
>>> element, or do the union of the RDD with another one of size 1 to add an
>>> element. (Or is it?) Is there an efficient way to do this in Spark? Are
>>> there any examples of this kind of usage?
>>> Thank you,
>>> Kyle
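
The filter-and-union approach Kyle asks about in the original question can be sketched roughly as follows. This is a toy model in plain Python, not Spark code: `ToyDataset`, its `partitions` field, and its methods are hypothetical stand-ins invented for illustration. They only mimic the idea of an immutable, partitioned dataset where every "update" produces a new dataset.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ToyDataset:
    """Hypothetical stand-in for an immutable, partitioned dataset.

    This is NOT Spark's RDD class; it only illustrates the cost pattern.
    """
    partitions: Tuple[Tuple[int, ...], ...]

    def filter(self, keep: Callable[[int], bool]) -> "ToyDataset":
        # Every partition is scanned, even when only one element is dropped.
        return ToyDataset(
            tuple(tuple(x for x in part if keep(x)) for part in self.partitions)
        )

    def union(self, other: "ToyDataset") -> "ToyDataset":
        # Concatenates the partition lists without rescanning existing elements.
        return ToyDataset(self.partitions + other.partitions)

    def collect(self) -> List[int]:
        # Flatten all partitions into a single list, in partition order.
        return [x for part in self.partitions for x in part]


original = ToyDataset(((1, 2, 3), (4, 5, 6)))

# "Delete" element 4: a full pass over the data that yields a new dataset.
without_four = original.filter(lambda x: x != 4)

# "Add" element 7: union with a dataset of size 1, which adds a partition.
with_seven = without_four.union(ToyDataset(((7,),)))
```

In this sketch the original dataset is never modified. Removing one element still visits every partition, and each single-element union leaves behind one more small partition, which is the overhead the question is concerned about. Spark's own `filter` and `union` transformations likewise return new RDDs rather than changing the one they are called on.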
