spark-user mailing list archives

From Kyle Ellrott <>
Subject Re: Incremental Updates to an RDD
Date Mon, 09 Dec 2013 18:46:36 GMT
I'd like to use Spark as an analytical stack; the only difference is that I
would like to find the best way to connect it to a dataset that I'm actively
working on. Perhaps 'updates to an RDD' is a bit of a loaded term: I
don't need the 'resilient' part, just a distributed data set.
Right now, the best way I can think of doing that is to keep the data
in a distributed system like HBase. When I want to do my analytics, I
use the HadoopInputFormat readers to transfer the data from HBase
into Spark and then run the analysis. Of course, that means paying the
overhead of serialization/deserialization and network transfer before I can
even start my calculations. If I already held the dataset in the Spark
processes, I could start calculations immediately.
So is there a 'better' way to manage a distributed data set, which would
then serve as an input to Spark RDDs?


On Fri, Dec 6, 2013 at 10:13 PM, Christopher Nguyen <> wrote:

> Kyle, the fundamental contract of a Spark RDD is that it is immutable.
> This follows the paradigm where data is (functionally) transformed into
> other data, rather than mutated. This allows these systems to make certain
> assumptions and guarantees that otherwise they wouldn't be able to.
> Now, we've been able to get mutative behavior with RDDs (for fun,
> almost), but that's implementation-dependent and may break at any time.
> It turns out that immutability is quite appropriate for the analytic stack,
> where you typically apply the same transform/operator to all data. You're
> finding that transactional systems are the exact opposite, where you
> typically apply a different operation to individual pieces of the data.
> Incidentally this is also the dichotomy between column- and row-based
> storage being optimal for each respective pattern.
> Spark is intended for the analytic stack. To use Spark as the persistence
> layer of a transaction system is going to be very awkward. I know there are
> some vendors who position their in-memory databases as good for both OLTP
> and OLAP use cases, but when you talk to them in depth they will readily
> admit that it's really optimal for one and not the other.
> If you want to make a project out of making a special Spark RDD that
> supports this behavior, it might be interesting. But there will be no
> simple shortcuts to get there from here.
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao <>
> On Fri, Dec 6, 2013 at 10:56 PM, Kyle Ellrott <> wrote:
>> I'm trying to figure out if I can use an RDD as the backend for an
>> interactive server. One of the requirements would be to have incremental
>> updates to elements in the RDD, i.e., transforms that change/add/delete a
>> single element in the RDD.
>> It seems pretty drastic to do a full RDD filter to remove a single
>> element, or to do the union of the RDD with another one of size 1 to add
>> an element. (Or is it?) Is there an efficient way to do this in Spark? Are
>> there any examples of this kind of usage?
>> Thank you,
>> Kyle
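
For readers following the thread, the filter-to-delete and union-to-add
pattern asked about above can be sketched without a cluster. This is only an
illustration: plain Python tuples stand in for an RDD's partitions, and the
helper names (rdd_filter, rdd_union, collect) are hypothetical analogues of
RDD.filter, RDD.union and RDD.collect, not Spark code.

```python
from typing import Callable, List, Tuple

# Assumption: an "RDD" is modeled as an immutable tuple of partitions,
# each partition being an immutable tuple of elements.
Partitions = Tuple[Tuple[int, ...], ...]


def rdd_filter(parts: Partitions, pred: Callable[[int], bool]) -> Partitions:
    """Analogue of RDD.filter: every element of every partition is visited."""
    return tuple(tuple(x for x in p if pred(x)) for p in parts)


def rdd_union(a: Partitions, b: Partitions) -> Partitions:
    """Analogue of RDD.union: the partition lists are concatenated."""
    return a + b


def collect(parts: Partitions) -> List[int]:
    """Analogue of RDD.collect: flatten all partitions into one list."""
    return [x for p in parts for x in p]


data: Partitions = ((1, 2, 3), (4, 5, 6))

# "Delete" element 4: a full pass over the data yields a new dataset.
without_4 = rdd_filter(data, lambda x: x != 4)

# "Add" element 7: union with a one-element dataset adds a new partition.
with_7 = rdd_union(without_4, ((7,),))

print(collect(data))    # the original dataset is unchanged
print(collect(with_7))
print(len(with_7))      # partition count grows with each single-element union
```

Two costs show up even in this toy version: removing one element touches
every element, and each single-element union adds another partition, so a
long series of small updates leaves the dataset increasingly fragmented.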
