spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Christopher Nguyen <...@adatao.com>
Subject Re: Incremental Updates to an RDD
Date Tue, 10 Dec 2013 20:42:49 GMT
Wes, it depends on what you mean by "sliding window" as related to "RDD":

   1. Some operation over multiple rows of data within a single, large RDD,
   for which the operations are required to be temporally sequential. This may
   be the case where you're computing a running average over historical
   time-based data.
   2. Some operation over multiple rows of data within a single, large RDD,
   for which the operations may be run in parallel, even out of order. This
   may be the case where your RDD represents a two-dimensional geospatial map
   and you're computing something (e.g., population average) over a grid.
   3. Some operation on data streaming in, over a fixed-size window, and
   you would like the representation of that windowed data to be an RDD.

For #1 and #2, there's only one "static" RDD and the task is largely
bookkeeping: tracking which window you're working on when, and dealing with
partition boundaries (*mapPartitions* or *mapPartitionsWithIndex *would be
a useful interface here as it allows you to see multiple rows at a time, as
well as know what partition # you're working with at any given time).

For #3, that's what Spark Streaming does, and it does so by introducing a
higher-level concept of a DStream, which is a sequence of RDDs, where each
RDD is one data sample. Given that it is a collection of RDDs, the
windowing management task simply involves maintaining what RDDs are
contained that sequence.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen



On Tue, Dec 10, 2013 at 12:01 PM, Wes Mitchell <wesm@dancingtrout.net>wrote:

> So, does that mean that if I want to do a sliding window, then I have to,
> in some fashion,
> build a stream from the RDD, push a new value on the head, filter out the
> oldest value, and
> re-persist as an RDD?
>
>
>
>
> On Fri, Dec 6, 2013 at 10:13 PM, Christopher Nguyen <ctn@adatao.com>wrote:
>
>> Kyle, the fundamental contract of a Spark RDD is that it is immutable.
>> This follows the paradigm where data is (functionally) transformed into
>> other data, rather than mutated. This allows these systems to make certain
>> assumptions and guarantees that otherwise they wouldn't be able to.
>>
>> Now we've been able to get mutative behavior with RDDs---for fun,
>> almost---but that's implementation dependent and may break at any time.
>>
>> It turns out this behavior is quite appropriate for the analytic stack,
>> where you typically apply the same transform/operator to all data. You're
>> finding that transactional systems are the exact opposite, where you
>> typically apply a different operation to individual pieces of the data.
>> Incidentally this is also the dichotomy between column- and row-based
>> storage being optimal for each respective pattern.
>>
>> Spark is intended for the analytic stack. To use Spark as the persistence
>> layer of a transaction system is going to be very awkward. I know there are
>> some vendors who position their in-memory databases as good for both OLTP
>> and OLAP use cases, but when you talk to them in depth they will readily
>> admit that it's really optimal for one and not the other.
>>
>> If you want to make a project out of making a special Spark RDD that
>> supports this behavior, it might be interesting. But there will be no
>> simple shortcuts to get there from here.
>>
>> --
>> Christopher T. Nguyen
>> Co-founder & CEO, Adatao <http://adatao.com>
>> linkedin.com/in/ctnguyen
>>
>>
>>
>> On Fri, Dec 6, 2013 at 10:56 PM, Kyle Ellrott <kellrott@soe.ucsc.edu>wrote:
>>
>>> I'm trying to figure out if I can use an RDD to backend an interactive
>>> server. One of the requirements would be to have incremental updates to
>>> elements in the RDD, ie transforms that change/add/delete a single element
>>> in the RDD.
>>> It seems pretty drastic to do a full RDD filter to remove a single
>>> element, or do the union of the RDD with another one of size 1 to add an
>>> element. (Or is it?) Is there an efficient way to do this in Spark? Are
>>> there any example of this kind of usage?
>>>
>>> Thank you,
>>> Kyle
>>>
>>
>>
>

Mime
View raw message