spark-user mailing list archives

From: unorthodox.engine...@gmail.com
Subject: Re: Merging all Spark Streaming RDDs to one RDD
Date: Thu, 12 Jun 2014 18:43:29 GMT
I have much the same issue.

While I haven't totally solved it yet, I've found the "window" method useful for batching
up archive blocks. But updateStateByKey is probably what we want to use, perhaps multiple
times, if that works.
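
For what it's worth, here's a rough sketch of the window approach as I understand it. The
socket source, durations, and app name are just placeholders, not anyone's actual setup:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("WindowedBatching")
    val ssc = new StreamingContext(conf, Seconds(10))

    // placeholder source; any DStream behaves the same way
    val lines = ssc.socketTextStream("localhost", 9999)

    // each windowed RDD is the union of the last 60s of batch RDDs,
    // emitted once every 20s
    val batched = lines.window(Seconds(60), Seconds(20))

    batched.foreachRDD { rdd =>
      // one merged block per slide interval; archive or process it here
    }

    ssc.start()
    ssc.awaitTermination()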

My bigger worry now is storage. Unlike non-streaming apps, we tend to build up state that
cannot be regenerated if lost, and plain Hadoop files don't seem to be the best solution.
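
Spark Streaming's own answer for state that can't be rebuilt seems to be checkpointing to
a fault-tolerant store. A minimal sketch (the directory is just a placeholder):

    // the checkpoint dir must be fault-tolerant storage (HDFS and the like);
    // this path is a placeholder
    ssc.checkpoint("hdfs:///path/to/checkpoints")

    // with this set, stateful ops like updateStateByKey periodically persist
    // their state RDDs instead of recomputing the whole lineage after a failure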

Jeremy Lee   BCompSci (Hons)
The Unorthodox Engineers

> On 10 Jun 2014, at 11:00 am, Henggang Cui <cuihenggang@gmail.com> wrote:
> 
> Hi,
> 
> I'm wondering whether it's possible to continuously merge the RDDs coming from a stream
> into a single RDD efficiently.
> 
> One thought is to use the union() method. But using union, I will get a new RDD each
> time I do a merge. I don't know how I should name these RDDs, because I remember Spark
> does not encourage users to create an array of RDDs.
> 
> Another possible solution is to follow the example of "StatefulNetworkWordCount", which
> uses the updateStateByKey() method. But my RDD type is not key-value pairs (it's a struct
> with multiple fields). Is there a workaround?
> 
> Thanks,
> Cui
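
On the union() question: one way to avoid naming an ever-growing collection of RDDs is to
fold each batch into a single running RDD held in a var. A rough sketch, where Record and
records are hypothetical stand-ins for your struct and your DStream:

    import org.apache.spark.rdd.RDD

    case class Record(id: String, value: Double)  // stand-in for your struct

    // records: DStream[Record], built however you ingest it;
    // ssc is the StreamingContext from earlier
    var merged: RDD[Record] = ssc.sparkContext.emptyRDD[Record]

    records.foreachRDD { rdd =>
      merged = merged.union(rdd).cache()
      // the lineage grows with every union; checkpoint merged every
      // so many batches to truncate it, or this gets slow over time
    }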
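And on the struct question: updateStateByKey only needs the stream to be key-value pairs,
so you can key it by whichever field identifies a record and keep the whole struct as the
value. A sketch, using the same hypothetical Record type as above:

    import org.apache.spark.streaming.StreamingContext._  // pair-DStream ops like updateStateByKey

    // key by an identifying field; the struct itself stays as the value
    val keyed = records.map(r => (r.id, r))

    // accumulate every record seen per key; swap the Seq for whatever
    // merged state you actually want (requires ssc.checkpoint to be set)
    val state = keyed.updateStateByKey[Seq[Record]] {
      (fresh: Seq[Record], old: Option[Seq[Record]]) =>
        Some(old.getOrElse(Seq.empty) ++ fresh)
    }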
