spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From rekt...@voodoowarez.com
Subject Re: Create DStream consisting of HDFS and (then) Kafka data
Date Thu, 08 Jan 2015 05:19:13 GMT
I've started 1 or 2 emails to ask more broadly- what are good practices
for doing DStream computations in a non-realtime fashion? I'd love to have a
good intro article to pass around to people, and some resources for those
others chasing this problem.

Back when I was working with Storm, managing the flow of time by passing it as
a input and output field through individual components was an absolute necessity
for us- first, it was needed just to run the analytics with consistentcy, and
second I just thought we'd be mad to build a system that could only run at real
time speed with real-time data, so making time a first class piece of data
flowing through was an obvious move that fixed both.

Spark DStreams on the other hand have a much more discrete sense of time (see
what I did there?). I feel like there's pretty good coverage of the straight-
forward realtime use, but to really be interesting & deployable, getting
a better understanding for running DStream in non-realtime fashion (after the
fact replay at faster than realtime), understanding what DStream wants and
needs and some writeup of gotchas is really integral to making DStreams
compatible tackling the "lambda architecture" broadly & well.

2c. Thanks a dozen for the way more nuanced super interesting question Tobias!
Just writing in to say that getting even where Tobias is- dstream processing bulk
HDFS data- is something I don't feel is super well socialized yet, & fingers
crossed that base gets built up a little more.

-rektide


On Thu, Jan 08, 2015 at 01:53:21PM +0900, Tobias Pfeiffer wrote:
> Hi,
> 
> I have a setup where data from an external stream is piped into Kafka and
> also written to HDFS periodically for long-term storage. Now I am trying to
> build an application that will first process the HDFS files and then switch
> to Kafka, continuing with the first item that was not yet in HDFS. (The
> items have an increasing timestamp that I can use to find the "first item
> not yet processed".) I am wondering what an elegant method to provide a
> unified view of the data would be.
> 
> Currently, I am using two StreamingContexts one after another:
>  - start one StreamingContext A and process all data found in HDFS
> (updating the largest seen timestamp in an accumulator), stopping when
> there was an RDD with 0 items in it,
>  - stop that StreamingContext A,
>  - start a new StreamingContext B and process the Kafka stream (filtering
> out all items with a timestamp smaller than the value in the accumulator),
>  - stop when the user requests it.
> 
> This *works* as it is now, but I was planning to add sliding windows (based
> on item counts or the timestamps in the data), which will get unmanageably
> complicated when I have a window spanning data in both HDFS and Kafka, I
> guess. Therefore I would like to have a single DStream that is fed with
> first HDFS and then Kafka data. Does anyone have a suggestion on how to
> realize that (with as few copying of private[spark] classes as possible)? I
> guess one issue is that the Kafka processing requires a receiver and
> therefore needs to be treated quite a bit differently than HDFS?
> 
> Thanks
> Tobias

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Mime
View raw message