spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From john levingston <john2levings...@gmail.com>
Subject Streaming aggregation
Date Tue, 24 Jun 2014 15:06:43 GMT
I have a use case where I cannot figure out the spark streaming way to do
it.

Given two kafka topics corresponding to two different types of events A and
B.  For each element  from topic A correspond an element from topic B.
Unfortunately elements can arrive separately by hours.


The aggregation operation is not deterministic so I do not have a common
key but instead a list of rules which will select the best candidates
already arrived (most commonly one or two elements). The other candidates
should not be discarded as they have also a corresponding elements. Once
the two elements are aggregated they will be published to a kafka topic.

Is the spark streaming way to do it, a group by key over all the RDD and
then applying our aggregation function?

How would I remove the aggregate elements with a tag filtering?  will that
be costly?

Same question how would I remove candidates older than two days?

If a worker fails for a short time and then come back on line what would
happen?

Thank you.

Mime
View raw message