spark-user mailing list archives

From Dean Wampler <deanwamp...@gmail.com>
Subject Re: Multiple operations on same DStream in Spark Streaming
Date Tue, 28 Jul 2015 13:50:24 GMT
Is this average supposed to be across all partitions? If so, it will
require one of the reduce operations in every batch interval. If
that's too slow for the data rate, I would investigate using
PairDStreamFunctions.updateStateByKey to compute the sum + count of the 2nd
integers, per 1st integer, then do the filtering and final averaging
"downstream" if you can, i.e., where you actually need the final value. If
you need it on every batch iteration, then you'll have to do a reduce per
iteration.
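
A rough sketch of both options, untested against a real job. The object
and method names (StreamAverages, averageOfSecondAboveMean, updateSumCount,
and so on) are made up for illustration, and I'm assuming the stream is a
DStream[(Int, Int)]. The per-batch version keeps the average as a plain
local Double, so the closures only capture a serializable value. The
updateStateByKey version keeps a running (sum, count) per key across
batches, which is not the same thing as a per-batch average, and it
requires checkpointing to be enabled on the StreamingContext.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper object; every name in here is illustrative.
object StreamAverages {

  // Combines two (sum, count) pairs.
  private def add(x: (Long, Long), y: (Long, Long)): (Long, Long) =
    (x._1 + y._1, x._2 + y._2)

  // Per-batch version: one reduce for the mean of the first ints, then a
  // second reduce over the records whose first int is above that mean.
  // Returns None for an empty batch or when no record qualifies.
  def averageOfSecondAboveMean(batch: RDD[(Int, Int)]): Option[Double] = {
    val (sum1, n1) =
      batch.map { case (a, _) => (a.toLong, 1L) }.fold((0L, 0L))(add)
    if (n1 == 0L) None
    else {
      // A local Double: the only value the filter closure captures.
      val mean1 = sum1.toDouble / n1
      val (sum2, n2) = batch
        .filter { case (a, _) => a > mean1 }
        .map { case (_, b) => (b.toLong, 1L) }
        .fold((0L, 0L))(add)
      if (n2 == 0L) None else Some(sum2.toDouble / n2)
    }
  }

  // Runs the per-batch computation on every batch interval.
  def printPerBatch(stream: DStream[(Int, Int)]): Unit =
    stream.foreachRDD { rdd =>
      rdd.cache() // the batch is scanned twice
      println(averageOfSecondAboveMean(rdd))
      rdd.unpersist()
    }

  // State update for updateStateByKey: running (sum, count) of the second
  // ints seen so far for one key (the first int).
  def updateSumCount(newValues: Seq[Int],
                     state: Option[(Long, Long)]): Option[(Long, Long)] = {
    val (sum, count) = state.getOrElse((0L, 0L))
    Some((sum + newValues.map(_.toLong).sum, count + newValues.size))
  }

  // Needs ssc.checkpoint(...) to be set. On Spark releases before 1.3 you
  // would also need: import org.apache.spark.streaming.StreamingContext._
  def runningSumCount(stream: DStream[(Int, Int)]): DStream[(Int, (Long, Long))] =
    stream.updateStateByKey[(Long, Long)](updateSumCount _)
}
```

With the running (sum, count) per key available, the filtering and the
final division can be done wherever the value is actually consumed.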

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Tue, Jul 28, 2015 at 3:14 AM, Akhil Das <akhil@sigmoidanalytics.com>
wrote:

> One approach would be to store the batch data in an intermediate storage
> (like HBase/MySQL or even in ZooKeeper), and inside your filter function
> you just go and read the previous value from this storage and do whatever
> operation that you are supposed to do.
>
> Thanks
> Best Regards
>
> On Sun, Jul 26, 2015 at 3:37 AM, foobar <heathguo@fb.com> wrote:
>
>> Hi, I'm working with Spark Streaming using Scala, and trying to figure out
>> the following problem. In my DStream[(Int, Int)], each record is an Int
>> pair. For each batch, I would like to filter out all records whose first
>> integer is below the average of the first integers in this batch, and for
>> all records whose first integer is above that average, compute the average
>> of the second integers in such records. What's the best practice to
>> implement this? I tried this but kept getting the "object not serializable"
>> exception because it's hard to share variables (such as the average of the
>> first integers in the batch) between workers and driver. Any suggestions?
>> Thanks!
>
