spark-user mailing list archives

From Dimitris Kouzis - Loukas <look...@gmail.com>
Subject Re: Pause Spark Streaming reading or sampling streaming data
Date Thu, 06 Aug 2015 09:33:06 GMT
Re-reading your description: I guess you could make your input source
connect for 10 seconds, disconnect for the next 50, and then reconnect.
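
A minimal sketch of that connect-10/pause-50 idea, written in plain Python so it runs without Spark. The names in_sampling_window, seconds_until_next_window and sample_records are hypothetical helpers, not part of any Spark API; the 60-second cycle and 10-second window come from the original question. In a custom receiver, the same check would presumably sit in the receive loop just before the call to store(), or be used to decide when to disconnect and how long to sleep before reconnecting.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def in_sampling_window(now_s: float, period_s: float = 60.0,
                       active_s: float = 10.0) -> bool:
    """Return True during the first `active_s` seconds of each `period_s` cycle."""
    return (now_s % period_s) < active_s


def seconds_until_next_window(now_s: float, period_s: float = 60.0,
                              active_s: float = 10.0) -> float:
    """Return 0 inside the window, else the time left until the next cycle starts.

    A receiver could disconnect and sleep for this long before reconnecting.
    """
    if in_sampling_window(now_s, period_s, active_s):
        return 0.0
    return period_s - (now_s % period_s)


def sample_records(records: Iterable[T],
                   clock: Callable[[], float] = time.time,
                   period_s: float = 60.0,
                   active_s: float = 10.0) -> Iterator[T]:
    """Yield only the records that arrive inside the sampling window.

    Records outside the window are dropped here, before they would be
    handed to the streaming framework (hypothetically, before store()).
    """
    for record in records:
        if in_sampling_window(clock(), period_s, active_s):
            yield record
```

Passing a fake clock makes the behaviour easy to check: with arrival times 0, 5, 10, 59, 60 and 70 seconds, only the records at 0, 5 and 60 are kept. Dropping at the source this way means the discarded 50 seconds of data never reach the DStream at all.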

On Thu, Aug 6, 2015 at 10:32 AM, Dimitris Kouzis - Loukas <lookfwd@gmail.com> wrote:

> Hi - yes, it's great that you wrote it yourself; it means you have more
> control. I have the feeling that the most efficient point to discard as
> much data as possible is your Spark input source, or you could even modify
> your subscription protocol so that the source never receives the other 50
> seconds of data at all. After you deliver data to a DStream you can filter
> it as much as you want, but you will still be subject to garbage
> collection and/or potential shuffles and HDD checkpoints.
>
> On Thu, Aug 6, 2015 at 1:31 AM, Heath Guo <heathguo@fb.com> wrote:
>
>> Hi Dimitris,
>>
>> Thanks for your reply. Just wondering – are you asking about my streaming
>> input source? I implemented a custom receiver and have been using that.
>> Thanks.
>>
>> From: Dimitris Kouzis - Loukas <lookfwd@gmail.com>
>> Date: Wednesday, August 5, 2015 at 5:27 PM
>> To: Heath Guo <heathguo@fb.com>
>> Cc: "user@spark.apache.org" <user@spark.apache.org>
>> Subject: Re: Pause Spark Streaming reading or sampling streaming data
>>
>> What driver do you use? Sounds like something you should do before the
>> driver...
>>
>> On Thu, Aug 6, 2015 at 12:50 AM, Heath Guo <heathguo@fb.com> wrote:
>>
>>> Hi, I have a question about sampling Spark Streaming data, or getting
>>> part of the data. In every minute, I only want to keep the data read
>>> during the first 10 seconds and discard all data from the next 50
>>> seconds. Is there any way to pause reading and discard data in that
>>> period? I'm doing this to sample from a very large stream of data, which
>>> saves processing time in the real-time program. Thanks!
>>>
>>
>>
>
