spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Martin Le <martin.leq...@gmail.com>
Subject Re: sampling operation for DStream
Date Mon, 01 Aug 2016 20:22:10 GMT
How to do that? if I put the queue inside .transform operation, it
doesn't work.

On Mon, Aug 1, 2016 at 6:43 PM, Cody Koeninger <cody@koeninger.org> wrote:

> Can you keep a queue per executor in memory?
>
> On Mon, Aug 1, 2016 at 11:24 AM, Martin Le <martin.lequoc@gmail.com>
> wrote:
> > Hi Cody and all,
> >
> > Thank you for your answer. I implement simple random sampling (SRS) for
> > DStream using transform method, and it works fine.
> > However, I have a problem when I implement reservoir sampling (RS). In
> RS, I
> > need to maintain a reservoir (a queue) to store selected data items
> (RDDs).
> > If I define a large stream window, the queue also increases  and it
> leads to
> > the driver run out of memory.  I explain my problem in detail here:
> >
> https://docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok
> >
> > Could you please give me some suggestions or advice to fix this problem?
> >
> > Thanks
> >
> > On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger <cody@koeninger.org>
> wrote:
> >>
> >> Most stream systems you're still going to incur the cost of reading
> >> each message... I suppose you could rotate among reading just the
> >> latest messages from a single partition of a Kafka topic if they were
> >> evenly balanced.
> >>
> >> But once you've read the messages, nothing's stopping you from
> >> filtering most of them out before doing further processing.  The
> >> dstream .transform method will let you do any filtering / sampling you
> >> could have done on an rdd.
> >>
> >> On Fri, Jul 29, 2016 at 9:57 AM, Martin Le <martin.lequoc@gmail.com>
> >> wrote:
> >> > Hi all,
> >> >
> >> > I have to handle high-speed rate data stream. To reduce the heavy
> load,
> >> > I
> >> > want to use sampling techniques for each stream window. It means that
> I
> >> > want
> >> > to process a subset of data instead of whole window data. I saw Spark
> >> > support sampling operations for RDD, but for DStream, Spark supports
> >> > sampling operation as well? If not,  could you please give me a
> >> > suggestion
> >> > how to implement it?
> >> >
> >> > Thanks,
> >> > Martin
> >
> >
>

Mime
View raw message