Hi Cody and all,

Thank you for your answer. I implemented simple random sampling (SRS) for a DStream using the transform method, and it works fine.
However, I have a problem with reservoir sampling (RS). In RS, I need to maintain a reservoir (a queue) to store the selected data items (RDDs). If I define a large stream window, the queue grows with it and the driver runs out of memory. I explain the problem in detail here: https://docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok
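
For reference, here is a minimal sketch of textbook reservoir sampling (Algorithm R) in plain Python, with no Spark involved. It is not the code from the linked document. The class name Reservoir and its methods are hypothetical, chosen only for illustration. The point it shows is that the reservoir holds at most `capacity` individual items, so its memory does not depend on the window size. Whether storing items instead of RDDs fits the design in the document is an assumption.

```python
import random


class Reservoir:
    """Fixed-capacity uniform sample over a stream (Algorithm R).

    Hypothetical helper for illustration; not part of Spark.
    """

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []      # never grows beyond `capacity`
        self.seen = 0        # number of items offered so far
        self._rng = random.Random(seed)

    def offer(self, item):
        """Consider one stream item for inclusion in the sample."""
        self.seen += 1
        if len(self.items) < self.capacity:
            # Fill phase: keep the first `capacity` items unconditionally.
            self.items.append(item)
        else:
            # Replace phase: keep the new item with probability capacity / seen.
            j = self._rng.randrange(self.seen)  # uniform in [0, seen)
            if j < self.capacity:
                self.items[j] = item

    def reset(self):
        """Start a fresh sample, e.g. at a window boundary."""
        self.items = []
        self.seen = 0


if __name__ == "__main__":
    reservoir = Reservoir(capacity=5, seed=42)
    for x in range(1000):
        reservoir.offer(x)
    print(len(reservoir.items), reservoir.seen)
```

With this shape the driver only ever holds `capacity` items per reservoir, however large the window is.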

Could you please give me some suggestions or advice to fix this problem?


On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger <cody@koeninger.org> wrote:
With most stream systems you're still going to incur the cost of reading
each message... I suppose you could rotate among reading just the
latest messages from a single partition of a Kafka topic if they were
evenly balanced.

But once you've read the messages, nothing's stopping you from
filtering most of them out before doing further processing.  The
dstream .transform method will let you do any filtering / sampling you
could have done on an rdd.
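
A minimal sketch of that idea in Python, assuming the PySpark API, where `DStream.transform` takes an RDD-to-RDD function and `RDD.sample(withReplacement, fraction, seed)` does the per-batch sampling. The helper names sample_batch and sample_stream and the 10% default fraction are hypothetical, picked only for illustration.

```python
def sample_batch(rdd, fraction=0.1, seed=None):
    """Sample one batch RDD without replacement.

    Each element is kept independently with probability `fraction`,
    so the result size is only approximately fraction * count.
    """
    return rdd.sample(False, fraction, seed)


def sample_stream(dstream, fraction=0.1):
    """Return a DStream in which every batch has been sampled."""
    # transform() applies the given RDD -> RDD function to each batch.
    return dstream.transform(lambda rdd: sample_batch(rdd, fraction))
```

Assuming `stream` is an existing DStream, `sample_stream(stream, 0.05)` would pass roughly 5% of each batch on to later stages.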

On Fri, Jul 29, 2016 at 9:57 AM, Martin Le <martin.lequoc@gmail.com> wrote:
> Hi all,
> I have to handle a high-rate data stream. To reduce the heavy load, I
> want to use sampling techniques for each stream window. That is, I want
> to process a subset of the data instead of the whole window. I saw that
> Spark supports sampling operations for RDDs; does it support sampling
> operations for DStreams as well? If not, could you please suggest
> how to implement it?
> Thanks,
> Martin