With most stream systems you're still going to incur the cost of reading
each message... I suppose you could rotate among reading just the
latest messages from a single partition of a Kafka topic if they were
But once you've read the messages, nothing's stopping you from
filtering most of them out before doing further processing. The
DStream transform method will let you do any filtering / sampling you
could have done on an RDD.
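
A minimal sketch of what I mean, in PySpark. The helper names sample_batch and sampled_stream and the 10% default fraction are placeholders of mine, not part of Spark; the only Spark calls used are RDD.sample and DStream.transform.

```python
# Sketch only: sample each micro-batch of a DStream via transform().
# sample_batch / sampled_stream are hypothetical helper names.

def sample_batch(rdd, fraction=0.1, seed=None):
    """Return a random sample of one batch RDD, without replacement.

    fraction is the expected share of records kept, so the size of the
    result is approximate rather than exact.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    # RDD.sample(withReplacement, fraction, seed=None)
    return rdd.sample(False, fraction, seed)


def sampled_stream(dstream, fraction=0.1):
    """Apply sample_batch to every RDD that the DStream produces."""
    # transform() runs an arbitrary RDD-to-RDD function on each batch.
    return dstream.transform(lambda rdd: sample_batch(rdd, fraction))
```

Assuming you already have a DStream called lines, sampled_stream(lines, 0.05) would keep roughly 5% of each batch. To sample per window instead of per batch, you could probably apply the same helper to the result of lines.window(...). Every record is still read before it is sampled, so this only saves the downstream processing cost.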
On Fri, Jul 29, 2016 at 9:57 AM, Martin Le <email@example.com> wrote:
> Hi all,
> I have to handle high-speed rate data stream. To reduce the heavy load, I
> want to use sampling techniques for each stream window. It means that I want
> to process a subset of data instead of whole window data. I saw Spark
> support sampling operations for RDD, but for DStream, Spark supports
> sampling operation as well? If not, could you please give me a suggestion
> how to implement it?