spark-user mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: Is "spark streaming" streaming or mini-batch?
Date Wed, 24 Aug 2016 09:40:45 GMT

On 23 Aug 2016, at 17:58, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:

In general, depending on what you are doing, you can tighten the above parameters. For example,
if you are using Spark Streaming for anti-fraud detection, you may stream data in at a 2-second
batch interval, keep your window length at 4 seconds and your sliding interval at 2 seconds,
which gives you a kind of tight streaming. You are aggregating the data that you collect over
the batch window.
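[A minimal sketch of the batching described above: 2-second batch interval, 4-second window,
2-second slide. The socket source, host/port and the simple count aggregation are illustrative
assumptions only, not part of the original mail.]

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedFraudCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowedFraudCounts")
    // Batch interval: a new micro-batch every 2 seconds
    val ssc = new StreamingContext(conf, Seconds(2))

    // Hypothetical input stream of transaction records, one per line
    val lines = ssc.socketTextStream("localhost", 9999)

    // Aggregate over a 4-second window, sliding every 2 seconds
    val countsPerWindow = lines
      .map(_ => 1L)
      .reduceByWindow(_ + _, Seconds(4), Seconds(2))

    countsPerWindow.print()
    ssc.start()
    ssc.awaitTermination()
  }
}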

I should warn that in https://github.com/apache/spark/pull/14731 I've been trying to speed
up input scanning against object stores, and collecting numbers on the way

*if you are using the FileInputDStream to scan s3, azure (and presumably gcs) object stores
for data, the time to scan a moderately complex directory tree is going to be measurable in
seconds*

It's going to depend on the distance from the object store and the number of files, but you'll
probably need to use a bigger window.
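[A minimal sketch of a file-based stream against an s3a path; the bucket name and the 30-second
interval are assumptions chosen for illustration. textFileStream() uses FileInputDStream
underneath, so every batch triggers a directory scan of the object store, and the batch
interval needs to comfortably exceed the scan time described above.]

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ObjectStoreFileStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ObjectStoreFileStream")
    // Batch interval widened to 30 seconds so each micro-batch has time to
    // finish listing the directory tree before the next one starts.
    val ssc = new StreamingContext(conf, Seconds(30))

    // Hypothetical landing directory; new files dropped here are picked up
    // once per batch by the underlying FileInputDStream directory scan.
    val lines = ssc.textFileStream("s3a://my-bucket/incoming/")

    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}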

(That patch for SPARK-17159 should improve things ... I'd love some people to help by testing
it, or by emailing me directly with an (anonymised) list of the directory structures they use
in object store FileInputDStream streams, which I could regenerate for inclusion in some
performance tests.)


