spark-user mailing list archives

From Artemis User <arte...@dtechspace.com>
Subject Re: How to Scale Streaming Application to Multiple Workers
Date Fri, 16 Oct 2020 18:17:20 GMT
Thank you all for the responses.  Basically we were dealing with a file 
source (not Kafka, therefore no topics involved), dumping CSV files 
(about 1,000 lines, 300 KB per file) at a fairly high rate (10 - 15 
files/second), one at a time, into the stream source directory.  We have a 
Spark 3.0.1 cluster configured with 4 workers, each allocated 4 cores.  
We tried numerous options, including setting the 
spark.streaming.dynamicAllocation.enabled parameter to true and setting 
maxFilesPerTrigger to 1, but were unable to scale beyond 4 cores in 
total, i.e. to get #cores * #workers > 4.
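
For context, the read side of our application looks roughly like the 
minimal sketch below (the schema, paths, app name, and the placeholder 
action passed to foreachBatch are made up for illustration, not our 
actual code):

# Minimal sketch of our file-source stream (hypothetical schema and paths).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("csv-file-stream").getOrCreate()

# Streaming file sources require an explicit schema; this one is made up.
schema = StructType([
    StructField("id", StringType()),
    StructField("value", DoubleType()),
])

stream_df = (spark.readStream
             .schema(schema)
             .option("maxFilesPerTrigger", 1)   # one file per micro-batch
             .csv("/data/incoming"))            # hypothetical source directory

query = (stream_df.writeStream
         .foreachBatch(lambda df, batch_id: df.count())          # placeholder action
         .option("checkpointLocation", "/data/checkpoints/csv")  # hypothetical
         .start())

(As far as I can tell, spark.streaming.dynamicAllocation.enabled applies 
to the old DStream API rather than Structured Streaming, so it may not 
have had any effect here.)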

What I am trying to understand is what makes Spark allocate jobs to 
more workers.  Is it based on the size of the data frame, the batch 
size, or the trigger interval?  It looks like the Spark scheduler 
doesn't consider the number of input files waiting to be processed, 
only the size of the data (i.e. the size of the data frames) that has 
already been read or imported, before allocating new workers.  If that 
is the case, then Spark really missed the point and wasn't designed for 
real-time streaming applications.  I could write my own stream processor 
that distributes the load based on the number of input files, given 
that each batch query is atomic and independent of the others.
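
For comparison, Jayesh's earlier suggestion of repartitioning after the 
read would look something like the sketch below (continuing from the 
sketch above; the partition count and sink are made up).  My concern is 
that repartitioning a ~1,000-row frame mostly adds shuffle overhead 
rather than giving true file-level parallelism across workers.

# Sketch: spread each micro-batch across executors by repartitioning inside
# foreachBatch (continues from the sketch above; the sink is hypothetical).
def process_batch(batch_df, batch_id):
    # 16 partitions to match 4 workers x 4 cores; only useful if the per-row
    # work is heavy enough to outweigh the shuffle.
    spread = batch_df.repartition(16)
    spread.write.mode("append").parquet("/data/out")   # hypothetical sink

query = (stream_df.writeStream
         .foreachBatch(process_batch)
         .option("checkpointLocation", "/data/checkpoints/csv")
         .start())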

Thanks in advance for your comment/input.

ND

On 10/15/20 7:13 PM, muru wrote:
> For file streaming in SS (Structured Streaming), you can try setting 
> "maxFilesPerTrigger" per batch. foreachBatch is an action; the output 
> is written to various sinks. Are you doing any post-transformation in 
> foreachBatch?
>
> On Thu, Oct 15, 2020 at 1:24 PM Mich Talebzadeh 
> <mich.talebzadeh@gmail.com> wrote:
>
>     Hi,
>
>     This in general depends on how many topics you want to process at
>     the same time and whether this is done on-premises with Spark
>     running in cluster mode.
>
>     Have you looked at Spark GUI to see if one worker (one JVM) is
>     adequate for the task?
>
>     Also, how are these small files read and processed? Is it the same
>     data micro-batched? Spark Streaming does not process one event at a
>     time, which is in general I think what people call "streaming." It
>     instead processes groups of events; each group is a "micro-batch"
>     that gets processed at the same time.
>
>
>     What parameters (BatchInterval, WindowLength, SlidingInterval) are
>     you using?
>
>
>     Parallelism helps when you have reasonably large data and your
>     cores are running on different sections of the data in parallel.
>     Roughly how much data do you have in each CSV file?
>
>
>     HTH,
>
>
>     Mich
>
>
>     LinkedIn:
>     https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
>     *Disclaimer:* Use it at your own risk. Any and all responsibility
>     for any loss, damage or destruction of data or any other property
>     which may arise from relying on this email's technical content is
>     explicitly disclaimed. The author will in no case be liable for
>     any monetary damages arising from such loss, damage or destruction.
>
>
>
>     On Thu, 15 Oct 2020 at 20:02, Artemis User
>     <artemis@dtechspace.com> wrote:
>
>         Thanks for the input.  What I am interested in is how to have
>         multiple workers read and process the small files in parallel,
>         and certainly one file per worker at a time.  Partitioning the
>         data frame doesn't make sense since the data frame is already
>         small.
>
>         On 10/15/20 9:14 AM, Lalwani, Jayesh wrote:
>         > Parallelism of streaming depends on the input source. If you
>         > are getting one small file per microbatch, then Spark will
>         > read it in one worker. You can always repartition your data
>         > frame after reading it to increase the parallelism.
>         >
>         > On 10/14/20, 11:26 PM, "Artemis User"
>         > <artemis@dtechspace.com> wrote:
>         >
>         >      Hi,
>         >
>         >      We have a streaming application that reads micro-batch
>         >      CSV files and involves the foreachBatch call.  Each
>         >      micro-batch can be processed independently.  I noticed
>         >      that only one worker node is being utilized.  Is there
>         >      any way or any explicit method to distribute the batch
>         >      workload to multiple workers?  I would think Spark would
>         >      execute the foreachBatch method on different workers
>         >      since each batch can be treated as atomic.
>         >
>         >      Thanks!
>         >
>         >      ND
