spark-user mailing list archives

From Artemis User <arte...@dtechspace.com>
Subject Re: How to Scale Streaming Application to Multiple Workers
Date Fri, 16 Oct 2020 19:50:46 GMT
We can't use AWS since the target production environment has to be 
on-prem.  The reason we chose Spark is its ML libraries.  Lambda would 
be a great model for stream processing from a functional programming 
perspective, but I am not sure how well it can be integrated with Spark 
ML or other ML libraries.  Any suggestions would be highly appreciated.
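
For context, what we have in mind is roughly the following: a pipeline 
trained offline and then applied to the stream inside the query.  This 
is only a sketch; the paths, schema, and model location are made up:

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("StreamingScoring").getOrCreate()

    // Stand-in schema for the incoming CSV files (file sources need one)
    val csvSchema = new StructType()
      .add("id", LongType)
      .add("feature1", DoubleType)
      .add("feature2", DoubleType)

    // Pipeline trained and saved offline (hypothetical path)
    val model = PipelineModel.load("/models/ingest_pipeline")

    val input = spark.readStream
      .schema(csvSchema)
      .csv("/data/incoming")

    // A transformer-only pipeline can be applied directly to a
    // streaming DataFrame
    val scored = model.transform(input)

    scored.writeStream
      .format("parquet")
      .option("path", "/data/scored")
      .option("checkpointLocation", "/data/checkpoints/scoring")
      .start()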

ND

On 10/16/20 2:49 PM, Lalwani, Jayesh wrote:
>
> With a file-based source, Spark is going to make maximum use of 
> memory before it tries to scale out to more nodes. Parallelization 
> adds overhead. This overhead is negligible if your data is several 
> gigs or above. If your entire data can fit into the memory of one 
> node, then it's better to process everything in one node. Forcing 
> Spark to parallelize processing that can be done on a single node 
> will reduce throughput.
>
> You are right, though. Spark is overkill for a simple transformation 
> on a 300KB file. A lot of people implement simple transformations 
> using serverless AWS Lambda. Spark's power comes in when you are 
> joining streaming sources and/or joining streaming sources with 
> batch sources. It's not that Spark can't do simple transformations; 
> it's perfectly capable of doing them. It makes sense to implement 
> simple transformations in Spark if you have a data pipeline that is 
> implemented in Spark, and this ingestion is one of many other things 
> that you do with Spark. But if your entire pipeline consists of 
> ingestion of small files, then you might be better off with simpler 
> solutions.
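>
> For example, a stream-static join looks roughly like this (the paths, 
> schema, and column names below are invented for illustration):
>
>     import org.apache.spark.sql.SparkSession
>     import org.apache.spark.sql.types._
>
>     val spark = SparkSession.builder.appName("StreamStaticJoin").getOrCreate()
>
>     // Static (batch) reference data, e.g. a dimension table
>     val customers = spark.read.parquet("/data/customers")
>
>     // Streaming fact data; file sources need an explicit schema
>     val ordersSchema = new StructType()
>       .add("order_id", LongType)
>       .add("customer_id", LongType)
>       .add("amount", DoubleType)
>
>     val orders = spark.readStream
>       .schema(ordersSchema)
>       .csv("/data/orders_incoming")
>
>     // Each micro-batch of orders is joined against the static table
>     val enriched = orders.join(customers, Seq("customer_id"), "left_outer")
>
>     enriched.writeStream
>       .format("console")
>       .start()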
>
> *From: *Artemis User <artemis@dtechspace.com>
> *Date: *Friday, October 16, 2020 at 2:19 PM
> *Cc: *user <user@spark.apache.org>
> *Subject: *RE: [EXTERNAL] How to Scale Streaming Application to 
> Multiple Workers
>
> Thank you all for the responses.  Basically we were dealing with a 
> file source (not Kafka, so no topics involved) and dumping csv files 
> (about 1000 lines, 300KB per file) at a pretty high speed (10 - 15 
> files/second), one at a time, into the stream source directory.  We 
> have a Spark 3.0.1 cluster configured with 4 workers, each allocated 
> 4 cores.  We tried numerous options, including setting the 
> spark.streaming.dynamicAllocation.enabled parameter to true and 
> setting maxFilesPerTrigger to 1, but were unable to scale to 
> #cores * #workers > 4.
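>
> For reference, the relevant part of our query looks roughly like the 
> following (paths and schema are stand-ins).  With maxFilesPerTrigger 
> set to 1, each micro-batch is a single ~300KB file which, as far as I 
> can tell, ends up as a single partition and therefore a single task:
>
>     import org.apache.spark.sql.{DataFrame, SparkSession}
>     import org.apache.spark.sql.types._
>
>     // Submitted against the 4-worker cluster, e.g.:
>     //   spark-submit --master spark://master:7077 \
>     //     --conf spark.executor.cores=4 --conf spark.cores.max=16 ...
>     val spark = SparkSession.builder.appName("CsvIngest").getOrCreate()
>
>     // Stand-in for the real 1000-line CSV layout
>     val csvSchema = new StructType()
>       .add("id", LongType)
>       .add("value", DoubleType)
>
>     val input = spark.readStream
>       .schema(csvSchema)
>       .option("maxFilesPerTrigger", "1")   // one 300KB file per micro-batch
>       .csv("/data/incoming")
>
>     val query = input.writeStream
>       .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
>         // each micro-batch is processed independently here
>         batchDF.write.mode("append").parquet("/data/out")
>       }
>       .option("checkpointLocation", "/data/checkpoints/ingest")
>       .start()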
>
> What I am trying to understand is what makes Spark allocate jobs to 
> more workers.  Is it based on the size of the data frame, the batch 
> size, or the trigger interval?  It looks like the Spark master 
> scheduler doesn't consider the number of input files waiting to be 
> processed, only the size of the data (i.e. the size of the data 
> frames) that has already been read or imported, before allocating 
> new workers.  If that is the case, then Spark really missed the 
> point and wasn't really designed for real-time streaming 
> applications.  I could write my own stream processor that would 
> distribute the load based on the number of input files, given that 
> each batch query is atomic and independent of the others.
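>
> One thing we could try before writing our own processor, if I 
> understand the repartition suggestion below correctly, is to let 
> several files into each trigger and then spread the micro-batch over 
> the cluster ourselves.  A rough sketch (the numbers are only 
> illustrative for our 4 x 4-core setup; spark and csvSchema are as in 
> the snippet above):
>
>     import org.apache.spark.sql.DataFrame
>
>     val query = spark.readStream
>       .schema(csvSchema)
>       .option("maxFilesPerTrigger", "16")   // several files per micro-batch
>       .csv("/data/incoming")
>       .writeStream
>       .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
>         // how many tasks this batch would otherwise run with
>         println(s"batch $batchId partitions = ${batchDF.rdd.getNumPartitions}")
>         // spread the work across all 16 cores before the expensive part
>         batchDF.repartition(16)
>           .write.mode("append").parquet("/data/out")
>       }
>       .option("checkpointLocation", "/data/checkpoints/ingest")
>       .start()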
>
> Thanks in advance for your comment/input.
>
> ND
>
> On 10/15/20 7:13 PM, muru wrote:
>
>     For file streaming in Structured Streaming, you can try setting
>     "maxFilesPerTrigger" per batch.  foreachBatch is an action; the
>     output is written to various sinks.  Are you doing any post
>     transformation in foreachBatch?
>
>     On Thu, Oct 15, 2020 at 1:24 PM Mich Talebzadeh
>     <mich.talebzadeh@gmail.com <mailto:mich.talebzadeh@gmail.com>> wrote:
>
>         Hi,
>
>         This in general depends on how many topics you want to process
>         at the same time and whether this is done on-premises, running
>         Spark in cluster mode.
>
>         Have you looked at the Spark GUI to see if one worker (one JVM)
>         is adequate for the task?
>
>         Also, how are these small files read and processed?  Is it the
>         same data micro-batched?  Spark streaming does not process one
>         event at a time, which is in general what I think people call
>         "streaming".  It instead processes groups of events.  Each group
>         is a "micro-batch" that gets processed at the same time.
>
>
>         What parameters (BatchInterval, WindowLength, SlidingInterval)
>         are you using?
>
>         Parallelism helps when you have reasonably large data and your
>         cores are running on different sections of the data in parallel.
>         Roughly how much data do you have in each CSV file?
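>
>         In Structured Streaming those roughly correspond to the trigger
>         interval and the window() function.  A sketch, with an invented
>         schema and path, just to illustrate the knobs:
>
>           import org.apache.spark.sql.SparkSession
>           import org.apache.spark.sql.functions.{col, count, window}
>           import org.apache.spark.sql.streaming.Trigger
>           import org.apache.spark.sql.types._
>
>           val spark = SparkSession.builder.getOrCreate()
>
>           // stand-in streaming source with an event_time column
>           val events = spark.readStream
>             .schema(new StructType()
>               .add("event_time", TimestampType)
>               .add("value", DoubleType))
>             .csv("/data/incoming")
>
>           val counted = events
>             .withWatermark("event_time", "10 minutes")
>             .groupBy(window(col("event_time"), "10 minutes", "5 minutes"))
>             .agg(count("*").as("events"))   // window length and slide above
>
>           counted.writeStream
>             .outputMode("update")
>             .format("console")
>             .trigger(Trigger.ProcessingTime("30 seconds"))  // trigger interval
>             .start()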
>
>         HTH,
>
>         Mich
>
>         LinkedIn
>         https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>         *Disclaimer:* Use it at your own risk. Any and all
>         responsibility for any loss, damage or destruction of data or
>         any other property which may arise from relying on this
>         email's technical content is explicitly disclaimed. The author
>         will in no case be liable for any monetary damages arising
>         from such loss, damage or destruction.
>
>         On Thu, 15 Oct 2020 at 20:02, Artemis User
>         <artemis@dtechspace.com <mailto:artemis@dtechspace.com>> wrote:
>
>             Thanks for the input.  What I am interested in is how to
>             have multiple workers read and process the small files in
>             parallel, and certainly one file per worker at a time.
>             Partitioning the data frame doesn't make sense since the
>             data frame is already small.
>
>             On 10/15/20 9:14 AM, Lalwani, Jayesh wrote:
>             > Parallelism of streaming depends on the input source. If
>             you are getting one small file per microbatch, then Spark
>             will read it in one worker. You can always repartition
>             your data frame after reading it to increase the parallelism.
>             >
>             > On 10/14/20, 11:26 PM, "Artemis User"
>             <artemis@dtechspace.com <mailto:artemis@dtechspace.com>>
>             wrote:
>             >
>             >      Hi,
>             >
>             >      We have a streaming application that reads
>             >      microbatch csv files and involves the foreachBatch
>             >      call.  Each microbatch can be processed
>             >      independently.  I noticed that only one worker node
>             >      is being utilized.  Is there any way or explicit
>             >      method to distribute the batch workload to multiple
>             >      workers?  I would think Spark would execute the
>             >      foreachBatch method on different workers since each
>             >      batch can be treated as atomic.
>             >
>             >      Thanks!
>             >
>             >      ND
>             >
>             >
>             >
>
