spark-user mailing list archives

From Artemis User <>
Subject Re: How to Scale Streaming Application to Multiple Workers
Date Thu, 15 Oct 2020 19:01:03 GMT
Thanks for the input.  What I am interested in is how to have multiple 
workers read and process the small files in parallel, ideally with each 
worker handling one file at a time.  Repartitioning the data frame doesn't 
make sense, since the data frame is already small.
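
A minimal PySpark sketch of one way to approach this, under stated assumptions. Structured Streaming runs the microbatches of a single query one after another, so parallelism across files generally has to come from several files landing in the same microbatch. The file source's maxFilesPerTrigger option caps how many new files one microbatch picks up. The input path, schema, and the helper names reader_options, process_batch, and build_query are hypothetical, and how Spark actually packs small files into partitions depends on settings such as spark.sql.files.openCostInBytes, so treat this as a starting point rather than a guaranteed one-file-per-worker setup.

```python
def reader_options(files_per_trigger):
    """Options for the streaming CSV file source (values are illustrative)."""
    return {
        # Upper bound on the number of new files read in one microbatch.
        "maxFilesPerTrigger": str(files_per_trigger),
    }


def process_batch(batch_df, batch_id):
    # Hypothetical per-batch work: report how many partitions (and so how
    # many parallel tasks) this microbatch was split into.
    print(f"batch {batch_id}: {batch_df.rdd.getNumPartitions()} partitions")


def build_query(spark, input_dir, schema_ddl, files_per_trigger):
    # Streaming file sources need an explicit schema; a DDL string is accepted.
    stream_df = (
        spark.readStream
        .schema(schema_ddl)
        .options(**reader_options(files_per_trigger))
        .csv(input_dir)
    )
    return stream_df.writeStream.foreachBatch(process_batch).start()


if __name__ == "__main__":
    # Imported here so the helpers above can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("multi-file-microbatch").getOrCreate()
    # Assumed input directory and schema, for illustration only.
    query = build_query(spark, "/data/incoming", "id INT, value STRING", 8)
    query.awaitTermination()
```

If the files arrive strictly one at a time, each microbatch will still contain a single file regardless of this option; in that case the partition count printed by process_batch would show whether there is anything to spread across workers.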

On 10/15/20 9:14 AM, Lalwani, Jayesh wrote:
> Parallelism of streaming depends on the input source. If you are getting one small
> file per microbatch, then Spark will read it in one worker. You can always repartition
> your data frame after reading it to increase the parallelism.
> On 10/14/20, 11:26 PM, "Artemis User" <> wrote:
>      Hi,
>      We have a streaming application that reads microbatch CSV files and
>      involves a foreachBatch call.  Each microbatch can be processed
>      independently.  I noticed that only one worker node is being utilized.
>      Is there any way or any explicit method to distribute the batch workload
>      to multiple workers?  I would think Spark would execute the foreachBatch
>      method on different workers, since each batch can be treated as atomic?
>      Thanks!
>      ND