spark-user mailing list archives

From Teemu Heikkilä <te...@emblica.fi>
Subject Re: Measuring cluster utilization of a streaming job
Date Tue, 14 Nov 2017 13:02:01 GMT
Without knowing anything about your pipeline, the best way to estimate the resources needed is to
run the job with the same ingestion rate as the normal production load.
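
For a test run, one way to bound the ingestion at a production-like per-partition rate with the direct Kafka DStream API is the maxRatePerPartition setting. A minimal sketch; the app name and the rate value are placeholders you would replace with your own numbers:

    import org.apache.spark.SparkConf

    // Cap each Kafka partition at roughly the production ingestion rate
    // (records per second per partition); 1000 here is only a placeholder.
    val conf = new SparkConf()
      .setAppName("ingestion-load-test")
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")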

With Kafka you can enable backpressure, so under high load your latency will just increase and
you don't have to provision capacity for handling the spikes. If you want, you can then e.g.
autoscale the cluster to respond to the load.
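
A minimal sketch of enabling backpressure for a DStream job; the initial rate is a placeholder, not a recommendation:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-with-backpressure")
      // Let Spark adapt the receive rate to the observed processing speed
      .set("spark.streaming.backpressure.enabled", "true")
      // Starting rate used before the rate estimator has any feedback
      .set("spark.streaming.backpressure.initialRate", "500")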

If you are using YARN you can isolate and limit resources, so you can also run other workloads
in the same cluster if you need a lot of elasticity.
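
For instance, you can submit the streaming job to its own YARN queue with a fixed executor footprint. A sketch assuming a queue named "streaming" has already been defined in your scheduler configuration; the sizes are placeholders:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("isolated-streaming-job")
      // Run in a dedicated YARN queue so other workloads keep their own capacity
      .set("spark.yarn.queue", "streaming")
      // Keep the job's footprint fixed and predictable
      .set("spark.executor.instances", "4")
      .set("spark.executor.cores", "2")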

Usually with streaming jobs the concerns are not about computing capacity but rather about network
bandwidth and memory consumption.
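
As a rough sketch, the main memory knobs on YARN in Spark 2.1 look like this; the sizes are placeholders you would tune from observed usage:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-memory-sizing")
      // Heap per executor
      .set("spark.executor.memory", "4g")
      // Off-heap overhead (in MB) YARN adds on top of the heap; Spark 2.1 property name
      .set("spark.yarn.executor.memoryOverhead", "512")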


> On 14.11.2017, at 14.54, Nadeem Lalani <nadeemajl@gmail.com> wrote:
> 
> Hi,
> 
> I was wondering if anyone has done some work around measuring the cluster resource utilization
> of a "typical" Spark streaming job.
> 
> We are trying to build a message ingestion system which will read from Kafka and do some
> processing. We have had some concerns raised in the team that a 24x7 streaming job might
> not be the best use of cluster resources, especially when our use cases are to process data
> in a micro-batch fashion and are not truly streaming.
> 
> We wanted to measure how much resource a Spark streaming process takes. Any
> pointers on where one would start?
> 
> We are on YARN and plan to use Spark 2.1.
> 
> Thanks in advance,
> Nadeem 


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

