flink-user mailing list archives

From TechnoMage <mla...@technomage.com>
Subject Re: Flink/Kafka POC performance issue
Date Tue, 17 Apr 2018 15:50:33 GMT
Memory use is steady throughout the job, but the CPU utilization drops off a cliff.  I assume
this is because it becomes I/O bound shuffling managed state.

Are there any metrics on managed state that can help in evaluating what to do next?
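A hedged sketch of one option: if the job uses the RocksDB state backend, newer Flink releases (1.7 and later) can forward RocksDB's native metrics through Flink's metrics system via `flink-conf.yaml` keys such as the following. Key names and availability depend on the Flink version, so check the configuration documentation for the release in use.

```yaml
# Sketch only: these keys apply to the RocksDB state backend in Flink 1.7+;
# verify against your version's configuration reference before relying on them.
state.backend: rocksdb
state.backend.rocksdb.metrics.estimate-num-keys: true        # approximate key count per state
state.backend.rocksdb.metrics.estimate-live-data-size: true  # approximate live data size
state.backend.rocksdb.metrics.cur-size-all-mem-tables: true  # memtable memory in use
```

These then appear per operator in the web dashboard and the REST metrics endpoints, which helps distinguish state growth from pure I/O slowdown.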


> On Apr 17, 2018, at 7:11 AM, Michael Latta <mlatta@technomage.com> wrote:
> Thanks for the suggestion. The task manager is configured for 8GB of heap and gets to
about 8.3GB total; other Java processes (the job manager and Kafka) add a few more. I will check
it again, but the instances have 16GB, the same as my laptop, which completes the test in <90 min.

> Michael
> Sent from my iPad
> On Apr 16, 2018, at 10:53 PM, Niclas Hedhman <niclas@hedhman.org> wrote:
>> Have you checked memory usage? It could be as simple as having memory leaks,
or aggregating more than you think (it is sometimes not obvious how much is kept in memory
longer than one first expects). If possible, connect Flight Recorder or a similar tool and
keep an eye on memory. Additionally, I don't have AWS experience to speak of, but if AWS swaps
RAM to disk like regular Linux, that might be triggered if your JVM heap is bigger than
the available RAM can accommodate.
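The swap hypothesis above is easy to check directly on the instance while the job runs; a minimal sketch (Linux-only, reads `/proc/meminfo`):

```shell
#!/bin/sh
# Compare SwapTotal vs SwapFree while the job is running; a shrinking SwapFree
# (or non-zero si/so columns in `vmstat 1`) means the box is actively swapping.
grep -E '^(MemTotal|MemAvailable|SwapTotal|SwapFree):' /proc/meminfo
```

If SwapFree drops noticeably around the 50-minute mark, the 8GB heap plus Kafka and ZooKeeper is likely overflowing the 16GB box.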
>> On Tue, Apr 17, 2018 at 9:26 AM, TechnoMage <mlatta@technomage.com> wrote:
>> I am doing a short proof of concept for using Flink and Kafka in our product.  On
my laptop I can process 10M inputs in about 90 min.  On 2 different EC2 instances (m4.xlarge
and m5.xlarge, both 4 cores, 16GB RAM, and SSD storage) I see the process hit a wall around 50 min
into the test, short of 7M events processed.  This is running ZooKeeper, the Kafka broker, and
Flink all on the same server in all cases.  My goal is to measure single node vs. multi-node
and test horizontal scalability, but I would like to figure out why it hits a wall first.
 I have the task manager configured with 6 slots and the job has a parallelism of 5.  The laptop
has 8 threads, and the EC2 instances have 4 threads. On smaller data sets and at the beginning
of each test the EC2 instances outpace the laptop.  I will try again with an m5.2xlarge, which
has 8 threads and 32GB RAM, to see if that works better for this workload.  Any pointers or
ways to get metrics that would help diagnose this would be appreciated.
>> Michael
>> -- 
>> Niclas Hedhman, Software Developer
>> http://polygene.apache.org - New Energy for Java
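For reference, the slot and heap setup described in the quoted message would look roughly like this in `flink-conf.yaml`. This is a sketch for the Flink 1.x era discussed here; `taskmanager.heap.mb` in particular was later renamed, so treat the key names as version-dependent.

```yaml
# Sketch matching the described setup: 8GB task manager heap, 6 slots, parallelism 5.
taskmanager.heap.mb: 8192
taskmanager.numberOfTaskSlots: 6
parallelism.default: 5
```

With only 4 hardware threads on the m4.xlarge/m5.xlarge, 6 slots plus Kafka and ZooKeeper oversubscribe the cores, which is consistent with the instances outpacing the laptop early and then falling behind.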
