On Tue, Mar 28, 2017 at 8:24 AM, Jan Holmberg <jan.holmberg@perigeum.fi> wrote:
I'm wondering why a simple Spark program reading streaming data from Kafka and writing results to Kudu has unpredictable write times. In most cases, when running the program, write times are consistently around 4 sec regardless of the number of messages (anywhere from 50 to 2000 messages per batch). But occasionally, when starting the program, it runs substantially faster, with write times below 0.5 sec, with exactly the same code base, settings, etc.

How are you measuring "write times" here? Are you sure the time is being spent in the Kudu code and not in other parts of the streaming app?

Writing 2000 rows to Kudu should be on the order of a few milliseconds -- even 0.5 seconds sounds extremely high.

Are you by chance instantiating a new KuduClient each time you write a batch, rather than reusing an existing one?
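To illustrate the reuse pattern, here is a minimal sketch of a per-JVM singleton holder, so each Spark executor builds its client once and reuses it across batches. The names (KuduClientHolder, the "master:7051" address) are illustrative, not from this thread, and a String stands in for org.apache.kudu.client.KuduClient so the sketch runs standalone:

```scala
// Sketch: lazily build the client once per JVM (i.e., once per executor)
// instead of once per batch. Building inside foreachRDD pays connection
// and tablet-location setup on every batch.
object KuduClientHolder {
  var creations = 0  // counter kept only to demonstrate single construction

  // `lazy val` is initialized on first access and cached for the JVM's
  // lifetime. In the real app the body would be:
  //   new KuduClient.KuduClientBuilder("master:7051").build()
  lazy val client: String = {
    creations += 1
    "client@master:7051"  // stand-in for the real KuduClient instance
  }
}

// Usage inside the streaming job (shape only, shown as comments):
// stream.foreachRDD { rdd =>
//   rdd.foreachPartition { rows =>
//     val client = KuduClientHolder.client  // reused, not rebuilt
//     // ... open a session, apply operations, flush ...
//   }
// }

val a = KuduClientHolder.client
val b = KuduClientHolder.client
assert(a eq b)                        // same instance both times
assert(KuduClientHolder.creations == 1)
println(s"creations = ${KuduClientHolder.creations}")
```

If the kudu-spark integration is an option, its KuduContext takes care of this reuse internally, so the holder object is only needed when driving the Java client directly.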

Our environment is a plain AWS cluster with 3 slaves, where each slave runs a Kafka broker and a Kudu tablet server instance, with CDH 5.10, Kudu 1.2, and Spark 1.6.

Any hints on what to look at?


Todd Lipcon
Software Engineer, Cloudera