Dear All,

I recently installed Spark 1.0.0 on a dedicated 10-slave cluster. However, the maximum input rate the system can sustain with stable latency seems very low. I use a simple word-counting workload over tweets:

theDStream.flatMap(extractWordOnePairs).reduceByKey(sumFunc).count.print
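
In case it helps, here is roughly what the full job looks like. The socket source and the bodies of extractWordOnePairs / sumFunc below are simplified stand-ins for my actual code, not the real implementation:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair DStream operations (reduceByKey)

object TweetWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TweetWordCount")
    val ssc = new StreamingContext(conf, Seconds(2))    // 2s batch interval

    // Placeholder source; in my setup the tweets come from a different receiver.
    val theDStream = ssc.socketTextStream("localhost", 9999)

    // Split each tweet into (word, 1) pairs.
    def extractWordOnePairs(tweet: String): Seq[(String, Int)] =
      tweet.split(" ").map(w => (w, 1))

    // Sum the counts per word.
    def sumFunc(a: Int, b: Int): Int = a + b

    theDStream.flatMap(extractWordOnePairs).reduceByKey(sumFunc).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}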

With a 2s batch interval, the 10-slave cluster can only handle ~30,000 tweets/s (which translates to ~300,000 words/s). To give you a sense of the speed of a single slave machine: one machine can handle ~100,000 tweets/s with a stream-processing program written in plain Java.

I've tuned the following parameters without seeing any obvious improvement (a sketch of how these are set follows the list):
1. Batch interval: 1s, 2s, 5s, 10s
2. Parallelism: 1x, 2x, and 3x the total number of cores
3. StorageLevel: MEMORY_ONLY, MEMORY_ONLY_SER
4. Deployment mode: yarn-client, standalone cluster
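
For reference, this is roughly how I set these parameters (standard Spark 1.0.0 settings; the values shown are just one of the combinations I tried, and the core count per slave is illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("TweetWordCount")
  .set("spark.default.parallelism", "240")        // e.g. 3x (10 slaves x 8 cores)
val ssc = new StreamingContext(conf, Seconds(2))  // batch interval: 1s / 2s / 5s / 10s

// Storage level of the received stream: MEMORY_ONLY vs. MEMORY_ONLY_SER.
// The deployment mode (yarn-client vs. standalone) is chosen via the --master
// option of spark-submit.
val theDStream =
  ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER)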

* My first question is: what are the maximum input rates you have observed with Spark Streaming? I know this depends on the workload and the hardware, but I just want to get a sense of what reasonable numbers look like.

* My second question is: any suggestions on what I can tune to improve the performance? I've found unexpected delays in the "reduce" stages that I can't explain, and they may be related to the poor performance. Details are shown below.

============= DETAILS =============

Below is the CPU utilization plot with a 2s batch interval and 40,000 tweets/s. The latency keeps increasing while the CPU, network, disk, and memory are all under-utilized.


I tried to find out which stage is the bottleneck. The "reduce" stage of each batch usually finishes in less than 0.5s, but sometimes (70 out of 545 batches) it takes ~5s. Below is a snapshot of the web UI showing the time taken by "reduce" in some batches, where the normal cases are marked in green and the abnormal case is marked in red:


I further looked into all the tasks of a slow "reduce" stage. As shown in the snapshot below, a small portion of the tasks are stragglers:


Here is the log of some slow "reduce" tasks on an executor, with the start and end of the tasks marked in red. They started at 21:55:43 and completed at 21:55:48. During those 5 seconds, I can only see some shuffling at the beginning and activity related to input blocks.


...

For comparison, here is the log of the normal "reduce" tasks on the same executor:


It seems that these slow "reduce" stages keep the latency increasing. However, I can't explain what causes the straggler tasks in the "reduce" stages. Does anybody have any idea?

Thanks.