spark-issues mailing list archives

From "Aaron Davidson (JIRA)" <>
Subject [jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
Date Fri, 12 Dec 2014 20:15:13 GMT


Aaron Davidson commented on SPARK-4740:

The thing is, the decision about which IO requests to make is made at a higher level -- individual
blocks are requested by the ShuffleBlockFetcherIterator -- so it does not change based on Netty
vs. NIO. The only difference I can think of is the concurrency of the requests: with Netty
we make only 1 concurrent request per machine in the cluster (so in the 4-node case, 3 concurrent
requests to disk), while with NIO we saw 20 concurrent threads reading from disk in the same
environment. It's possible that the disk controllers handle the increased request parallelism
well by merging IO requests at the hardware level.

To this end, we added the numConnectionsPerPeer option to allow users to make multiple
concurrent requests per machine. However, testing on the HDD cluster did not show that this
resolved the problem. I would be curious what the results would be for the powerful-CPU test
with numConnectionsPerPeer set to around 6, which should allow up to 18 concurrent requests
to disk, similar to the ~20 observed with NIO.
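As a hedged illustration (assuming the option is exposed through the
spark.shuffle.io.numConnectionsPerPeer configuration key, and reusing the 4-node setup from
this issue), the setting could be applied like this:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: with 3 remote peers, 6 connections per peer allows up to
    // 6 * 3 = 18 concurrent requests, close to the ~20 observed under NIO.
    val conf = new SparkConf()
      .setAppName("sort-by-key-netty-test")
      .set("spark.shuffle.blockTransferService", "netty")   // exercise the Netty path
      .set("spark.shuffle.io.numConnectionsPerPeer", "6")   // 6 connections per remote peer
    val sc = new SparkContext(conf)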

Another possibility is that, while the requests in the Netty and NIO cases are the same,
we receive all the IO requests from a single reduce task before moving on to the next. Two
reduce tasks are each going to read through all the local files, so if we ask the disk to read,
say, 10 KB from 200 files, and subsequently do another sweep of 10 KB from the same 200 files,
this could be less efficient than interleaving the requests so that we read 10 KB from file0
for reduce0, then 10 KB from file0 for reduce1, and only then move on to file1. It is possible
that NIO somehow interleaves these requests more naturally than Netty, though this behavior
is not fundamental to either system and is probably a happy coincidence if it is the case.
We could try, for instance, handling the responses to block requests on a different event loop
from the one receiving the requests, and randomly choosing among any buffered requests when
serving IO.
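To make the interleaving idea concrete, here is a minimal, self-contained Scala sketch (not
Spark code; the file and reducer counts are hypothetical) that just prints the order in which
10 KB chunks would be requested under each pattern:

    object AccessPatterns extends App {
      val numFiles = 200
      val reducers = Seq("reduce0", "reduce1")

      // Pattern 1 (sweep per reducer): reduce0 touches all 200 files, then
      // reduce1 touches all 200 files again -- every file is visited twice.
      for (r <- reducers; f <- 0 until numFiles)
        println(s"$r reads 10 KB from file$f")

      // Pattern 2 (interleaved): both reducers' requests for a given file are
      // adjacent, so each file is visited once and the disk seeks far less.
      for (f <- 0 until numFiles; r <- reducers)
        println(s"$r reads 10 KB from file$f")
    }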

> Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
> ------------------------------------------------------------------------
>                 Key: SPARK-4740
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Zhang, Liye
>            Assignee: Reynold Xin
>         Attachments: (rxin patch better executor)TestRunner  sort-by-key - Thread dump for executor,
>                      (rxin patch normal executor)TestRunner  sort-by-key - Thread dump for executor 0,
>                      Spark-perf Test Report 16 Cores per Executor.pdf,
>                      Spark-perf Test Report.pdf,
>                      TestRunner  sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip,
>                      TestRunner  sort-by-key - Thread dump for executor 1_files (Nio-48 cores per node).zip,
> When testing the current Spark master (1.3.0-SNAPSHOT) with spark-perf (sort-by-key, aggregate-by-key,
> etc.), the Netty-based shuffle transfer service takes much longer than the NIO-based one.
> The network throughput of Netty is only about half that of NIO.
> We tested in standalone mode, and the data set we used is 20 billion records with a total
> size of about 400 GB. The spark-perf test ran on a 4-node cluster with 10G NICs, 48 CPU
> cores per node, and 64 GB of memory per executor. The number of reduce tasks was set to 1000.
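For context, here is a rough sketch of the kind of workload the report describes -- an
approximation only, not the actual spark-perf harness; the data generator below is hypothetical
and scaled down from the 20-billion-record / ~400 GB data set, while the 1000 reduce tasks
match the reported configuration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // pair-RDD implicits (Spark 1.2-era)

    object SortByKeyRepro extends App {
      val sc = new SparkContext(new SparkConf().setAppName("sort-by-key-repro"))

      // Hypothetical data generation: random keys force a full shuffle.
      val records = sc.parallelize(0L until 20000000L, 1000)
        .map(i => (scala.util.Random.nextLong(), i))

      // 1000 reduce tasks, matching the reported setting.
      records.sortByKey(numPartitions = 1000).count()

      sc.stop()
    }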
