spark-issues mailing list archives

From "yucai (JIRA)" <>
Subject [jira] [Commented] (SPARK-12196) Store/retrieve blocks in different speed storage devices by hierarchy way
Date Tue, 05 Jan 2016 02:21:39 GMT


yucai commented on SPARK-12196:

Hi Wei, I think both items could be implemented in separate PRs. As for item 2, it would
be great if you could contribute :).

> Store/retrieve blocks in different speed storage devices by hierarchy way
> -------------------------------------------------------------------------
>                 Key: SPARK-12196
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: yucai
> *Motivation*
> Nowadays, customers have both SSDs (SATA SSD/PCIe SSD) and HDDs. 
> SSDs have great performance, but small capacity. 
> HDDs have good capacity, but are much slower than SSDs (x2-x3 slower than SATA SSD, x20 slower
than PCIe SSD).
> How can we get the best of both?
> *Proposal*
> One solution is to build a hierarchy store: use SSDs as cache and HDDs as backup storage.

> When Spark core allocates blocks (either for shuffle or RDD cache), it gets blocks from
SSDs first, and when the SSDs' usable space is less than some threshold, it gets blocks from HDDs.
> In our implementation, we actually go further: we support building a hierarchy
store of any number of levels across various storage media (MEM, NVM, SSD, HDD, etc.).
> *Performance*
> 1. In the best case, our solution performs the same as all-SSDs.
> 2. In the worst case, e.g. when all data are spilled to HDDs, there is no performance regression.
> 3. Compared with all-HDDs, the hierarchy store improves performance by more than *_x1.86_* (it could be
higher; CPU reached the bottleneck in our test environment).
> 4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, because we support
both RDD cache and shuffle, with no extra inter-process communication.
> *Test Environment*
> 1. 4 IVB boxes (40 cores, 192GB memory, 10GB NIC, 11 HDDs/11 SATA SSDs/PCIe SSD) 
> 2. Real customer case: NWeight (graph analysis), which computes associations between
two vertices that are n hops away (e.g., friend-to-friend or video-to-video relationships).
> 3. Data size: 22GB; vertices: 41 million; edges: 1.4 billion.
> *Usage*
> 1. Set the priority and threshold for each layer in
> {code}
>'nvm 40GB,ssd 20GB'
> {code}
> It builds a 3-layer hierarchy store: the 1st layer is "nvm", the 2nd is "ssd", and all the rest
form the last layer.
> 2. Configure each layer's location: the user just needs to put the keywords like "nvm" and "ssd",
as specified in step 1, into the local dirs, like spark.local.dir or yarn.nodemanager.local-dirs.
> {code}
> spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
> {code}
> After that, restart your Spark application; it will allocate blocks from nvm first.
> When nvm's usable space is less than 40GB, it starts to allocate from ssd.
> When ssd's usable space is less than 20GB, it starts to allocate from the last layer.
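The fall-through behavior described in the quoted steps can be sketched in a few lines. This is only a minimal illustration of the policy, not the actual SPARK-12196 patch; the class and method names (`HierarchySketch`, `parseLayers`, `selectLayer`) are hypothetical, and the real configuration key is elided in the quoted text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of the hierarchy-store allocation policy
// (hypothetical names; not the actual SPARK-12196 implementation).
public class HierarchySketch {
    // Parse a layer spec like "nvm 40GB,ssd 20GB" into keyword -> threshold
    // bytes, preserving declaration order (order is the layer priority).
    public static LinkedHashMap<String, Long> parseLayers(String spec) {
        LinkedHashMap<String, Long> layers = new LinkedHashMap<>();
        for (String part : spec.split(",")) {
            String[] kv = part.trim().split("\\s+");
            String num = kv[1].replaceAll("[^0-9]", "");
            String unit = kv[1].replaceAll("[0-9]", "");
            long scale = unit.equals("GB") ? (1L << 30)
                       : unit.equals("MB") ? (1L << 20) : 1L;
            layers.put(kv[0], Long.parseLong(num) * scale);
        }
        return layers;
    }

    // Allocate from the first layer whose usable space is still at or above
    // its threshold; otherwise fall through to the final "rest" layer.
    public static String selectLayer(LinkedHashMap<String, Long> layers,
                                     Map<String, Long> usableBytes) {
        for (Map.Entry<String, Long> e : layers.entrySet()) {
            if (usableBytes.getOrDefault(e.getKey(), 0L) >= e.getValue()) {
                return e.getKey();
            }
        }
        return "rest";
    }
}
```

With the spec `'nvm 40GB,ssd 20GB'`, blocks would go to nvm while it has at least 40GB usable, then to ssd while it has at least 20GB usable, and finally to the remaining disks.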

This message was sent by Atlassian JIRA

