spark-user mailing list archives

From Sun Rui <sunrise_...@163.com>
Subject Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?
Date Wed, 24 Aug 2016 14:09:21 GMT
For HDFS, maybe you can try mounting HDFS as NFS. But I am not sure about the stability, and
there is also additional overhead from network I/O and from HDFS replicating the files.
> On Aug 24, 2016, at 21:02, Saisai Shao <sai.sai.shao@gmail.com> wrote:
> 
> Spark shuffle uses the Java File API to create local dirs and read/write data, so it
> only works with file systems the OS supports. It doesn't use the Hadoop FileSystem API, so
> writing to a Hadoop-compatible FS does not work.
> 
> Also, a distributed FS is not a suitable place for temporary shuffle data; writing it there
> brings unnecessary overhead. In your case, if you have a lot of memory on each node, you
> could use ramfs to store the shuffle data instead.
> 
> Thanks
> Saisai
> 
> On Wed, Aug 24, 2016 at 8:11 PM, tony.yan@tendcloud.com <tony.yan@tendcloud.com> wrote:
> Hi all,
> When we run Spark on very large data, Spark shuffles and writes the shuffle data to
> local disk. Because our local disk capacity is limited, the shuffle data fills the local
> disk and the job then fails. Is there a way to write the shuffle spill data to HDFS? Or,
> if we introduce Alluxio into our system, can the shuffle data be written to Alluxio?
> 
> Thanks and Regards,
> 
> Yan Zhitao (Tony)
> 
> Beijing Tendcloud Tianxia Technology Co., Ltd.
> --------------------------------------------------------------------------------------------------------
> Email: tony.yan@tendcloud.com
> Phone: 13911815695
> WeChat: zhitao_yan
> QQ: 4707059
> Address: Room 602, Aviation Service Building, Building 2, No. 39 Dongzhimenwai Street, Dongcheng District, Beijing
> Postal code: 100027
> --------------------------------------------------------------------------------------------------------
> TalkingData.com <http://talkingdata.com/> - Let data speak
> 
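
To illustrate the point quoted above about the Java File API, here is a minimal sketch showing that java.io.File only accepts local "file" locations. It is not Spark's actual shuffle code. The class name, method name, host names, ports and paths are all made up for the example.

```java
import java.io.File;
import java.net.URI;

// Hypothetical helper, not part of Spark: shows why a java.io.File-based
// writer cannot target hdfs:// or alluxio:// locations.
class LocalFileOnly {

    // Returns true if java.io.File can be constructed from the given URI string.
    // File(URI) throws IllegalArgumentException unless the scheme is "file".
    static boolean usableByJavaFile(String location) {
        try {
            new File(URI.create(location));
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Host names and ports below are placeholders.
        System.out.println(usableByJavaFile("hdfs://namenode:8020/tmp/spark-shuffle"));  // false
        System.out.println(usableByJavaFile("alluxio://master:19998/spark-shuffle"));    // false
        System.out.println(usableByJavaFile("file:///tmp/spark-local"));                 // true
    }
}
```

A path on an NFS or ramfs mount is just a local path to the JVM, which is why those two suggestions in this thread can work where an hdfs:// URI cannot.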


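
For the ramfs suggestion in this thread, a configuration sketch might look like the following. It assumes a RAM-backed file system is already mounted on every worker; the mount point /mnt/spark-ram is hypothetical. spark.local.dir is the Spark property for scratch space, which includes shuffle files.

```
# spark-defaults.conf (sketch; /mnt/spark-ram is a hypothetical mount point)
# Scratch space for shuffle and spill files. Accepts a comma-separated list of directories.
spark.local.dir    /mnt/spark-ram
```

Note that the cluster manager can override this setting: SPARK_LOCAL_DIRS in standalone mode, and LOCAL_DIRS on YARN, where the directories come from the NodeManager's local-dirs configuration. Also keep in mind that ramfs has no size limit, so the shuffle data competes with executors for memory; tmpfs lets you cap the size.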