spark-issues mailing list archives

From "Wang, Gang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-13004) Support Non-Volatile Data and Operations
Date Tue, 26 Jan 2016 20:09:39 GMT

     [ https://issues.apache.org/jira/browse/SPARK-13004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wang, Gang updated SPARK-13004:
-------------------------------
    Description: 
Based on our experiments, SerDe-like operations have a significant negative performance impact on the majority of industrial Spark workloads, especially when the volume of the datasets is much larger than the system memory available to the Spark cluster for caching, checkpointing, shuffling/dispatching, and data loading and storing. JVM on-heap management also degrades performance under the pressure of large memory demands and frequent memory allocation/free operations.
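
As a concrete illustration of the SerDe cost above, the sketch below (plain JVM, no Spark dependency; the Point record type is hypothetical) shows the serialize/deserialize round-trip that serialized caching and shuffle imply for every record:

```java
import java.io.*;

// A minimal sketch of the SerDe round-trip implied by serialized caching
// and shuffle: every write pays a serialize cost, every read pays a
// deserialize cost, and the short-lived byte arrays and object graphs
// created along the way add GC pressure.
public class SerDeCost {
    // Hypothetical record type standing in for one RDD element.
    static class Point implements Serializable {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    static byte[] serialize(Point p) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(p);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Point deserialize(byte[] bytes) {
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Point) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Point p = new Point(1.0, 2.0);
        byte[] bytes = serialize(p);         // paid on every cache/shuffle write
        Point restored = deserialize(bytes); // paid on every read
        System.out.println("serialized size: " + bytes.length + " bytes");
        System.out.println(restored.x + ", " + restored.y);
    }
}
```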

With the trend toward advanced server platform technologies, e.g. large-memory servers, non-volatile memory, and NVMe/fast SSD array storage, this project focuses on adopting the new features these platforms provide for Spark applications and retrofitting the use of hybrid addressable memory resources onto Spark wherever possible.

*Data Object Management*

  * Using our non-volatile generic object programming model (NVGOP) to avoid SerDe and reduce GC overhead.
  * Minimizing the memory footprint by loading data lazily.
  * Fitting naturally with RDD schemas in non-volatile and off-heap RDDs.
  * Using non-volatile/off-heap RDDs to transform Spark datasets.
  * Avoiding the in-memory caching step via in-place non-volatile RDD operations.
  * Avoiding checkpoints for Spark computation.
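
The in-place idea behind the bullets above can be sketched as follows; this illustrates the general off-heap fixed-layout technique, not the actual NVGOP API:

```java
import java.nio.ByteBuffer;

// A hedged illustration of in-place record access: records live in an
// off-heap buffer in a fixed layout, and fields are read directly at a
// computed offset, with no deserialization and no per-record heap objects
// for the GC to track. On a non-volatile device the same layout could
// persist across runs, removing the separate caching/checkpoint step.
public class OffHeapRecords {
    static final int RECORD_BYTES = 2 * Double.BYTES; // layout: x, y

    final ByteBuffer buf; // direct buffer = off-heap storage

    OffHeapRecords(int capacity) {
        buf = ByteBuffer.allocateDirect(capacity * RECORD_BYTES);
    }

    void put(int i, double x, double y) {
        buf.putDouble(i * RECORD_BYTES, x);
        buf.putDouble(i * RECORD_BYTES + Double.BYTES, y);
    }

    double x(int i) { return buf.getDouble(i * RECORD_BYTES); }
    double y(int i) { return buf.getDouble(i * RECORD_BYTES + Double.BYTES); }

    public static void main(String[] args) {
        OffHeapRecords recs = new OffHeapRecords(2);
        recs.put(0, 1.5, 2.5);
        recs.put(1, 3.0, 4.0);
        // In-place read: no SerDe step between storage and use.
        System.out.println(recs.x(0) + recs.y(1)); // prints 5.5
    }
}
```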

*Data Memory Management*
  
  * Managing heterogeneous memory devices as a unified hybrid memory cache pool for Spark.
  * Using non-volatile memory-like devices for Spark checkpointing and shuffling.
  * Reclaiming allocated memory blocks automatically.
  * Providing a unified memory block API for general-purpose memory usage.
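
One possible shape for such a unified memory block API, with the tier names and method signatures assumed purely for illustration rather than taken from the proposal:

```java
import java.nio.ByteBuffer;

// A hypothetical sketch of a unified memory block API over heterogeneous
// devices. Blocks are AutoCloseable so try-with-resources reclaims them
// deterministically instead of leaving reclamation to the GC.
public class HybridPool {
    enum Tier { DRAM, NVM } // NVM stands in for any non-volatile device

    private int liveBlocks = 0;

    class Block implements AutoCloseable {
        final Tier tier;
        final ByteBuffer data; // direct buffer = off-heap backing store
        private boolean closed = false;

        Block(Tier tier, int size) {
            this.tier = tier;
            this.data = ByteBuffer.allocateDirect(size);
        }

        @Override public void close() {
            // A real pool would recycle the underlying device memory here.
            if (!closed) { closed = true; liveBlocks--; }
        }
    }

    Block allocate(Tier tier, int size) {
        liveBlocks++;
        return new Block(tier, size);
    }

    int live() { return liveBlocks; }

    public static void main(String[] args) {
        HybridPool pool = new HybridPool();
        try (Block b = pool.allocate(Tier.DRAM, 64)) {
            b.data.putLong(0, 42L);
            System.out.println(b.data.getLong(0)); // prints 42
        } // reclaimed here, independent of GC
        System.out.println(pool.live()); // prints 0
    }
}
```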
  
*Computing Device Management*

  * AVX instructions, programmable FPGAs, and GPUs.
  

Our customized Spark prototype has shown some potential improvements:
[https://github.com/NonVolatileComputing/spark/tree/NonVolatileRDD]
!http://bigdata-memory.github.io/images/Spark_mlib_kmeans.png|width=300!
!http://bigdata-memory.github.io/images/total_GC_STW_pausetime.png|width=300!
  
This epic aims to further improve Spark performance with our non-volatile solutions.





> Support Non-Volatile Data and Operations
> ----------------------------------------
>
>                 Key: SPARK-13004
>                 URL: https://issues.apache.org/jira/browse/SPARK-13004
>             Project: Spark
>          Issue Type: Epic
>          Components: Input/Output, Spark Core
>    Affects Versions: 1.5.0, 1.6.0
>            Reporter: Wang, Gang
>              Labels: Non-VolatileRDD, Non-volatileComputing, RDD, performance
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


