giraph-dev mailing list archives

From "Hassan Eslami (JIRA)" <>
Subject [jira] [Commented] (GIRAPH-1073) Decouple out-of-core persistence infrastructure from out-of-core computation
Date Wed, 15 Jun 2016 18:28:09 GMT


Hassan Eslami commented on GIRAPH-1073:

> Decouple out-of-core persistence infrastructure from out-of-core computation
> ----------------------------------------------------------------------------
>                 Key: GIRAPH-1073
>                 URL:
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Hassan Eslami
>            Assignee: Hassan Eslami
> In the current out-of-core infrastructure, the persistence layer is heavily intertwined
with the scheduling and out-of-core engine. This makes it complicated to try new features
for the persistence layer. The following changes are needed:
>  * The persistence layer should be decoupled from out-of-core infrastructure. This way
one can simply implement and plug in different data accessors for various persistence resources,
e.g. a local file system data accessor, an HDFS data accessor, or a serialized in-memory data accessor.
>  * We should be able to address out-of-core data in a more efficient and flexible way.
Currently, data are accessed/addressed through string literals in various locations of the
code. This should be changed so data can be accessed through a unified, more flexible data
indexing mechanism.
>  * With different implementations of data accessor, now there may be more emphasis on
having more IO threads. It is important that these IO threads are load-balanced. Currently,
partitions are assigned to IO threads using a hash function. Hash functions tend not to balance
load well with a small number of data points (partitions in this case).
>  * Currently, out-of-core uses `BufferedInputStream` and `BufferedOutputStream` along
with the default (de)serialization mechanism. The IO bandwidth achieved in the current implementation
is low. One can simply use: 1) an Unsafe (de)serialization mechanism to optimize for memory bandwidth
during the (de)serialization process, 2) `RandomAccessFile`'s read and write interface to have lower
level access to the local file system and avoid overheads in reading/writing from/to local

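To illustrate the load-balancing point in the description above, here is a minimal sketch comparing hash-based assignment of partitions to IO threads with a simple round-robin assignment. It is not taken from the Giraph code base: the class name `IoThreadAssignment`, its methods, and the partition ids are all hypothetical, and round-robin is only one possible alternative, not necessarily what the issue ends up using. The ids are deliberately chosen to show a bad case for hashing, where every id maps to the same thread.

```java
import java.util.Arrays;

/** Hypothetical sketch: two ways of assigning partitions to IO threads. */
public class IoThreadAssignment {

  /** Hash-style assignment: thread = hash(partitionId) mod numThreads. */
  public static int[] hashLoads(int[] partitionIds, int numThreads) {
    int[] loads = new int[numThreads];
    for (int id : partitionIds) {
      // Integer.hashCode(int) is the identity, so this is id mod numThreads.
      loads[Math.floorMod(Integer.hashCode(id), numThreads)]++;
    }
    return loads;
  }

  /** Round-robin assignment: the i-th partition goes to thread i mod numThreads. */
  public static int[] roundRobinLoads(int[] partitionIds, int numThreads) {
    int[] loads = new int[numThreads];
    for (int i = 0; i < partitionIds.length; i++) {
      loads[i % numThreads]++;
    }
    return loads;
  }

  /** Difference between the most and the least loaded thread. */
  public static int imbalance(int[] loads) {
    int min = Integer.MAX_VALUE;
    int max = Integer.MIN_VALUE;
    for (int load : loads) {
      min = Math.min(min, load);
      max = Math.max(max, load);
    }
    return max - min;
  }

  public static void main(String[] args) {
    // Made-up partition ids held by one worker; all are multiples of 4.
    int[] partitionIds = {0, 4, 8, 12, 16, 20};
    int numThreads = 4;
    System.out.println("hash:        "
        + Arrays.toString(hashLoads(partitionIds, numThreads)));
    System.out.println("round-robin: "
        + Arrays.toString(roundRobinLoads(partitionIds, numThreads)));
  }
}
```

With these six partitions and four threads, the hash-based variant puts all six partitions on thread 0 (`[6, 0, 0, 0]`), while round-robin yields `[2, 2, 1, 1]`. Real partition ids will rarely be this adversarial, but with only a handful of partitions per worker even a good hash can leave some IO threads idle.
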
This message was sent by Atlassian JIRA
