chukwa-dev mailing list archives

From "Eric Yang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CHUKWA-667) Optimize the HBase schema for Ganglia queris
Date Sun, 01 Feb 2015 10:50:34 GMT

    [ https://issues.apache.org/jira/browse/CHUKWA-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14300152#comment-14300152 ]

Eric Yang commented on CHUKWA-667:
----------------------------------

Hi Sreepathi,

In general, each column family is stored in its own directory in HDFS.  The common access pattern
is by column rather than by row.  Therefore, using more than one column family is fine as long
as a scan stays within a single column family; in that case there is no performance penalty.
However, using a secondary table to store the metric name has its own problems.  The ID needs to
be padded.  Otherwise a composed query for "|100" can also return "|1000", because the keys are
not padded to the same length.  When storing a large number of key types, padded IDs take only
slightly less storage than storing the metric name directly in the cell name.  Lookup, however,
is faster with a shorter row key to locate the region, followed by linear deserialization of data
from the same data blocks, than with 2 connection requests to decode the ID followed by a linear
scan of row keys to find the closest row key to start from.  A linear row key scan is slower than
grabbing a block of data from a column family for a row.
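
To illustrate the padding problem, here is a minimal sketch in plain Python.  It does not use
HBase; it only imitates a prefix scan over lexicographically sorted row keys.  The key layout
"host|id", the host name "host1", the padding width of 6, and the helper names are assumptions
made up for this example, not the Chukwa schema.

```python
from bisect import bisect_left


def prefix_scan(sorted_keys, prefix):
    """Return every key that starts with prefix, imitating a prefix scan
    over lexicographically sorted row keys."""
    start = bisect_left(sorted_keys, prefix)
    hits = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        hits.append(key)
    return hits


def make_key(host, metric_id, width=0):
    """Build a hypothetical composed key "host|id"; width > 0 zero-pads the ID."""
    return f"{host}|{str(metric_id).zfill(width)}"


metric_ids = (100, 200, 1000, 1001)

# Unpadded IDs: a query for ID 100 also matches 1000 and 1001.
unpadded = sorted(make_key("host1", i) for i in metric_ids)
unpadded_hits = prefix_scan(unpadded, make_key("host1", 100))

# IDs padded to a fixed width: the same query matches exactly one key.
padded = sorted(make_key("host1", i, width=6) for i in metric_ids)
padded_hits = prefix_scan(padded, make_key("host1", 100, width=6))

print(unpadded_hits)
print(padded_hits)
```

With fixed-width IDs every key has the same length, so one ID can never be a prefix of another.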

> Optimize the HBase schema for Ganglia queris
> --------------------------------------------
>
>                 Key: CHUKWA-667
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-667
>             Project: Chukwa
>          Issue Type: Sub-task
>          Components: Data Processors
>    Affects Versions: 0.6.0
>            Reporter: Saisai Shao
>
> Chukwa HBase table schema is designed for HICC; it cannot be fully adapted to the Ganglia web frontend for several reasons:
> (1) it cannot quickly retrieve all the cluster and related host names.
> (2) system metrics have no attributes, such as type or unit, so it is hard to interpret the collected metrics in code.
> (3) it lacks a data consolidation function; choosing a metric for a large time range (like 30 days) will fetch all the data and draw the graph, which significantly degrades performance.
> We will redesign the table schema so that it is better adapted to Ganglia web frontend queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
