phoenix-dev mailing list archives

From "Ankit Singhal (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PHOENIX-2143) Use guidepost bytes instead of region name in stats primary key
Date Wed, 23 Dec 2015 09:04:46 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069406#comment-15069406 ]

Ankit Singhal commented on PHOENIX-2143:
----------------------------------------

bq. but I don't think you'll need to change GuidePostsInfo to List<GuidePostsInfo> as
GuidePostsInfo encapsulates all the guideposts across a table per column family. This object
is sent across the wire as part of the PTable (the metadata for a table) and we'll want to
continue doing that (the client caches the guideposts across the entire table). I don't think
this change will impact the client-side much (other than perhaps a few minor tweaks).

But if we still keep GuidePostsInfo per CF, how will we distribute the row count and byte
count across the guideposts? Or are these metrics still needed at the CF level?

bq. Removing any stats-related logic when a split occurs. Nothing will be required during
a split.

Since we are keeping the region name as another column in stats, we will need to update the
region name for the guideposts after a split, right?

bq. We previously could delete the row that stored all guideposts for a given table/region/cf,
but this will no longer be possible

We can still delete the rows by using the region column. Or are you suggesting that it is
better to just keep the guideposts consistent, without ensuring that they belong to particular
regions and their boundaries?
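
To make the two deletion options concrete, here is a minimal in-memory sketch. It is a toy model, not Phoenix code: the row layout, the `region_name` field, and both helper functions are hypothetical, and the only assumption carried over from HBase is that an empty end key means "unbounded". It shows that a region-name column goes stale after a split unless it is rewritten, while deleting by key range does not depend on region names at all.

```python
# Toy model of per-row guideposts; all names here are hypothetical.

def delete_by_region(rows, region_name):
    """Drop every guidepost row tagged with the given region name."""
    return [r for r in rows if r["region_name"] != region_name]


def delete_by_key_range(rows, start_key, end_key):
    """Drop every guidepost row whose key lies in [start_key, end_key).

    An empty end_key is treated as unbounded (assumed HBase convention).
    """
    def in_range(key):
        return key >= start_key and (end_key == b"" or key < end_key)

    return [r for r in rows if not in_range(r["guide_post_key"])]


rows = [
    {"guide_post_key": b"c", "region_name": "r1"},
    {"guide_post_key": b"g", "region_name": "r1"},
    {"guide_post_key": b"p", "region_name": "r2"},
]

# Suppose region r1 covering [a, m) splits into r1a [a, e) and r1b [e, m),
# and the stats rows are NOT rewritten, so they still carry "r1".
after_region = delete_by_region(rows, "r1a")         # matches nothing: stale names
after_range = delete_by_key_range(rows, b"a", b"e")  # removes only key b"c"
```

In this model the region column is only usable for deletion if it is updated on every split, which is the maintenance cost the comment above asks about.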





> Use guidepost bytes instead of region name in stats primary key
> ---------------------------------------------------------------
>
>                 Key: PHOENIX-2143
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2143
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: James Taylor
>            Assignee: Ankit Singhal
>         Attachments: PHOENIX-2143_wip.patch
>
>
> Our current SYSTEM.STATS table uses the region name as the last column in the primary
> key constraint. Instead, we should use the MIN_KEY column (which corresponds to the region
> start key). The advantage would be that the stats would then be ordered by region start key
> allowing us to approximate the number of guideposts which would be traversed given the start/stop
> row of a scan:
> {code}
> SELECT SUM(guide_posts_count) FROM SYSTEM.STATS WHERE min_key > :1 AND min_key < :2
> {code}
> where :1 is the start row and :2 is the stop row of the scan. With an UNNEST operator
> for ARRAYs, we could get a better approximation.
> As part of the upgrade to the new Phoenix version containing this fix, stats could simply
> be dropped and they'd be recalculated with the new schema.
> An alternative, even more granular approach would be to *not* use arrays to store the
> guide posts, but instead store them as individual rows with a schema like this.
> |PHYSICAL_NAME|VARCHAR|
> |COLUMN_FAMILY|VARCHAR|
> |GUIDE_POST_KEY|VARBINARY|
> In this alternative, the maintenance during compaction is higher, though, as you'd need
> to run a separate query to do the deletion of the old guideposts, followed by a commit of
> the new guideposts. The other disadvantage (besides requiring multiple queries) is that this
> couldn't be done transactionally.
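
As an illustration of the estimate described in the issue, the following is a small sketch of counting the guideposts that fall between a scan's start and stop rows when guideposts are kept as individual keys in sorted order. It is not Phoenix code: the function name and the sample keys are made up, and it assumes keys compare as raw bytes, using the same strict bounds as the quoted query.

```python
from bisect import bisect_left, bisect_right


def guideposts_in_scan(guide_post_keys, start_row, stop_row):
    """Count guideposts with start_row < key < stop_row.

    guide_post_keys must be a sorted list of bytes objects.
    """
    lo = bisect_right(guide_post_keys, start_row)  # index of first key > start_row
    hi = bisect_left(guide_post_keys, stop_row)    # index of first key >= stop_row
    return max(0, hi - lo)


# Hypothetical guidepost keys for one table and column family.
keys = [b"b", b"d", b"f", b"h", b"k"]
count = guideposts_in_scan(keys, b"c", b"i")  # counts b"d", b"f", b"h"
```

Because the keys are sorted, the count needs only two binary searches rather than a pass over every guidepost, which is the benefit of ordering the stats by key rather than by region name.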



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
