phoenix-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PHOENIX-180) Use stats to guide query parallelization
Date Thu, 25 Sep 2014 19:03:33 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148149#comment-14148149 ]

Hudson commented on PHOENIX-180:
--------------------------------

SUCCESS: Integrated in Phoenix | 3.0 | Hadoop1 #227 (See [https://builds.apache.org/job/Phoenix-3.0-hadoop1/227/])
PHOENIX-180 Use stats to guide query parallelization (remove mistakenly checked-in files)
(maryannxue: rev b4811ad7f67cfee027692ee27a503898cd75fdcf)
* phoenix-core/src/main/java/org/apache/phoenix/schema/stat/PTableStatsImpl.java.orig
* phoenix-core/src/main/java/org/apache/phoenix/schema/stat/PTableStatsImpl.java.rej


> Use stats to guide query parallelization
> ----------------------------------------
>
>                 Key: PHOENIX-180
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-180
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: James Taylor
>            Assignee: ramkrishna.s.vasudevan
>              Labels: enhancement
>             Fix For: 5.0.0, 4.2, 3.2
>
>         Attachments: Phoenix-180_3.0.patch, Phoenix-180_V1.patch, Phoenix-180_V2.patch, Phoenix-180_WIP.patch, Phoenix-180_v3.patch, Phoenix-180_v5.patch
>
>
> We currently use no stats to guide parallelization beyond a table-wide min key/max key cached per client connection. If a query targets just a few regions, we don't know how to divide the work evenly among threads, because we don't know the data distribution. A separate issue (https://github.com/forcedotcom/phoenix/issues/64) covers gathering and maintaining the stats, while this issue is focused on using them.
> The main changes are:
> 1. Create a PTableStats interface that encapsulates the stats information (and implements the Writable interface so that it can be serialized back from the server).
> 2. Add a stats member variable to PTable to hold this.
> 3. From MetaDataEndPointImpl, look up the stats row for the table in the stats table. If the stats have changed, return a new PTable with the updated stats information. We may want to cache the stats row and have the stats gatherer invalidate the cached row when it is updated, so we don't always have to scan for it. Additionally, it would be ideal if we could use the same split policy on the stats table that we use on the system table to guarantee co-location of data (for the sake of caching).
> 4. Modify the client-side parallelization (ParallelIterators.getSplits()) to use this information to guide how to chunk up the scans at query time.
> This should help boost query performance, especially in cases where the data is highly skewed. Skew is likely the cause of the slowness reported in this issue: https://github.com/forcedotcom/phoenix/issues/47.
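
As a rough illustration of the last step in the description above (chunking scans using stats), here is a minimal sketch in Python. It is not Phoenix code and does not reflect the actual ParallelIterators.getSplits() implementation. It assumes the stats can be reduced to a sorted list of sample row keys spaced at roughly equal data volume (called "guideposts" here, a hypothetical name), and the function name split_scan is likewise made up for this example.

```python
def split_scan(start, stop, region_boundaries, guideposts):
    """Split the scan range [start, stop) into contiguous sub-ranges.

    region_boundaries: row keys at which regions begin (assumed input).
    guideposts: sample row keys taken from the stats, assumed to be
        spaced at roughly equal data volume, so dense key ranges
        contain more of them.
    Returns a list of (chunk_start, chunk_stop) tuples covering the range.
    """
    # Keep only the split points that fall strictly inside the scan range;
    # the set also removes keys that are both a boundary and a guidepost.
    points = {k for k in list(region_boundaries) + list(guideposts)
              if start < k < stop}
    edges = [start] + sorted(points) + [stop]
    # Pair each edge with its successor to form the chunks.
    return list(zip(edges[:-1], edges[1:]))


if __name__ == "__main__":
    # One region boundary at "m"; the data is skewed toward keys "c".."e".
    for chunk in split_scan("a", "z", ["m"], ["c", "d", "e", "q"]):
        print(chunk)
```

With these inputs the range is cut into six chunks, four of them inside the first region where the guideposts cluster. Without stats, only the region boundary at "m" would be available and the dense part of the key space would be handled as a single chunk.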



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
