phoenix-dev mailing list archives

From "Lars Hofhansl (JIRA)" <>
Subject [jira] [Commented] (PHOENIX-1940) Push expected List<Cell> ordinal position in KeyValueColumnExpression
Date Sun, 13 Sep 2015 01:08:45 GMT


Lars Hofhansl commented on PHOENIX-1940:

As a test I shortened all the CQs like so:
LN   INTEGER not null,
QTY  DECIMAL(15,2),
EP   DECIMAL(15,2),
DSC  DECIMAL(15,2),
TAX  DECIMAL(15,2),
RF   CHAR(1),
LS   CHAR(1),
SD   DATE,
SI   CHAR(25),
SM   CHAR(10),
CO   VARCHAR(44),
constraint pk primary key (ok, ln));
The result is 1.22GB in size (as opposed to 1.77GB with the longer names).

To my surprise the query is _not_ noticeably faster! Most of the time is still spent in,
followed by SQM.match, followed by FastDiffDeltaEncode.decodeNext.

The good news is that now accounts for about 30% of the CPU time
(was around 60% before), but there's still a lot of work to do (in both Phoenix and HBase).

> Push expected List<Cell> ordinal position in KeyValueColumnExpression
> ---------------------------------------------------------------------
>                 Key: PHOENIX-1940
>                 URL:
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor
> Looks like quite a bit of time is spent in the binary search done to get the latest Cell
> value when we're evaluating expressions on the server side (up to 60% is spent in KeyValueUtil.getColumnLatest()).
> Since we know the set of column qualifiers being projected into the scan, we could push the
> expected position (assuming all columns have values). If the Cell is not in that position,
> we could fall back to a binary search.
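
The "expected position first, binary search as fallback" idea can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not Phoenix or HBase code: SimpleCell, PositionalLookup and find are hypothetical names, and qualifiers are modelled as Strings sorted lexicographically rather than as byte arrays inside real Cells.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names exist in Phoenix or HBase.
class PositionalLookup {

    // Simplified stand-in for an HBase Cell: just a column qualifier and a value.
    static final class SimpleCell {
        final String qualifier;
        final String value;

        SimpleCell(String qualifier, String value) {
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    // Looks up `qualifier` in a row whose cells are sorted by qualifier.
    // Checks `expectedPos` first; if the cell is not there, falls back to binary search.
    // Returns null when the row has no cell for the qualifier.
    static SimpleCell find(List<SimpleCell> row, String qualifier, int expectedPos) {
        if (expectedPos >= 0 && expectedPos < row.size()
                && row.get(expectedPos).qualifier.equals(qualifier)) {
            return row.get(expectedPos); // fast path: one comparison
        }
        int lo = 0;
        int hi = row.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = row.get(mid).qualifier.compareTo(qualifier);
            if (cmp == 0) {
                return row.get(mid);
            } else if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<SimpleCell> row = Arrays.asList(
                new SimpleCell("DSC", "0.10"),
                new SimpleCell("EP", "100.00"),
                new SimpleCell("QTY", "2"),
                new SimpleCell("TAX", "0.05"));
        // Position 1 is where EP is expected when every column has a value.
        System.out.println(find(row, "EP", 1).value);
        // Wrong guess for TAX: the fallback search still finds it.
        System.out.println(find(row, "TAX", 2).value);
    }
}
```

When every projected column has a value, the fast path costs a single qualifier comparison per column; a missing or extra (dynamic) column only shifts positions and triggers the fallback.
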
> Further enhancements could be to: allow a not null constraint on KeyValue columns and
> either a) require all non null values to be provided on an UPSERT, or b) do a check and put
> to enforce it (for transactional tables this could be enforced).
> Additionally, the table could declare that dynamic columns are not allowed. If both of
> the above are true, then we'd be able to guarantee positional access to the List<Cell> that
> we get back from an HBase Scanner.
> One further enhancement would be to collect a set of all ColumnExpression instances on
> the server side for all expressions sent over. Then, we'd bind them once, outside of the general
> expression evaluation of all expressions in a statement for a given row. An example of where
> this would save time would be in evaluating the following TPCH-Q1 aggregate query:
> {code}
> select
>     l_returnflag,
>     l_linestatus,
>     sum(l_quantity) as sum_qty,
>     sum(l_extendedprice) as sum_base_price,
>     sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
>     sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
>     avg(l_quantity) as avg_qty,
>     avg(l_extendedprice) as avg_price,
>     avg(l_discount) as avg_disc,
>     count(*) as count_order
> from
>     lineitem
> where
>     l_shipdate <= date '1998-12-01' - interval '90' day
> group by
>     l_returnflag,
>     l_linestatus
> order by
>     l_returnflag,
>     l_linestatus;
> {code}
> During aggregation, the KeyValueColumnExpression for l_extendedprice is currently evaluated
> four times, once per occurrence in different SELECT expressions. This enhancement
> would cut that down to once.
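
A rough sketch of the "bind once per row" idea, using the column references from the Q1 select list. Everything here is hypothetical and simplified: BindOnceSketch, lookup, bindOnce and charge are illustrative names, a row is modelled as a Map, and the lookup counter stands in for the per-column cell search.

```java
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the Phoenix expression framework.
class BindOnceSketch {

    // Counts simulated cell lookups so the saving is visible.
    static int lookups = 0;

    // Stand-in for resolving one column's latest value in a row.
    static BigDecimal lookup(Map<String, BigDecimal> row, String column) {
        lookups++;
        return row.get(column);
    }

    // Resolves every distinct referenced column exactly once for this row.
    static Map<String, BigDecimal> bindOnce(Map<String, BigDecimal> row, List<String> referenced) {
        Map<String, BigDecimal> bound = new LinkedHashMap<>();
        for (String column : referenced) {
            if (!bound.containsKey(column)) {
                bound.put(column, lookup(row, column));
            }
        }
        return bound;
    }

    // The sum_charge term: l_extendedprice * (1 - l_discount) * (1 + l_tax),
    // computed from the already bound values without further lookups.
    static BigDecimal charge(Map<String, BigDecimal> bound) {
        return bound.get("l_extendedprice")
                .multiply(BigDecimal.ONE.subtract(bound.get("l_discount")))
                .multiply(BigDecimal.ONE.add(bound.get("l_tax")));
    }
}
```

The Q1 select list references l_quantity twice, l_extendedprice four times, l_discount three times and l_tax once. Passing those ten references to bindOnce performs four lookups for the row instead of ten, and every expression then reads from the bound values.
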

