kylin-issues mailing list archives

From "hongbin ma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KYLIN-1313) Enable deriving dimensions on non PK/FK
Date Wed, 13 Jan 2016 11:58:39 GMT

    [ https://issues.apache.org/jira/browse/KYLIN-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096073#comment-15096073 ]

hongbin ma commented on KYLIN-1313:
-----------------------------------

hi

I'm currently working on it on 2.x-staging; I'm not sure whether we'll backport it to 1.x
versions. For future reference I'll call the new feature "heavy deriving", in contrast to the
previous lightweight deriving, which only requires looking up the deriving relation in a
lookup table snapshot.

Before heavy deriving is released, we suggested not including the item_url dim in the cube
and instead using an external KV system to extend each result record from (X,Y,item_id) to
(X,Y,item_id,item_url). We soon realized that this asks too much of normal users, and that it
would be best to provide a one-stop solution.

The implementation of heavy deriving is non-trivial, because we cannot maintain the
item_id => item_url mapping in memory. It's also not a good idea to save the mapping in an
external KV store, as that would degrade the performance of cube building and querying. Our
current plan is to make a trade-off between functionality and performance: we introduce the
basic assumption that derived columns will not participate in any kind of filter. For the
item_id => item_url case, the user CANNOT specify a filter on the item_url dim. The only
thing the heavy derived dim item_url enables is that when your final result contains item_id,
you can retrieve item_url along with it, that's all. (Of course there's a hidden assumption
here: item_url is uniquely determined by each item_id, because it is derived!)

With the assumption(s) in mind, we can save the item_url as a special measure. Take a cuboid
with 2 dimensions, dt and item_id, as an example; a tuple in the cuboid should exhibit the
pattern:

Key: 2015-1-1,4234324
Value: Metric1,Metric2,http://items.ebay.com/4234324

where http://items.ebay.com/4234324 is the item_url for item_id 4234324.
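
To make the layout concrete, here is a minimal Python sketch of how such a cuboid could be
built. It is only an illustration, not Kylin code: the fact rows, the build_cuboid helper and
the in-memory dict are all hypothetical, and the metrics are assumed to be plain sums.

```python
# Hypothetical fact rows: (dt, item_id, item_url, count, price).
FACT_ROWS = [
    ("2015-1-1", 4234324, "http://items.ebay.com/4234324", 1, 10.0),
    ("2015-1-1", 4234324, "http://items.ebay.com/4234324", 2, 5.0),
    ("2015-1-2", 4234324, "http://items.ebay.com/4234324", 1, 7.5),
]


def build_cuboid(rows):
    """Aggregate rows by (dt, item_id), carrying item_url as an extra measure.

    The regular metrics are summed (an assumption made for this sketch).
    item_url is not aggregated: it is copied once per key, and a conflicting
    value is rejected because it would break the deriving assumption.
    """
    cuboid = {}
    for dt, item_id, item_url, count, price in rows:
        key = (dt, item_id)
        if key not in cuboid:
            cuboid[key] = [count, price, item_url]
            continue
        value = cuboid[key]
        if value[2] != item_url:
            raise ValueError(
                "item_url is not uniquely determined by item_id %r" % (item_id,)
            )
        value[0] += count
        value[1] += price
    return cuboid


CUBOID = build_cuboid(FACT_ROWS)
for key, value in CUBOID.items():
    print("Key:", key, "Value:", value)
```

Each value keeps the ordinary metrics first and the derived item_url last, mirroring the
Key/Value pattern shown above.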

At query time, we'll use another IDerivedColumnFiller to retrieve the item_url value from
the cuboid tuple and return both item_id and item_url to the user (if item_url is requested).
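
The query-time behaviour could look roughly like the Python sketch below. It does not use the
real IDerivedColumnFiller interface; the query function, the DERIVED_COLUMNS map and the
sample cuboid are hypothetical names invented for illustration. It only shows the two rules
stated above: filters on a derived column are rejected, and the derived value is read
straight from the cuboid tuple when its host column is in the result.

```python
# Hypothetical metadata: derived column -> host column it is derived from.
DERIVED_COLUMNS = {"item_url": "item_id"}

# Hypothetical cuboid on (dt, item_id); item_url is stored as a special measure.
DIMENSIONS = ("dt", "item_id")
CUBOID = {
    ("2015-1-1", 4234324): {
        "count": 3,
        "price": 15.0,
        "item_url": "http://items.ebay.com/4234324",
    },
    ("2015-1-1", 5555555): {
        "count": 1,
        "price": 2.0,
        "item_url": "http://items.ebay.com/5555555",
    },
}


def query(select, filters=None):
    """Return the selected columns for every cuboid tuple matching the filters."""
    filters = filters or {}
    for column in filters:
        if column in DERIVED_COLUMNS:
            raise ValueError("cannot filter on derived column %r" % column)
    for column in select:
        host = DERIVED_COLUMNS.get(column)
        if host is not None and host not in select:
            raise ValueError("%r requires %r in the result" % (column, host))

    results = []
    for key, measures in CUBOID.items():
        row = dict(zip(DIMENSIONS, key))
        row.update(measures)  # the derived value comes from the tuple itself
        if all(row[column] == wanted for column, wanted in filters.items()):
            results.append(tuple(row[column] for column in select))
    return results


print(query(["item_id", "item_url", "count"], {"item_id": 4234324}))
```

No lookup against an external store is needed here, since the derived value travels with the
tuple.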

We'll append the item_url measure only to cuboids that have item_id as a dimension, to avoid
unnecessary duplication. In fact, we can properly configure the cuboid whitelist
(https://issues.apache.org/jira/browse/KYLIN-242) to make sure only one copy of item_url
exists across all of the cuboids.
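
As a rough illustration of the duplication argument, the Python sketch below enumerates the
cuboids of a three-dimension cube and picks the ones that would carry item_url. The dimension
list and the whitelist contents are made up, and the whitelist is modelled simply as an
explicit list of allowed cuboids, which may differ from how KYLIN-242 actually works.

```python
from itertools import combinations

# Hypothetical cube dimensions.
DIMENSIONS = ("dt", "seller_id", "item_id")


def all_cuboids(dims):
    """Enumerate every non-empty combination of dimensions."""
    return [
        frozenset(combo)
        for size in range(1, len(dims) + 1)
        for combo in combinations(dims, size)
    ]


def cuboids_carrying(host_column, cuboids):
    """Keep only the cuboids that contain the host column of the derived dim."""
    return [cuboid for cuboid in cuboids if host_column in cuboid]


CUBOIDS = all_cuboids(DIMENSIONS)
CARRIERS = cuboids_carrying("item_id", CUBOIDS)

# Hypothetical whitelist in which only one cuboid contains item_id.
WHITELIST = [
    frozenset({"dt"}),
    frozenset({"dt", "seller_id"}),
    frozenset({"dt", "seller_id", "item_id"}),
]
WHITELISTED_CARRIERS = cuboids_carrying("item_id", WHITELIST)

print(len(CUBOIDS), len(CARRIERS), len(WHITELISTED_CARRIERS))
```

Without a whitelist, four of the seven cuboids would each store a copy of item_url; with the
whitelist above, only one does.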

Please let me know if this design will solve your problem. It's open for discussion.

> Enable deriving dimensions on non PK/FK
> ---------------------------------------
>
>                 Key: KYLIN-1313
>                 URL: https://issues.apache.org/jira/browse/KYLIN-1313
>             Project: Kylin
>          Issue Type: Improvement
>            Reporter: hongbin ma
>            Assignee: hongbin ma
>
> Currently a derived column has to be a column on a lookup table, and the derived host column
> has to be a PK/FK (this is also a problem when the lookup table grows very large). Sometimes
> columns on the fact table exhibit a deriving relationship too. Here's an example fact table:
> (dt date, seller_id bigint, seller_name varchar(100), item_id bigint, item_url varchar(1000),
> count decimal, price decimal)
> seller_name is uniquely determined by each seller_id, and item_url is uniquely determined
> by each item_id. Users do not expect to filter on columns like seller_name or item_url;
> they just want to retrieve them when they group/filter on other dimensions like seller_id,
> item_id, or even dt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
