kylin-issues mailing list archives

From "liyang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KYLIN-1186) Support precise Count Distinct using bitmap
Date Fri, 22 Jan 2016 08:33:39 GMT

    [ https://issues.apache.org/jira/browse/KYLIN-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15112096#comment-15112096 ]

liyang commented on KYLIN-1186:
-------------------------------

This is a great patch. Please go ahead with the merge. We use `git rebase` to keep the commit
history linear.

Some very minor comments.

- in BitmapSerializer.maxLength(), the comment says 32 MB, but the code returns 8 MB.
- in BitmapSerializer.getStorageBytesEstimate(), the result should be an estimate of the
average size in bytes (I'm updating the javadoc)

Thanks Yerui!

> Support precise Count Distinct using bitmap
> -------------------------------------------
>
>                 Key: KYLIN-1186
>                 URL: https://issues.apache.org/jira/browse/KYLIN-1186
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Job Engine
>    Affects Versions: v1.1
>            Reporter: Yerui Sun
>            Assignee: Yerui Sun
>             Fix For: v2.0, v1.3
>
>         Attachments: KYLIN-1186-1.x-staging.2.patch, KYLIN-1186-1.x-staging.patch, KYLIN-1186-2.x-staging.2.patch, KYLIN-1186-2.x-staging.3.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> For now, Kylin only supports approximate count distinct, via HyperLogLog.
> In our production scenarios there is strong demand for precise count distinct, mainly on columns of type int or bigint, such as user-id, product-id, etc.
> Implementing precise count distinct for all types is difficult and inefficient. However, supporting only int or bigint makes this much easier: the values can be projected into a bitmap, which is easy to compress and store, and easy to count.
> I've created a POC based on RoaringBitmap, which proves that this works. There is some more work to be done:
> * RoaringBitmap only supports int, so a solution is needed to support bigint;
> * Add a new measure and codec, like HyperLogLogPlusCounter, to make it easy to use;
> * Add the new measure to the web UI, and check whether the column type is int or bigint;
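
The bitmap idea in the description above can be sketched as follows. This is a minimal illustration, not the code from the attached patches: it uses the JDK's uncompressed java.util.BitSet in place of RoaringBitmap so that it runs without extra dependencies, and the class and method names are hypothetical. The bigint handling (bucketing by the upper bits of the value and keeping one bitmap per bucket) is only one possible approach, assumed here for illustration; the issue leaves that solution open.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only. Uses java.util.BitSet (uncompressed) instead of
// RoaringBitmap; all names here are hypothetical, not taken from the patch.
class PreciseCountDistinctSketch {

    // Exact distinct count for non-negative int values: set one bit per
    // value, then count the set bits. Duplicates hit the same bit.
    static int countDistinctInts(int[] values) {
        BitSet bitmap = new BitSet();
        for (int v : values) {
            if (v < 0) {
                throw new IllegalArgumentException("BitSet index must be non-negative: " + v);
            }
            bitmap.set(v);
        }
        return bitmap.cardinality();
    }

    // Assumed approach for bigint: split each long into a bucket key
    // (upper 33 bits) and a 31-bit offset, with one bitmap per bucket.
    // Each long maps to exactly one (bucket, offset) pair, so the sum of
    // the per-bucket cardinalities is the exact distinct count.
    static long countDistinctLongs(long[] values) {
        Map<Long, BitSet> buckets = new HashMap<>();
        for (long v : values) {
            long bucket = v >>> 31;               // unsigned shift, also fine for negatives
            int offset = (int) (v & 0x7FFFFFFFL); // always in [0, 2^31 - 1]
            buckets.computeIfAbsent(bucket, k -> new BitSet()).set(offset);
        }
        long total = 0;
        for (BitSet bitmap : buckets.values()) {
            total += bitmap.cardinality();
        }
        return total;
    }

    public static void main(String[] args) {
        int[] userIds = {7, 42, 7, 1001, 42};
        long[] productIds = {5L, 5L, (1L << 40) + 5L, Long.MIN_VALUE + 5L};
        System.out.println(countDistinctInts(userIds));
        System.out.println(countDistinctLongs(productIds));
    }
}
```

Because BitSet is uncompressed, a single large id forces it to allocate memory proportional to that id, which is one reason a compressed bitmap such as RoaringBitmap is the better fit for sparse id columns.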



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
