drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Federico Ponzi <federico.ponz...@gmail.com>
Subject Suggestions
Date Wed, 03 Feb 2016 12:58:34 GMT
Sorry for the double email, I presed the wrong key combination.

Hi all,
I have daily snapshots of a database. After some time, some snapshots are
deleted for retention purpose.

So I have a situation like this:

   - 1 month ago: I have every day snapshots for every day of the past month
   - >= 2 month ago: I have a snapshots every 15 day
   - >= 6 month ago: I have a snapshot per month (say on every 1st of the
   month).


Now I would like to be able to query all this data. Since every snapshot is
about 250GB I naturally went to Big Data direction.
I think that Apache Drill is a great tool, and can be used to resolve my
problem, because i also need to be able to upload data from other sources
and compare it with the snapshots (for analysis purpose).

Also my snapshots can change Schema in time (another reason to search for a
schema-less solution).

Now, if I would need to do be able to execute statistics query, for
example:

Snapshot day 1:
    My_table
    id | title         | category   | date
    1  | My First Post | helloworld | 2015-01-01
Snapshot day 2 (A new post is added today):
    My_table
    id | title          | category   | date
    1  | My first post  | helloworld |  2015-01-01
    2  | My second post | other      | 2015-01-02
Snapshot day 3 (Post 2 is removed today):
    My_table
    id | title         | category   | date
    1  | My First Post | helloworld | 2015-01-01

So between days, a row of the table may or not may be constant. Now, I need
to be able to use a query like this:

SELECT category, COUNT(*) from day1.My_table group by category

This is for one table of one day. If we *want to count the daily average of
posts by category in a month* we should do something like:

SELECT category, SUM(cnt) / 30 from (
    SELECT category, COUNT(*) as cnt from day1.My_table group by category
    UNION ALL     SELECT category, COUNT(*) as cnt from day2.My_table
group by category
    UNION ALL ...
    UNION ALL     SELECT category, COUNT(*) as cnt from day30.My_table
group by category) group by category

Another example, the *number of post published in a month*:

SELECT COUNT(distinct id) from ( SELECT id FROM day1.My_table UNION
ALL ... UNION ALL SELECT id from day30.My_table);

The problem is that because of retention, there can be situation where *there
aren't snapshot* for every day of the month. So for example, after 1 year
we could have 1 snapshot per month. So we couldn't be able to perform a
query like that.

Basically we would need to consider a weight. If we have *day1.My_table*
and *day5.My_table*, every post that is in day1 and not in day5 will be
counted as it was also in day 2,3,4. Every post that is day1 and day5 will
count as if it is in every day of the month (= untill the next snapshot).

So in case i would like to consider the average number of post per day of
>=6 months a go, where I have just 1 snapshot, I would assign to that
snapshot a weight of 30.

So, the avrage post published in a month for a range >= 6 months ago is:

SELECT category, SUM(cnt) / 30 from (
    SELECT category, COUNT(*)*30 as cnt from day1.My_table group by
category --- Note: I'm not considering the range defined from the user
in this example.
) group by category;

As comment also stated, I would need to do a query like:

Select category, AVG(*) from [fromRange-toRange].MyTable;

For an extreme solution, I'm considering the idea of implementing a
metalanguage to let the future user (ex. marketng people) to do a query
like this.

Do you think there is a way to accomplish this in Drill without the
meta-language? I would do this using a recursive UDF but they can't return
queries.

Thanks for any help you will provide!

Federico Ponzi

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message