lens-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dayou Jiang <fel...@gmail.com>
Subject Re: Lens Query Conversion
Date Thu, 11 Aug 2016 18:29:45 GMT
Logged a bug here: https://issues.apache.org/jira/browse/LENS-1276

Thanks,
Dayou Jiang

On Wed, Aug 10, 2016 at 9:05 PM, amareshwarisr . <amareshwari@gmail.com>
wrote:

> Dayou Jiang,
>
> Seems a bug. Please file a jira at https://issues.apache.org/
> jira/browse/LENS.
>
> Thanks
>
> On Thu, Aug 11, 2016 at 12:34 AM, Dayou Jiang <feltjd@gmail.com> wrote:
>
>> Hi,
>>
>> I would like to add something in this partition discussion:
>>
>> I found out that Lens doesn't allow me to use the "update fact" command
>> to change the partition update periods. For example, i have the "<
>> update_period>DAILY</update_period>" tag in one of my fact table
>> definition and i now decide to have the
>> MONTHLY partition too, but i couldn't update my fact table to accept
>> that. Even after i dropped the entire fact table
>> and tried to create it again with both "DAILY" and "MONTHLY" as the
>> update periods i still couldn't add the MONTHLY partition.
>> I think some of the meta data still got hanging around and not being
>> deleted after i dropped the table.
>> Is this a bug at Lens?
>>
>>
>> Thanks,
>> Dayou Jiang
>>
>> On Wed, Aug 10, 2016 at 11:36 AM, Tao Yan <tyan@linkedin.com> wrote:
>>
>>> Thanks for answering the question! It is helpful.
>>>
>>> On Wed, Aug 10, 2016 at 1:15 AM, amareshwarisr . <amareshwari@gmail.com>
>>> wrote:
>>>
>>>> We have a mapping time dimension columns to its partition columns,
>>>> because system's like HIVE would require partition column to be separate
>>>> column than table columns.
>>>>
>>>> So, if tables already contain order_time as a column, and the data is
>>>> partitioned by order_time - then partition column name has to be a separate
>>>> column - which is defined through the mapping.
>>>>
>>>> Thanks
>>>>
>>>> On Wed, Aug 10, 2016 at 1:40 PM, Rajat Khandelwal <prongs@apache.org>
>>>> wrote:
>>>>
>>>>> "value" is required in a "property" tag. A property tag is name and
>>>>> value. This special property defines that whenever queried on order_time
>>>>> time dimension, try to answer with facts that have "ot" as
>>>>> partition column. If not, it falls back to delivery_time time dimension.
>>>>> There again, the mapping is checked, which is "dt", so facts that have
"dt"
>>>>> as partition column are preferred. This can happen in a chain until a
valid
>>>>> combination is found which can answer the query.
>>>>>
>>>>>
>>>>> On Wed, Aug 10, 2016 at 12:13 AM Tao Yan <tyan@linkedin.com> wrote:
>>>>>
>>>>>> So, the field 'value' can be used to define both alias and real
>>>>>> values?
>>>>>>
>>>>>> On Tue, Aug 9, 2016 at 11:40 AM, Rajat Khandelwal <prongs@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, that is correct.
>>>>>>>
>>>>>>> On Wed, Aug 10, 2016, 00:09 Tao Yan <tyan@linkedin.com>
wrote:
>>>>>>>
>>>>>>>> Thanks for the explanation. I got the idea of tentative partitions.
>>>>>>>>
>>>>>>>> Is the property
>>>>>>>> *<property name="cube.timedim.partition.order_time" value="ot"
/>*
>>>>>>>>  referring to the dim_attribute
>>>>>>>> *<dim_attribute name="order_time" _type="TIMESTAMP" />
*?
>>>>>>>>
>>>>>>>> Is the value 'ot' an alias of the partition column name?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Aug 9, 2016 at 12:47 AM, Rajat Khandelwal <
>>>>>>>> prongs@apache.org> wrote:
>>>>>>>>
>>>>>>>>> The idea of partitions is that there is a stream of data
that
>>>>>>>>> keeps flowing in, each line of which has delivery time
and order time both.
>>>>>>>>> The data is partitioned and stored in partitions later
and facts are
>>>>>>>>> created on top of that partitioned data. So if data is
partitioned in order
>>>>>>>>> time and the time range queried is also order time, then
we can pick the
>>>>>>>>> partitions directly. But if the data is partitioned on
delivery time, as is
>>>>>>>>> the case for this fact, and the query is on order time,
then you pick some
>>>>>>>>> tentative partitions of delivery time, but you still
have to filter over
>>>>>>>>> order time. Here the purpose of picking partitions is
to reduce the amount
>>>>>>>>> of data read. When you pick delivery time partitions,
it doesn't guarantee
>>>>>>>>> what order times data has, it only provides a tentative
range. Hence, the
>>>>>>>>> partitions are picked based on the relation of time dimensions,
but the
>>>>>>>>> extra filter still has to be supplied.
>>>>>>>>>
>>>>>>>>> Hope it's clear.
>>>>>>>>>
>>>>>>>>> On Tue, Aug 9, 2016 at 10:30 AM Rajat Khandelwal <
>>>>>>>>> prongs@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> That's the original time range asked in the cube
query.
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 9, 2016, 01:46 Tao Yan <tyan@linkedin.com>
wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Rajat,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the explain. I have a clear understanding
about dim
>>>>>>>>>>> relation now, however, how does the following
condition generated:
>>>>>>>>>>>
>>>>>>>>>>> and ((sales.order_time) >= '2015-04-13 03:00:00')
and ((sales.order_time) < '2015-04-13 04:00:00')))
>>>>>>>>>>>
>>>>>>>>>>> order_time is a dim attributes defined in sales
cube, and it does not have any relationships with properties or other dim attributes (The
property cube.timedim.partition.order_time does, but is it the same thing?), I wonder how
the query could pick up this column.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 8, 2016 at 4:12 AM, Rajat Khandelwal
<
>>>>>>>>>>> prongs@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I've created a jira for this: https://issues.apache.or
>>>>>>>>>>>> g/jira/browse/LENS-1269
>>>>>>>>>>>>
>>>>>>>>>>>> You should be able to make change in your
data locally and see
>>>>>>>>>>>> the non-empty results:
>>>>>>>>>>>>
>>>>>>>>>>>> cat /Users/rajat.khandelwal/Git/le
>>>>>>>>>>>> ns/lens-dist/target/apache-lens-2.7.0-SNAPSHOT-bin/apache-le
>>>>>>>>>>>> ns-2.7.0-SNAPSHOT-bin/client/examples/data/sales_aggr_fact2_
>>>>>>>>>>>> local3/sales-aggr-fact2-local3.data
>>>>>>>>>>>> 2015-04-13 03:00:00,2015-04-13 04:00:00,1,1,1,1,1,1,5,0
>>>>>>>>>>>> 2015-04-13 03:00:00,2015-04-13 04:00:00,2,1,2,2,2,1,8,2
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Aug 8, 2016 at 2:46 PM Rajat Khandelwal
<
>>>>>>>>>>>> prongs@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> More details on how it converts to the
fallback range:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Range asked:
>>>>>>>>>>>>> (order_time, '2015-04-13-03', '2015-04-13-04')
>>>>>>>>>>>>>
>>>>>>>>>>>>> Timedim Relation :
>>>>>>>>>>>>>
>>>>>>>>>>>>> <property name="cube.timedim.relation.order_time"
value="delivery_time+[-20 days,-1 hour]" />
>>>>>>>>>>>>>
>>>>>>>>>>>>> <property name="cube.timedim.relation.order_time"
value="delivery_time+[-2 hours,-1hour]" />
>>>>>>>>>>>>>
>>>>>>>>>>>>> The cube defines the first relation,
the fact defines the second relation. A fact can optionally override the timedim relation
defined by cube. So the second one is picked :
>>>>>>>>>>>>>
>>>>>>>>>>>>> Derivation:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2015-04-13-03 <=ot < 2015-04-13-04
>>>>>>>>>>>>> dt - 2 hours <= ot <= dt - 1 hour
=> ot + 1 hour <= dt <= ot +
>>>>>>>>>>>>> 2 hour
>>>>>>>>>>>>> => 2015-04-13-04 <= dt < 2015-04-13-06.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hope it's clearer now.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Aug 8, 2016 at 1:29 PM Rajat
Khandelwal <
>>>>>>>>>>>>> prongs@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So it seems right now the query is
designed to return 0 rows.
>>>>>>>>>>>>>> The query is there just to demonstrate
the fallback feature.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cube query is
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> cube select customer_city_name, store_cost
from sales where time_range_in(order_time,
>>>>>>>>>>>>>> '2015-04-13-03', '2015-04-13-04')
>>>>>>>>>>>>>> Driver query comes out to be
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> SELECT (customer_city.name), sum((sales.store_cost))
FROM a.local_sales_aggr_fact2 sales join a.local_city_table customer_city on sales.customer_city_id
= customer_city.id and (customer_city.dt = 'latest') WHERE (((((sales.dt) = '2015-04-13-04')
or ((sales.dt) = '2015-04-13-05')) and ((sales.order_time) >= '2015-04-13 03:00:00') and
((sales.order_time) < '2015-04-13 04:00:00'))) GROUP BY (customer_city.name)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Then I modified the driver query
and ran:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> SELECT (customer_city.name), sales.order_time,
sales.store_cost FROM a.local_sales_aggr_fact2 sales join a.local_city_table customer_city
on sales.customer_city_id = customer_city.id and (customer_city.dt = 'latest') WHERE (((((sales.dt)
= '2015-04-13-04') or ((sales.dt) = '2015-04-13-05'))))
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Found the following results:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Bangalore^A2015-04-13 00:00:00^A0.0
>>>>>>>>>>>>>> Hyderabad^A2015-04-13 00:00:00^A2.0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's apparent that both the rows
will be filtered out by the
>>>>>>>>>>>>>> translated query.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'll look deeper and replace the
example query with something
>>>>>>>>>>>>>> that demonstrates the fallback feature
and also returns non-empty results.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Let us know if you have any confusion
understanding the
>>>>>>>>>>>>>> feature itself. I think the video
should have covered it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>> *Tao Yan*
>>>>>>>>>>> Software Engineer
>>>>>>>>>>> Data Analytics Infrastructure Tools and Services
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 206.250.5345
>>>>>>>>>>> tyan@linkedin.com
>>>>>>>>>>> https://www.linkedin.com/in/taousc
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> *Tao Yan*
>>>>>>>> Software Engineer
>>>>>>>> Data Analytics Infrastructure Tools and Services
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 206.250.5345
>>>>>>>> tyan@linkedin.com
>>>>>>>> https://www.linkedin.com/in/taousc
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> *Tao Yan*
>>>>>> Software Engineer
>>>>>> Data Analytics Infrastructure Tools and Services
>>>>>>
>>>>>>
>>>>>>
>>>>>> 206.250.5345
>>>>>> tyan@linkedin.com
>>>>>> https://www.linkedin.com/in/taousc
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> *Tao Yan*
>>> Software Engineer
>>> Data Analytics Infrastructure Tools and Services
>>>
>>>
>>>
>>> 206.250.5345
>>> tyan@linkedin.com
>>> https://www.linkedin.com/in/taousc
>>>
>>
>>
>

Mime
View raw message