lens-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tao Yan <t...@linkedin.com>
Subject Re: Adding MySQL as a Data Source
Date Thu, 01 Sep 2016 00:34:34 GMT
Thanks! That worked.

On Tue, Aug 30, 2016 at 1:33 AM, Puneet Gupta <puneet.gupta@inmobi.com>
wrote:

> Adding example for dimension table without partitions as mentioned by Rajat
>
>
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <x_dimension_table xmlns="uri:lens:cube:0.1" dimension_name="account" table_name="account_mysql"
weight="0.1">
>     <columns>
>         <column name="id" _type="int" comment=""/>
>         <column name="account_name" _type="string" comment=""/>
>         <column name="status" _type="string" comment=""/>
>         <column name="enabled" _type="boolean" comment=""/>
>     </columns>
>     <storage_tables>
>         <storage_table>
>             <storage_name>mysql_storage</storage_name>
>             <table_desc external="true"  input_format="org.apache.hadoop.mapred.TextInputFormat"
output_format="org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat" serde_class_name="org.apache.lens.storage.db.DBSerde"
storage_handler_name="org.apache.lens.storage.db.DBStorageHandler"  >
>                 <table_parameters>
>                     <property name="lens.metastore.native.table.name" value="dim_account"/>
>                     <property name="lens.metastore.native.db.name" value="lensdb"/>
>                 </table_parameters>
>             </table_desc>
>         </storage_table>
>     </storage_tables>
> </x_dimension_table>
>
>
> The actual table in mysql is lensdb.dim_account
>
>
> Thanks,
> Puneet Gupta
>
> On Tue, Aug 30, 2016 at 12:33 PM, Rajat Khandelwal <rajatgupta59@gmail.com
> > wrote:
>
>> The idea of partitions in dimtables is based on snapshots. A metadata is
>> either snapshotted (at some interval the entire metadata is added in a new
>> partition), or it's fixed (like you mentioned). In that case, I believe
>> removing the partition column declaration from the dimtable xml file should
>> work. We have the same use case and are using it in production without any
>> issue. Hive metadata is snpathotted hourly/daily and db metadata is fixed.
>>
>>
>> On Tue, Aug 30, 2016 at 2:02 AM Tao Yan <tyan@linkedin.com> wrote:
>>
>>> Hi Rajat,
>>>
>>> I created the storage and dimtable using the configurations you
>>> provided, and the query failed because no partition is added to the table,
>>> so, I added the following partition:
>>>
>>> *<x_partition fact_or_dimension_table_name="dimension1_subset"
>>> update_period="HOURLY"*
>>> *  xmlns="uri:lens:cube:0.1"
>>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance
>>> <http://www.w3.org/2001/XMLSchema-instance>"*
>>> *  xsi:schemaLocation="uri:lens:cube:0.1 cube-0.1.xsd ">*
>>> *  <time_partition_spec>*
>>> *    <part_spec_element key="dt" value="2016-08-25T00:00:00"/>*
>>> *  </time_partition_spec>*
>>> *</x_partition>*
>>>
>>> And when I run the query again, Lens translate the Lens query to the
>>> following SQL:
>>>
>>>
>>>
>>> *lens-shell>select primary_key, attr1, attr5 from dimension129 Aug 2016
>>> 18:10:06 [Spring Shell] INFO  cliLogger - Query handle:
>>> ccb3b99a-22d6-4fbb-9bb8-9f06f633c4b029 Aug 2016 18:10:06 [Spring Shell]
>>> INFO  cliLogger - User query: 'select primary_key, attr1, attr5 from
>>> dimension1' was submitted to jdbc/mysql29 Aug 2016 18:10:06 [Spring Shell]
>>> INFO  cliLogger -  Driver query: 'SELECT ( dimension1 . primary_key ), (
>>> dimension1 . attr1 ), ( dimension1 . attr5 ) FROM mysql_dimension1_subset
>>> dimension1 WHERE ((((dimension1.dt = 'latest'))))' and Driver handle: null**And
>>> it failed because of 'unknown column' error:*
>>>
>>> *29 Aug 2016 17:59:24 [Spring Shell] INFO  cliLogger - Query Status:
>>> Status : FAILED Message : Query execution failed! Progress : 1.0 Error :
>>> Unknown column 'dimension1_subset__lens_dimension1_subset_dimension1.dt' in
>>> 'where clause'*
>>>
>>> *This is expected because the column dt is not part of mysql table **dimension1_subset.
>>> I added the column to the table and also set the value to 'latest', then,
>>> the query returned the result.*
>>>
>>>
>>>
>>> *​-----------------*
>>>
>>> *primary_key    attr1    attr5Results of query stored at :
>>> hdfs://ltx1-holdemnn01.grid.linkedin.com:9000/user/dev_svc/lensreports/ccb3b99a-22d6-4fbb-9bb8-9f06f633c4b0.csv
>>> <http://ltx1-holdemnn01.grid.linkedin.com:9000/user/dev_svc/lensreports/ccb3b99a-22d6-4fbb-9bb8-9f06f633c4b0.csv>
>>>  4 rows processed in (0) seconds.*
>>>
>>>
>>> *I think this is not the way Lens should work with MySQL because
>>> sometimes MySQL table cannot be changed, and it should not always be
>>> changed just because Lens is going to query it. In Hive, when a newer
>>> partition is added, Lens will automatically created a 'latest' partition, I
>>> wonder how that work in MySQL or other databases. *
>>>
>>> Is it possible to add a MySQL table to a partitioned dimtable without
>>> specifying the partition column, and assume the added MySQL table is the
>>> latest partition?
>>>
>>> Thanks,
>>>
>>> On Mon, Aug 29, 2016 at 12:46 AM, Rajat Khandelwal <
>>> rajatgupta59@gmail.com> wrote:
>>>
>>>> Use something like
>>>>
>>>> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
>>>> <x_storage xmlns="uri:lens:cube:0.1" name="mysql" classname="org.apache.lens.storage.db.DBStorage">
>>>> </x_storage>
>>>>
>>>>
>>>> <storage_table>
>>>>   <update_periods>
>>>>     <update_period>HOURLY</update_period>
>>>>   </update_periods>
>>>>   <storage_name>db</storage_name>
>>>>   <table_desc external="true"  input_format="org.apache.hadoop.mapred.TextInputFormat"
output_format="org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat" serde_class_name="org.apache.lens.storage.db.DBSerde"
storage_handler_name="org.apache.lens.storage.db.DBStorageHandler" num_buckets="0" compressed="false">
>>>>     <part_cols>
>>>>       <column name="pt" _type="string" comment="date partition"/>
>>>>       <column name="dt" _type="string" comment="date partition"/>
>>>>       <column name="ot" _type="string" comment="date partition"/>
>>>>     </part_cols>
>>>>     <table_parameters>
>>>>       <property name="lens.metastore.native.table.name" value="table_name_in_mysql_db"/>
>>>>       <property name="lens.metastore.native.db.name" value="mysql_db_name"/>
>>>>       <property name="lens.metastore.native.table.column.mapping"
>>>>                 value="pt=process_time,field1=field1_actual_name,field2=field2_actual_name"/>
>>>>       <property name="cube.storagetable.start.times" value="now - 4 days"/>
>>>>     </table_parameters>
>>>>     <time_part_cols>pt</time_part_cols>
>>>>     <time_part_cols>dt</time_part_cols>
>>>>     <time_part_cols>ot</time_part_cols>
>>>>   </table_desc>
>>>> </storage_table>
>>>>
>>>>
>>>> On Sat, Aug 27, 2016 at 3:42 AM Tao Yan <tyan@linkedin.com> wrote:
>>>>
>>>>> Hi Lens Developers,
>>>>>
>>>>> I am trying to add mysql as a data source, so, I created the driver as
>>>>> follows:
>>>>>
>>>>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>>>>
>>>>> <configuration>
>>>>>   <property>
>>>>>     <name>lens.driver.jdbc.driver.class</name>
>>>>>     <value>com.mysql.jdbc.Driver</value>
>>>>>   </property>
>>>>>   <property>
>>>>>     <name>lens.driver.jdbc.db.uri</name>
>>>>>     <value>jdbc:mysql://****mysql**hostname****/lens;user=****;p
>>>>> assowrd=****</value>
>>>>>   </property>
>>>>>   <property>
>>>>>     <name>lens.driver.jdbc.db.user</name>
>>>>>     <value>****</value>
>>>>>   </property>
>>>>>   <property>
>>>>>     <name>lens.cube.query.driver.supported.storages</name>
>>>>>     <value>mysql</value>
>>>>>     <final>true</final>
>>>>>   </property>
>>>>>   <property>
>>>>>     <name>lens.driver.jdbc.query.rewriter</name>
>>>>>     <value>org.apache.lens.driver.jdbc.ColumnarSQLRewriter</value>
>>>>>   </property>
>>>>>   <property>
>>>>>     <name>lens.driver.jdbc.explain.keyword</name>
>>>>>     <value>explain plan for </value>
>>>>>   </property>
>>>>> </configuration>
>>>>>
>>>>> And created a test table with data in mysql:
>>>>>
>>>>> create table dimension1_subset (
>>>>> primary_key BIGINT(8) not null primary key,
>>>>> attr1 BIGINT(8),
>>>>> attr5 BIGINT(8)
>>>>> );
>>>>>
>>>>> [image: Screen Shot 2016-08-26 at 2.44.24 PM.png]
>>>>> ​
>>>>> And created the storage file:
>>>>>
>>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>>>
>>>>> <x_storage classname="org.apache.lens.storage.db.DBStorage"
>>>>> name="mysql" xmlns="uri:lens:cube:0.1"
>>>>>   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>>>> xsi:schemaLocation="uri:lens:cube:0.1 cube-0.1.xsd ">
>>>>>   <properties>
>>>>>     *<property name="lens.storage.db.url"
>>>>> value="jdbc:mysql://lva1-db34.corp.linkedin.com/lens
>>>>> <http://lva1-db34.corp.linkedin.com/lens>"/>*
>>>>>   </properties>
>>>>> </x_storage>
>>>>>
>>>>> I am confused when set the value for *lens.storage.db.url *because* it
>>>>> is neither local filesystem nor HDFS.*
>>>>>
>>>>> And I don't know what should the storage table look like:
>>>>>
>>>>> <x_dimension_table dimension_name="dimension1"
>>>>> table_name="dimension1_table3" weight="5.0" xmlns="uri:lens:cube:0.1"
>>>>>   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>>>> xsi:schemaLocation="uri:lens:cube:0.1 cube-0.1.xsd ">
>>>>>   <columns>
>>>>>     <column name="primary_key" _type="BIGINT"/>
>>>>>     <column name="attr1" _type="BIGINT"/>
>>>>>     <column name="attr5" _type="BIGINT"/>
>>>>>   </columns>
>>>>>   <properties>
>>>>>     <property name="dimension1.prop" value="t3"/>
>>>>>   </properties>
>>>>>   <storage_tables>
>>>>>
>>>>>
>>>>> *    <storage_table>      ...    </storage_table>*
>>>>>   </storage_tables>
>>>>> </x_dimension_table>
>>>>>
>>>>> What should I use for the table_location, and what about part_cols?
>>>>> How do I add data to this table?
>>>>>
>>>>> Thanks,
>>>>> --
>>>>>
>>>>> *Tao Yan*
>>>>> Software Engineer
>>>>> Data Analytics Infrastructure Tools and Services
>>>>>
>>>>>
>>>>>
>>>>> 206.250.5345
>>>>> tyan@linkedin.com
>>>>> https://www.linkedin.com/in/taousc
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> *Tao Yan*
>>> Software Engineer
>>> Data Analytics Infrastructure Tools and Services
>>>
>>>
>>>
>>> 206.250.5345
>>> tyan@linkedin.com
>>> https://www.linkedin.com/in/taousc
>>>
>>
>
> _____________________________________________________________
> The information contained in this communication is intended solely for the
> use of the individual or entity to whom it is addressed and others
> authorized to receive it. It may contain confidential or legally privileged
> information. If you are not the intended recipient you are hereby notified
> that any disclosure, copying, distribution or taking any action in reliance
> on the contents of this information is strictly prohibited and may be
> unlawful. If you have received this communication in error, please notify
> us immediately by responding to this email and then delete it from your
> system. The firm is neither liable for the proper and complete transmission
> of the information contained in this communication nor for any delay in its
> receipt.




-- 

*Tao Yan*
Software Engineer
Data Analytics Infrastructure Tools and Services



206.250.5345
tyan@linkedin.com
https://www.linkedin.com/in/taousc

Mime
View raw message