hbase-user mailing list archives

From Demai Ni <nid...@gmail.com>
Subject Re: ETL HBase HFile+HLog to ORC(or Parquet) file?
Date Fri, 21 Oct 2016 21:28:35 GMT
Mich,

thanks for the detailed instructions.

While I am aware of the Hive method, I have a few questions/concerns:
1) the Hive method is an "INSERT ... SELECT", which usually does not perform
as well as a bulk load, though I am not familiar with the actual
implementation
2) I have another SQL-on-Hadoop engine that works well with ORC files, so
if possible I'd like to avoid a system dependency on Hive (one fewer
component to maintain)
3) HBase already has well-established back-end processes for Replication
(HBASE-1295) and Backup (HBASE-7912), so I am wondering whether anything
can be piggy-backed on them to handle the day-to-day work

The goal is to have HBase as the OLTP front end (to receive data) and the
ORC files (with a SQL engine) as the OLAP end for reporting/analytics. The
ORC files will also serve as my backup in the DR case.

Demai


On Fri, Oct 21, 2016 at 1:57 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> Create an external table in Hive on the HBase table. Pretty straightforward.
>
> hive> CREATE EXTERNAL TABLE marketDataHbase (key STRING, ticker STRING,
> timecreated STRING, price STRING)
>
>     STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>     WITH SERDEPROPERTIES ("hbase.columns.mapping" =
>     ":key,price_info:ticker,price_info:timecreated,price_info:price")
>
>     TBLPROPERTIES ("hbase.table.name" = "marketDataHbase");
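>
> To sanity-check the column mapping, you can run a quick query against the
> external table (this assumes the HBase table already holds some rows):
>
>     SELECT key, ticker, timecreated, price
>     FROM marketDataHbase
>     LIMIT 10;
>
> If the mapping is wrong, the misplaced columns will come back NULL here,
> which is cheaper to spot now than after the ORC load.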
>
>
>
> Then create a normal table in Hive stored as ORC:
>
>
> CREATE TABLE IF NOT EXISTS marketData (
>      KEY string
>    , TICKER string
>    , TIMECREATED string
>    , PRICE float
> )
> PARTITIONED BY (DateStamp  string)
> STORED AS ORC
> TBLPROPERTIES (
> "orc.create.index"="true",
> "orc.bloom.filter.columns"="KEY",
> "orc.bloom.filter.fpp"="0.05",
> "orc.compress"="SNAPPY",
> "orc.stripe.size"="16777216",
> "orc.row.index.stride"="10000" )
> ;
> --show create table marketData;
> --Populate target table
> INSERT OVERWRITE TABLE marketData PARTITION (DateStamp = "${TODAY}")
> SELECT
>       KEY
>     , TICKER
>     , TIMECREATED
>     , PRICE
> FROM marketDataHbase;
>
>
> Run this job as a cron job at regular intervals.
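>
> If a full INSERT OVERWRITE of the whole table gets expensive, one
> variation (a sketch, assuming timecreated is stored in a format that
> Hive's to_date() can parse) is to restrict the SELECT to the current day,
> so only that day's partition is rewritten:
>
>     INSERT OVERWRITE TABLE marketData PARTITION (DateStamp = "${TODAY}")
>     SELECT key, ticker, timecreated, price
>     FROM marketDataHbase
>     WHERE to_date(timecreated) = "${TODAY}";
>
> Rows that arrive late for a past day would still need a re-run of that
> day's partition.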
>
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 21 October 2016 at 21:48, Demai Ni <nidmgg@gmail.com> wrote:
>
> > hi,
> >
> > I am wondering whether there are existing methods to ETL HBase data to
> > ORC (or other open-source columnar) files?
> >
> > I understand that in Hive, "INSERT INTO Hive_ORC_Table SELECT * FROM
> > Hive_HBase_Table" can probably get the job done. Is this the common way
> > to do so? Is the performance acceptable, and can it handle delta updates
> > when the HBase table changes?
> >
> > I did a bit of googling and found this:
> > https://community.hortonworks.com/questions/2632/loading-hbase-from-hive-orc-tables.html
> >
> > which is the other way around.
> >
> > Would it perform better (compared to the Hive statement above) to use
> > either the replication logic or a snapshot backup to generate ORC files
> > from HBase tables, with the ability to update incrementally?
> >
> > I hope to have as few dependencies as possible. In the ORC example, it
> > would depend only on Apache ORC's API, and not on Hive.
> >
> > Demai
> >
>
