spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Spark SQL Thriftserver with HBase
Date Sun, 09 Oct 2016 07:22:32 GMT
Cloudera 5.8 ships a very old version of Hive without Tez, but Mich has already provided a good
alternative. However, you should check whether it contains recent versions of HBase and Phoenix.
That being said, I wonder what dataflow, data model and analysis you plan to do; maybe completely
different solutions are possible. In particular, single inserts, upserts, etc. should be avoided
as much as possible in the Big Data (analysis) world with any technology, because they do not
perform well.

Hive with LLAP will provide an in-memory cache for interactive analytics. You can also put full
tables in memory with Hive using the Ignite HDFS in-memory solution. All of this only makes sense
if you do not use MapReduce as the engine, use the right input format (ORC, Parquet), and run a
recent Hive version.
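
As a minimal sketch of that setup (assuming Hive on Tez with LLAP enabled; the settings and
table shown are illustrative, not from this thread):

-- run queries on Tez rather than MapReduce
SET hive.execution.engine=tez;
-- let LLAP execute and cache wherever possible
SET hive.llap.execution.mode=all;

-- keep the data columnar so the in-memory cache pays off
CREATE TABLE stock_daily_orc (
  ticker      STRING,
  trade_date  DATE,
  close_price DECIMAL(10,2)
)
STORED AS ORC;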

> On 8 Oct 2016, at 21:55, Benjamin Kim <bbuild11@gmail.com> wrote:
> 
> Mich,
> 
> Unfortunately, we are moving away from Hive and unifying on Spark, using CDH 5.8 as our
> distro. And Tableau has released a Spark ODBC/JDBC driver too. I will either try the Phoenix
> JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the
> JDBC in-between until the Kudu team completes Spark SQL support for JDBC.
> 
> Thanks for the advice.
> 
> Cheers,
> Ben
> 
> 
>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>> 
>> Sure. But essentially you are looking at batch data for analytics for your Tableau users,
>> so Hive may be a better choice, with its rich SQL and an existing ODBC/JDBC connection to
>> Tableau.
>> 
>> I would go for Hive, especially as the new release will have an in-memory offering as well
>> for frequently accessed data :)
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage
>> or destruction of data or any other property which may arise from relying on this email's
>> technical content is explicitly disclaimed. The author will in no case be liable for any
>> monetary damages arising from such loss, damage or destruction.
>>  
>> 
>>> On 8 October 2016 at 20:15, Benjamin Kim <bbuild11@gmail.com> wrote:
>>> Mich,
>>> 
>>> First and foremost, we have visualization servers that run Tableau for external user
>>> reports. Second, we have ad servers and REST endpoints for cookie sync and segmentation
>>> data exchange. These will use JDBC directly within the same data center; when not colocated
>>> in the same data center, they will connect to the database server using JDBC as well. Either
>>> way, using JDBC everywhere simplifies and unifies the code around the JDBC industry standard.
>>> 
>>> Does this make sense?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>>> 
>>>> Like any other design question: what are your presentation layer and end users?
>>>> 
>>>> Are they SQL-centric users from a Tableau background, or will they use Spark functional
>>>> programming?
>>>> 
>>>> It is best to describe the use case.
>>>> 
>>>> HTH
>>>> 
>>>> Dr Mich Talebzadeh
>>>>  
>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>  
>>>> http://talebzadehmich.wordpress.com
>>>> 
>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage
>>>> or destruction of data or any other property which may arise from relying on this email's
>>>> technical content is explicitly disclaimed. The author will in no case be liable for any
>>>> monetary damages arising from such loss, damage or destruction.
>>>>  
>>>> 
>>>>> On 8 October 2016 at 19:40, Felix Cheung <felixcheung_m@hotmail.com> wrote:
>>>>> I wouldn't be too surprised if Spark SQL -> JDBC data source -> Phoenix JDBC server ->
>>>>> HBase worked better.
>>>>> 
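>>>>> A minimal, untested sketch of that chain from the Spark SQL side, assuming the Phoenix
>>>>> Query Server thin driver is on the Thrift server's classpath (hostname and table name
>>>>> are illustrative):
>>>>> 
>>>>> CREATE TABLE tsco_via_phoenix
>>>>> USING org.apache.spark.sql.jdbc
>>>>> OPTIONS (
>>>>>   url "jdbc:phoenix:thin:url=http://phoenix-host:8765;serialization=PROTOBUF",
>>>>>   driver "org.apache.phoenix.queryserver.client.Driver",
>>>>>   dbtable "tsco"
>>>>> );
>>>>> 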
>>>>> Without naming specifics, there are at least 4 or 5 different implementations of HBase
>>>>> sources, each at a varying level of development and with different requirements (HBase
>>>>> release version, Kerberos support, etc.).
>>>>> 
>>>>> 
>>>>> _____________________________
>>>>> From: Benjamin Kim <bbuild11@gmail.com>
>>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>> To: Mich Talebzadeh <mich.talebzadeh@gmail.com>
>>>>> Cc: <user@spark.apache.org>, Felix Cheung <felixcheung_m@hotmail.com>
>>>>> 
>>>>> 
>>>>> 
>>>>> Mich,
>>>>> 
>>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.
>>>>> 
>>>>> Thanks,
>>>>> Ben
>>>>> 
>>>>> 
>>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>>>> 
>>>>> I don't think it will work directly.
>>>>> 
>>>>> You can, however, use Phoenix on top of HBase:
>>>>> 
>>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>>> ROW             COLUMN+CELL
>>>>>  TSCO-1-Apr-08  column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>>  TSCO-1-Apr-08  column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>>  TSCO-1-Apr-08  column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>>  TSCO-1-Apr-08  column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>>  TSCO-1-Apr-08  column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>>  TSCO-1-Apr-08  column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>>  TSCO-1-Apr-08  column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>>  TSCO-1-Apr-08  column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>>> 
>>>>> And the same data via Phoenix on top of the HBase table:
>>>>> 
>>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate,
>>>>>   "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open",
>>>>>   "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice"
>>>>>   from "tsco" where to_number("volume") > 0 and "high" != '-'
>>>>>   and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd')
>>>>>   order by to_date("Date",'dd-MMM-yy') limit 1;
>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>> 
>>>>> HTH
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>>  
>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>  
>>>>> http://talebzadehmich.wordpress.com
>>>>> 
>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage
>>>>> or destruction of data or any other property which may arise from relying on this email's
>>>>> technical content is explicitly disclaimed. The author will in no case be liable for any
>>>>> monetary damages arising from such loss, damage or destruction.
>>>>>  
>>>>> 
>>>>>> On 8 October 2016 at 19:05, Felix Cheung <felixcheung_m@hotmail.com> wrote:
>>>>>> Great. Then I think those packages, as Spark data sources, should allow you to do exactly
>>>>>> that (replace org.apache.spark.sql.jdbc with an HBase one).
>>>>>> 
>>>>>> I do think it would be great to get more examples around this, though. Please share your
>>>>>> experience with it!
>>>>>> 
>>>>>> 
>>>>>> _____________________________
>>>>>> From: Benjamin Kim <bbuild11@gmail.com>
>>>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>> To: Felix Cheung <felixcheung_m@hotmail.com>
>>>>>> Cc: <user@spark.apache.org>
>>>>>> 
>>>>>> 
>>>>>> Felix,
>>>>>> 
>>>>>> My goal is to use the Spark SQL JDBC Thriftserver to access HBase tables using just SQL.
>>>>>> In the past, I have been able to CREATE tables using the statement below:
>>>>>> 
>>>>>> CREATE TABLE <table-name>
>>>>>> USING org.apache.spark.sql.jdbc
>>>>>> OPTIONS (
>>>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>>>>>>   dbtable "dim.dimension_acamp"
>>>>>> );
>>>>>> 
>>>>>> After doing this, I can access the PostgreSQL table through the Spark SQL JDBC
>>>>>> Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same
>>>>>> with HBase tables. We tried this using Hive and HiveServer2, but the response times are
>>>>>> just too long.
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> 
>>>>>> 
>>>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <felixcheung_m@hotmail.com> wrote:
>>>>>> 
>>>>>> Ben,
>>>>>> 
>>>>>> I'm not sure I'm following completely.
>>>>>> 
>>>>>> Is your goal to use Spark to create or access tables in HBase? If so, the link below and
>>>>>> several packages out there support that by providing an HBase data source for Spark. There
>>>>>> are some examples of what the Spark code looks like in that link as well. On that note, you
>>>>>> should also be able to use an HBase data source from a pure SQL (Spark SQL) query too, which
>>>>>> should work in the case of the Spark SQL JDBC Thrift Server (with USING; see
>>>>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).
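>>>>>> 
>>>>>> For instance, a minimal, untested sketch using the Phoenix Spark connector as that data
>>>>>> source (assuming the phoenix-spark package is on the classpath; the table and ZooKeeper
>>>>>> host are illustrative):
>>>>>> 
>>>>>> CREATE TABLE events_hbase
>>>>>> USING org.apache.phoenix.spark
>>>>>> OPTIONS (
>>>>>>   table "EVENTS",
>>>>>>   zkUrl "zookeeper-host:2181"
>>>>>> );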
>>>>>> 
>>>>>> 
>>>>>> _____________________________
>>>>>> From: Benjamin Kim <bbuild11@gmail.com>
>>>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>> To: Felix Cheung <felixcheung_m@hotmail.com>
>>>>>> Cc: <user@spark.apache.org>
>>>>>> 
>>>>>> 
>>>>>> Felix,
>>>>>> 
>>>>>> The only alternative I can think of is to create the equivalent of a stored procedure (a
>>>>>> UDF, in database terms) that would run Spark Scala code underneath. That way, I could use
>>>>>> the Spark SQL JDBC Thriftserver to execute it with SQL code, passing the keys and values I
>>>>>> want to UPSERT. I wonder if this is even possible, since I cannot CREATE a wrapper table on
>>>>>> top of an HBase table in Spark SQL.
>>>>>> 
>>>>>> What do you think? Is this the right approach?
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> 
>>>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <felixcheung_m@hotmail.com> wrote:
>>>>>> 
>>>>>> HBase has released support for Spark
>>>>>> hbase.apache.org/book.html#spark
>>>>>> 
>>>>>> And if you search you should find several alternative approaches.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bbuild11@gmail.com> wrote:
>>>>>> 
>>>>>> Does anyone know if Spark can work with HBase tables using Spark SQL? I know that in Hive
>>>>>> we are able to create tables on top of an underlying HBase table that can be accessed using
>>>>>> MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to set
>>>>>> up a way to GET and POST data to and from the HBase table using the Spark SQL JDBC
>>>>>> thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to
>>>>>> work, then we can load balance the thriftservers. In addition, this will benefit us by
>>>>>> giving us a way to abstract the data storage layer away from the presentation layer code.
>>>>>> There is a chance that we will swap out the data storage technology in the future; we are
>>>>>> currently experimenting with Kudu.
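>>>>>> 
>>>>>> For reference, a minimal sketch of the kind of Hive-over-HBase mapping I mean (the
>>>>>> standard HBase storage handler; table and column names are illustrative):
>>>>>> 
>>>>>> CREATE EXTERNAL TABLE hbase_events (
>>>>>>   rowkey  STRING,
>>>>>>   payload STRING
>>>>>> )
>>>>>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>>>>>> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:payload")
>>>>>> TBLPROPERTIES ("hbase.table.name" = "events");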
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
