spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Using Spark on Hive with Hive also using Spark as its execution engine
Date Mon, 30 May 2016 20:52:43 GMT
I do not think that in-memory by itself will make things faster in all cases, especially if you
use Tez with ORC or Parquet. For ad-hoc queries on large datasets (independently of whether they
fit in memory or not) the storage format has a significant impact. This is an experience I have
also had with the in-memory options of Oracle or SQL Server. It might sound surprising, but there
are explanations. ORC and Parquet have min/max indexes, store and process data very efficiently
(it is important to choose the right datatypes; if everything is varchar then it is your fault
that the database is not performing), and only load into memory what is needed. This is usually
not the case for in-memory systems: everything is loaded into memory, not only the parts that are
needed, and in the absence of min/max indexes you have to go through everything. Let us assume a
table has a size of 10 TB, and there are different ad-hoc queries that each only process 1 GB
(each one addressing a different area). In Hive+Tez this is currently rather efficient: you load
1 GB (negligible in a cluster) and process 1 GB. In Spark you would cache the full 10 TB (you do
not know which part will be addressed), which takes a lot of time to load in the first place, and
each query then has to go through 10 TB in memory. This might be an extreme case, but it is not
uncommon. An exception is of course machine learning algorithms (the original purpose of Spark),
where I see more advantages for Spark. Most traditional companies probably have both use cases
(maybe with a bias towards the first); internet companies lean more towards the latter.
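
The min/max pruning argument can be sketched as a toy simulation (plain Python; the stripe layout and query range are invented for illustration and are not actual ORC internals):

```python
# Toy model of min/max (zone-map) pruning as done by ORC/Parquet readers.
# Each "stripe" carries min/max statistics for a column; the reader skips
# any stripe whose value range cannot contain matching rows.

def make_stripes(n_stripes, rows_per_stripe):
    """Stripes of consecutive ids, each represented by its (min, max) stats."""
    return [(i * rows_per_stripe + 1, (i + 1) * rows_per_stripe)
            for i in range(n_stripes)]

def stripes_to_read(stats, lo, hi):
    """Keep only stripes whose [min, max] overlaps the predicate [lo, hi]."""
    return [(mn, mx) for mn, mx in stats if not (mx < lo or mn > hi)]

stats = make_stripes(n_stripes=10_000, rows_per_stripe=1_000_000)  # "10 TB" table
needed = stripes_to_read(stats, lo=42_000_000, hi=43_000_000)      # "1 GB" query

# With min/max stats only a couple of stripes are touched; a cache holding the
# whole table without such stats must scan all 10,000 stripes for every query.
print(len(needed), "of", len(stats), "stripes read")
```

This is the gap described above: the stat-aware reader touches a handful of stripes per query, while a stat-free in-memory scan pays for the whole table every time.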

That being said, all of these systems are evolving. Hive supports Tez+LLAP, which is basically
its in-memory support. Spark stores data more efficiently in 1.5 and 1.6 (in the Dataset and
DataFrame APIs; the issue here is that it is not the same format as the files on disk). Let's
see if there will be a convergence; my bet is that both systems will continue to be used, each
optimized for its own use cases.

The bottom line is that you have to optimize first and think about what you need to do before
going in-memory. Never load everything into memory blindly; you will be surprised. Have multiple
technologies in your ecosystem and understand them. Unfortunately most consulting companies have
only poor experience and a limited understanding of the complete picture, and thus they fail with
both technologies, which is sad, because both can be extremely powerful and a competitive advantage.

> On 30 May 2016, at 21:49, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
> 
> yep Hortonworks supports Tez for one reason or other, which I am hopefully going to test as the query engine for Hive. Though I think Spark will be faster because of its in-memory support.
> 
> Also if you are independent then you are better off dealing with Spark and Hive without the need to support another stack like Tez.
> 
> Cloudera supports Impala instead of Hive, but it is not something I have used.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 30 May 2016 at 20:19, Michael Segel <msegel_hadoop@hotmail.com> wrote:
>> Mich, 
>> 
>> Most people use vendor releases because they need to have the support. 
>> Hortonworks is the vendor who has the most skin in the game when it comes to Tez.
>> 
>> If memory serves, Tez isn't going to be M/R but a local execution engine? Then LLAP is the in-memory piece to speed up Tez?
>> 
>> HTH
>> 
>> -Mike
>> 
>>> On May 29, 2016, at 1:35 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:
>>> 
>>> thanks. I think the problem is that the Tez user group is exceptionally quiet. I just sent an email to the Hive user group to see if anyone has managed to build a vendor-independent version.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>>  
>>> 
>>>> On 29 May 2016 at 21:23, Jörn Franke <jornfranke@gmail.com> wrote:
>>>> Well, I think it is different from MR. It has some optimizations which you do not find in MR. Especially the LLAP option in Hive 2 makes it interesting.
>>>> 
>>>> I think Hive 1.2 works with Tez 0.7, and Hive 2.0 with Tez 0.8. At least for 1.2 it is integrated in the Hortonworks distribution.
>>>> 
>>>> 
>>>>> On 29 May 2016, at 21:43, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:
>>>>> 
>>>>> Hi Jorn,
>>>>> 
>>>>> I started building apache-tez-0.8.2 but got a few errors. A couple of guys from the Tez user group kindly gave a hand, but I could not get very far (or maybe I did not make enough effort) making it work.
>>>>> 
>>>>> That TEZ user group is very quiet as well.
>>>>> 
>>>>> My understanding is that Tez is MR with DAG support, but of course Spark has both, plus in-memory capability.
>>>>> 
>>>>> It would be interesting to see which version of Tez works as the execution engine with Hive.
>>>>> 
>>>>> Vendors are divided on this (use Hive with Tez, or use Impala instead of Hive, etc.), as I am sure you already know.
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>>  
>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>  
>>>>> http://talebzadehmich.wordpress.com
>>>>>  
>>>>> 
>>>>>> On 29 May 2016 at 20:19, Jörn Franke <jornfranke@gmail.com>
wrote:
>>>>>> Very interesting do you plan also a test with TEZ?
>>>>>> 
>>>>>>> On 29 May 2016, at 13:40, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I did another study of Hive using Spark engine compared to Hive
with MR.
>>>>>>> 
>>>>>>> Basically I took the original table imported using Sqoop, and created and populated a new ORC table partitioned by year and month into 48 partitions, as follows:
>>>>>>> 
>>>>>>> <sales_partition.PNG>
>>>>>>> Connections use JDBC via beeline. With the MR engine each partition takes an average of 17 minutes, as seen below, and that is just one individual partition out of 48.
>>>>>>> 
>>>>>>> In contrast, doing the same operation with the Spark engine took 10 minutes all inclusive. I just gave up on MR. You can see the StartTime and FinishTime below:
>>>>>>> 
>>>>>>> <image.png>
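
For scale, the figures quoted above imply the following (simple arithmetic on the reported numbers, taking the 17-minute per-partition average at face value; not a measurement):

```python
# Back-of-envelope comparison using the figures from the post above.
partitions = 48
mr_minutes_per_partition = 17

mr_total_minutes = partitions * mr_minutes_per_partition  # sequential MR estimate
spark_total_minutes = 10                                  # reported, all inclusive

print(f"MR: ~{mr_total_minutes} min ({mr_total_minutes / 60:.1f} h), "
      f"Spark: {spark_total_minutes} min, "
      f"roughly {mr_total_minutes // spark_total_minutes}x difference")
```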
>>>>>>> 
>>>>>>> This by no means indicates that Spark is always much better than MR, but it shows that some very good results can be achieved using the Spark engine.
>>>>>>> 
>>>>>>> 
>>>>>>> Dr Mich Talebzadeh
>>>>>>>  
>>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>  
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>  
>>>>>>> 
>>>>>>>> On 24 May 2016 at 08:03, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> We use Hive as the database and Spark as an all-purpose query tool.
>>>>>>>> 
>>>>>>>> Whether Hive is the right database for the purpose, or whether one is better off with something like Phoenix on HBase, well, the answer is it depends and your mileage varies.
>>>>>>>> 
>>>>>>>> So fit for purpose.
>>>>>>>> 
>>>>>>>> Ideally what one wants is to use the fastest method to get the results. How fast is bounded by our SLA agreements in production, and that saves us from unnecessary further work, as we technologists like to play around.
>>>>>>>> 
>>>>>>>> So in short, we use Spark most of the time and use Hive as
the backend engine for data storage, mainly ORC tables.
>>>>>>>> 
>>>>>>>> We use Hive on Spark, and with Hive 2 on Spark 1.3.1 we have a combination that works for now. Granted, it would help to use Hive 2 on Spark 1.6.1, but at the moment that is one of my projects.
>>>>>>>> 
>>>>>>>> We do not use any vendor's products, as that enables us to avoid being tied down, after years of SAP, Oracle and MS dependency, to yet another vendor. Besides, there is some politics going on, with one vendor promoting Tez and another Spark as a backend. That is fine, but obviously we prefer to make an independent assessment ourselves.
>>>>>>>> 
>>>>>>>> My gut feeling is that one needs to look at the use case. Recently we had to import a very large table from Oracle to Hive and decided to use Spark 1.6.1, with Hive 2 on Spark 1.3.1, and that worked fine. We just used a JDBC connection with a temp table and it was good. We could have used Sqoop, but decided to settle for Spark, so it all depends on the use case.
>>>>>>>> 
>>>>>>>> HTH
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>  
>>>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>  
>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>  
>>>>>>>> 
>>>>>>>>> On 24 May 2016 at 03:11, ayan guha <guha.ayan@gmail.com>
wrote:
>>>>>>>>> Hi
>>>>>>>>> 
>>>>>>>>> Thanks for very useful stats. 
>>>>>>>>> 
>>>>>>>>> Did you do any benchmark of using Spark as the backend engine for Hive vs using the Spark Thrift Server (and running Spark code for Hive queries)? We are using the latter, but it would be very useful if we could remove the Thrift Server.
>>>>>>>>> 
>>>>>>>>>> On Tue, May 24, 2016 at 9:51 AM, Jörn Franke <jornfranke@gmail.com>
wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Mich,
>>>>>>>>>> 
>>>>>>>>>> I think these comparisons are useful. One interesting aspect could be hardware scalability in this context, and additionally different types of computations. Furthermore, one could compare Spark and Tez+LLAP as execution engines. I have the gut feeling that each one can be justified by different use cases.
>>>>>>>>>> Nevertheless, there should always be a disclaimer for such comparisons, because Spark and Hive are not good for a lot of concurrent lookups of single rows. They are also not good for frequently writing small amounts of data (e.g. sensor data). Here HBase could be more interesting. Other use cases can justify graph databases, such as Titan, or text analytics / data matching using Solr on Hadoop.
>>>>>>>>>> Finally, even if you have a lot of data, you need to think about whether you always have to process everything. For instance, I have found valid use cases in practice where we decided to evaluate 10 machine learning models in parallel on only a sample of the data, and then evaluate only the "winning" model on the total data.
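
The sample-first pattern described here can be sketched in plain Python (the candidate "models" and the scoring below are invented stand-ins; in a real pipeline they would be e.g. Spark MLlib estimators with a proper train/validation split):

```python
import random

random.seed(0)
full_data = [(x, 2.0 * x + 1.0) for x in range(100_000)]  # toy dataset: y = 2x + 1
sample = random.sample(full_data, 1_000)                  # cheap-to-process sample

# Ten candidate "models": guesses (a, b) for the linear rule y = a*x + b.
candidates = [(a, b) for a in (1.0, 2.0, 3.0) for b in (0.0, 1.0, 2.0)]
candidates.append((2.5, 0.5))

def mse(model, data):
    """Mean squared error of a candidate (a, b) over a list of (x, y) pairs."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in data) / len(data)

# Cheap pass: score every candidate on the small sample only.
winner = min(candidates, key=lambda m: mse(m, sample))

# Expensive pass: evaluate only the winning model on the full data.
final_error = mse(winner, full_data)
print("winner:", winner, "full-data MSE:", final_error)
```

The point of the pattern is that the expensive full-data pass runs once, for the winner, instead of ten times.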
>>>>>>>>>> 
>>>>>>>>>> As always it depends :) 
>>>>>>>>>> 
>>>>>>>>>> Best regards
>>>>>>>>>> 
>>>>>>>>>> P.S.: at least Hortonworks has in their distribution Spark 1.5 with Hive 1.2, and Spark 1.6 with Hive 1.2. Maybe they have described somewhere how they manage to bring both together. You may also check Apache Bigtop (a vendor-neutral distribution) for how they managed to bring both together.
>>>>>>>>>> 
>>>>>>>>>>> On 23 May 2016, at 01:42, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>>  
>>>>>>>>>>> I have done a number of extensive tests using
Spark-shell with Hive DB and ORC tables.
>>>>>>>>>>>  
>>>>>>>>>>> Now one issue that we typically face is, and I quote:
>>>>>>>>>>>  
>>>>>>>>>>> "Spark is fast as it uses memory and DAGs. Great, but when we save data it is not fast enough."
>>>>>>>>>>> 
>>>>>>>>>>> OK, but there is a solution now. If you use Spark with Hive and you are on a decent version of Hive (>= 0.14), then you can also deploy Spark as the execution engine for Hive. That will make your application run pretty fast, as you no longer rely on the old MapReduce engine for Hive. In a nutshell, you gain speed in both querying and storage.
>>>>>>>>>>>  
>>>>>>>>>>> I have made some comparisons on this set-up, and I am sure some of you will find them useful.
>>>>>>>>>>>  
>>>>>>>>>>> The version of Spark I use for Spark queries (Spark as a query tool) is 1.6.
>>>>>>>>>>> The version of Hive I use is Hive 2.
>>>>>>>>>>> The version of Spark I use as the Hive execution engine is 1.3.1. It works, and frankly Spark 1.3.1 as an execution engine is adequate (until we sort out the Hadoop libraries mismatch).
>>>>>>>>>>>  
>>>>>>>>>>> As an example, I am using Hive on the Spark engine to find the min and max of the IDs for a table with 1 billion rows:
>>>>>>>>>>>  
>>>>>>>>>>> 0: jdbc:hive2://rhes564:10010/default>  select
min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy;
>>>>>>>>>>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>>>>>>>>>  
>>>>>>>>>>>  
>>>>>>>>>>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>>>>>>>>>>>  
>>>>>>>>>>> INFO  : Completed compiling command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006);
Time taken: 1.911 seconds
>>>>>>>>>>> INFO  : Executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006):
select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>>>>>>>>> INFO  : Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>>>>>>>>> INFO  : Total jobs = 1
>>>>>>>>>>> INFO  : Launching Job 1 out of 1
>>>>>>>>>>> INFO  : Starting task [Stage-1:MAPRED] in serial
mode
>>>>>>>>>>>  
>>>>>>>>>>> Query Hive on Spark job[0] stages:
>>>>>>>>>>> 0
>>>>>>>>>>> 1
>>>>>>>>>>> Status: Running (Hive on Spark job[0])
>>>>>>>>>>> Job Progress Format
>>>>>>>>>>> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
[StageCost]
>>>>>>>>>>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22    Stage-1_0: 0/1
>>>>>>>>>>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>>>>>>>>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>>>>>>>>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
>>>>>>>>>>> INFO  :
>>>>>>>>>>> Query Hive on Spark job[0] stages:
>>>>>>>>>>> INFO  : 0
>>>>>>>>>>> INFO  : 1
>>>>>>>>>>> INFO  :
>>>>>>>>>>> Status: Running (Hive on Spark job[0])
>>>>>>>>>>> INFO  : Job Progress Format
>>>>>>>>>>> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
[StageCost]
>>>>>>>>>>> INFO  : 2016-05-23 00:21:19,062 Stage-0_0: 0/22    Stage-1_0: 0/1
>>>>>>>>>>> INFO  : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>>>>>>>>> INFO  : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>>>>>>>>> INFO  : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
>>>>>>>>>>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished    Stage-1_0: 0(+1)/1
>>>>>>>>>>> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished    Stage-1_0: 1/1 Finished
>>>>>>>>>>> Status: Finished successfully in 53.25 seconds
>>>>>>>>>>> OK
>>>>>>>>>>> INFO  : 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished    Stage-1_0: 0(+1)/1
>>>>>>>>>>> INFO  : 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished    Stage-1_0: 1/1 Finished
>>>>>>>>>>> INFO  : Status: Finished successfully in 53.25 seconds
>>>>>>>>>>> INFO  : Completed executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006);
Time taken: 56.337 seconds
>>>>>>>>>>> INFO  : OK
>>>>>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>>>>>> | c0  |     c1     |      c2       |          c3           |
>>>>>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>>>>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>>>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>>>>>> 1 row selected (58.529 seconds)
>>>>>>>>>>>  
>>>>>>>>>>> 58 seconds for the first run with a cold cache is pretty good.
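
As a sanity check, the figures in that result row can be verified analytically, assuming the id column runs exactly 1..100,000,000 (consistent with the min/max shown) and that Hive's stddev() here is the population standard deviation:

```python
import math

n = 100_000_000  # ids 1..n, matching min(id)=1 and max(id)=100000000 above

mean = (n + 1) / 2                    # average of the integers 1..n
stddev = math.sqrt((n * n - 1) / 12)  # population stddev of the integers 1..n

print(mean)    # 50000000.5, i.e. 5.00000005E7
print(stddev)  # ~28867513.4595, i.e. ~2.88675135E7
```

Both values match the c2 and c3 columns of the query output above.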
>>>>>>>>>>>  
>>>>>>>>>>> And let us compare it with running the same query on the map-reduce engine:
>>>>>>>>>>>  
>>>>>>>>>>> : jdbc:hive2://rhes564:10010/default> set
hive.execution.engine=mr;
>>>>>>>>>>> Hive-on-MR is deprecated in Hive 2 and may not
be available in the future versions. Consider using a different execution engine (i.e. spark,
tez) or using Hive 1.X releases.
>>>>>>>>>>> No rows affected (0.007 seconds)
>>>>>>>>>>> 0: jdbc:hive2://rhes564:10010/default>  select
min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy;
>>>>>>>>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and
may not be available in the future versions. Consider using a different execution engine (i.e.
spark, tez) or using Hive 1.X releases.
>>>>>>>>>>> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>>>>>>>>> Total jobs = 1
>>>>>>>>>>> Launching Job 1 out of 1
>>>>>>>>>>> Number of reduce tasks determined at compile
time: 1
>>>>>>>>>>> In order to change the average load for a reducer
(in bytes):
>>>>>>>>>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>>>>>>>>>> In order to limit the maximum number of reducers:
>>>>>>>>>>>   set hive.exec.reducers.max=<number>
>>>>>>>>>>> In order to set a constant number of reducers:
>>>>>>>>>>>   set mapreduce.job.reduces=<number>
>>>>>>>>>>> Starting Job = job_1463956731753_0005, Tracking
URL = http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>>>>>>>> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop
job  -kill job_1463956731753_0005
>>>>>>>>>>> Hadoop job information for Stage-1: number of
mappers: 22; number of reducers: 1
>>>>>>>>>>> 2016-05-23 00:26:38,127 Stage-1 map = 0%,  reduce
= 0%
>>>>>>>>>>> INFO  : Compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc):
select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>>>>>>>>> INFO  : Semantic Analysis Completed
>>>>>>>>>>> INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0,
type:int, comment:null), FieldSchema(name:c1, type:int, comment:null), FieldSchema(name:c2,
type:double, comment:null), FieldSchema(name:c3, type:double, comment:null)], properties:null)
>>>>>>>>>>> INFO  : Completed compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc);
Time taken: 0.144 seconds
>>>>>>>>>>> INFO  : Executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc):
select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>>>>>>>>> WARN  : Hive-on-MR is deprecated in Hive 2 and
may not be available in the future versions. Consider using a different execution engine (i.e.
spark, tez) or using Hive 1.X releases.
>>>>>>>>>>> INFO  : WARNING: Hive-on-MR is deprecated in
Hive 2 and may not be available in the future versions. Consider using a different execution
engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>>>>>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and
may not be available in the future versions. Consider using a different execution engine (i.e.
spark, tez) or using Hive 1.X releases.
>>>>>>>>>>> INFO  : Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>>>>>>>>> INFO  : Total jobs = 1
>>>>>>>>>>> INFO  : Launching Job 1 out of 1
>>>>>>>>>>> INFO  : Starting task [Stage-1:MAPRED] in serial
mode
>>>>>>>>>>> INFO  : Number of reduce tasks determined at
compile time: 1
>>>>>>>>>>> INFO  : In order to change the average load for
a reducer (in bytes):
>>>>>>>>>>> INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
>>>>>>>>>>> INFO  : In order to limit the maximum number
of reducers:
>>>>>>>>>>> INFO  :   set hive.exec.reducers.max=<number>
>>>>>>>>>>> INFO  : In order to set a constant number of
reducers:
>>>>>>>>>>> INFO  :   set mapreduce.job.reduces=<number>
>>>>>>>>>>> WARN  : Hadoop command-line option parsing not
performed. Implement the Tool interface and execute your application with ToolRunner to remedy
this.
>>>>>>>>>>> INFO  : number of splits:22
>>>>>>>>>>> INFO  : Submitting tokens for job: job_1463956731753_0005
>>>>>>>>>>> INFO  : The url to track the job: http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>>>>>>>> INFO  : Starting Job = job_1463956731753_0005,
Tracking URL = http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>>>>>>>> INFO  : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop
job  -kill job_1463956731753_0005
>>>>>>>>>>> INFO  : Hadoop job information for Stage-1: number
of mappers: 22; number of reducers: 1
>>>>>>>>>>> INFO  : 2016-05-23 00:26:38,127 Stage-1 map =
0%,  reduce = 0%
>>>>>>>>>>> 2016-05-23 00:26:44,367 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU 4.56 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:26:44,367 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU 4.56 sec
>>>>>>>>>>> 2016-05-23 00:26:50,558 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU 9.17 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:26:50,558 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU 9.17 sec
>>>>>>>>>>> 2016-05-23 00:26:56,747 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 14.04 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:26:56,747 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 14.04 sec
>>>>>>>>>>> 2016-05-23 00:27:02,944 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 18.64 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:27:02,944 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 18.64 sec
>>>>>>>>>>> 2016-05-23 00:27:08,105 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 23.25 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:27:08,105 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 23.25 sec
>>>>>>>>>>> 2016-05-23 00:27:14,298 Stage-1 map = 27%,  reduce = 0%, Cumulative CPU 27.84 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:27:14,298 Stage-1 map = 27%,  reduce = 0%, Cumulative CPU 27.84 sec
>>>>>>>>>>> 2016-05-23 00:27:20,484 Stage-1 map = 32%,  reduce = 0%, Cumulative CPU 32.56 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:27:20,484 Stage-1 map = 32%,  reduce = 0%, Cumulative CPU 32.56 sec
>>>>>>>>>>> 2016-05-23 00:27:26,659 Stage-1 map = 36%,  reduce = 0%, Cumulative CPU 37.1 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:27:26,659 Stage-1 map = 36%,  reduce = 0%, Cumulative CPU 37.1 sec
>>>>>>>>>>> 2016-05-23 00:27:32,839 Stage-1 map = 41%,  reduce = 0%, Cumulative CPU 41.74 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:27:32,839 Stage-1 map = 41%,  reduce = 0%, Cumulative CPU 41.74 sec
>>>>>>>>>>> 2016-05-23 00:27:39,003 Stage-1 map = 45%,  reduce = 0%, Cumulative CPU 46.32 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:27:39,003 Stage-1 map = 45%,  reduce = 0%, Cumulative CPU 46.32 sec
>>>>>>>>>>> 2016-05-23 00:27:45,173 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 50.93 sec
>>>>>>>>>>> 2016-05-23 00:27:50,316 Stage-1 map = 55%,  reduce = 0%, Cumulative CPU 55.55 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:27:45,173 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 50.93 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:27:50,316 Stage-1 map = 55%,  reduce = 0%, Cumulative CPU 55.55 sec
>>>>>>>>>>> 2016-05-23 00:27:56,482 Stage-1 map = 59%,  reduce = 0%, Cumulative CPU 60.25 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:27:56,482 Stage-1 map = 59%,  reduce = 0%, Cumulative CPU 60.25 sec
>>>>>>>>>>> 2016-05-23 00:28:02,642 Stage-1 map = 64%,  reduce = 0%, Cumulative CPU 64.86 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:28:02,642 Stage-1 map = 64%,  reduce = 0%, Cumulative CPU 64.86 sec
>>>>>>>>>>> 2016-05-23 00:28:08,814 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 69.41 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:28:08,814 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 69.41 sec
>>>>>>>>>>> 2016-05-23 00:28:14,977 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 74.06 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:28:14,977 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 74.06 sec
>>>>>>>>>>> 2016-05-23 00:28:21,134 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 78.72 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:28:21,134 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 78.72 sec
>>>>>>>>>>> 2016-05-23 00:28:27,282 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 83.32 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:28:27,282 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 83.32 sec
>>>>>>>>>>> 2016-05-23 00:28:33,437 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 87.9 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:28:33,437 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 87.9 sec
>>>>>>>>>>> 2016-05-23 00:28:38,579 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU 92.52 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:28:38,579 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU 92.52 sec
>>>>>>>>>>> 2016-05-23 00:28:44,759 Stage-1 map = 95%,  reduce = 0%, Cumulative CPU 97.35 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:28:44,759 Stage-1 map = 95%,  reduce = 0%, Cumulative CPU 97.35 sec
>>>>>>>>>>> 2016-05-23 00:28:49,915 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 99.6 sec
>>>>>>>>>>> INFO  : 2016-05-23 00:28:49,915 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 99.6 sec
>>>>>>>>>>> 2016-05-23 00:28:54,043 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 101.4 sec
>>>>>>>>>>> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>>>>>>>>>>> Ended Job = job_1463956731753_0005
>>>>>>>>>>> MapReduce Jobs Launched:
>>>>>>>>>>> Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4 sec   HDFS Read: 5318569 HDFS Write: 46 SUCCESS
>>>>>>>>>>> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>>>>>>>>> OK
>>>>>>>>>>> INFO  : 2016-05-23 00:28:54,043 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 101.4 sec
>>>>>>>>>>> INFO  : MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>>>>>>>>>>> INFO  : Ended Job = job_1463956731753_0005
>>>>>>>>>>> INFO  : MapReduce Jobs Launched:
>>>>>>>>>>> INFO  : Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4 sec   HDFS Read: 5318569 HDFS Write: 46 SUCCESS
>>>>>>>>>>> INFO  : Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>>>>>>>>> INFO  : Completed executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 142.525 seconds
>>>>>>>>>>> INFO  : OK
>>>>>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>>>>>> | c0  |     c1     |      c2       |          c3           |
>>>>>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>>>>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>>>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>>>>>> 1 row selected (142.744 seconds)
>>>>>>>>>>>  
>>>>>>>>>>> OK, Hive on the map-reduce engine took 142 seconds, compared to 58 seconds with Hive on Spark. So you can obviously gain quite a lot by using Hive on Spark.
>>>>>>>>>>>  
>>>>>>>>>>> Please also note that I did not use any vendor's build for this purpose; I compiled Spark 1.3.1 myself.
>>>>>>>>>>>  
>>>>>>>>>>> HTH
>>>>>>>>>>>  
>>>>>>>>>>>  
>>>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>>>  
>>>>>>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>>>>  
>>>>>>>>>>> http://talebzadehmich.wordpress.com/
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -- 
>>>>>>>>> Best Regards,
>>>>>>>>> Ayan Guha
> 
