spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Sqoop vs spark jdbc
Date Wed, 21 Sep 2016 20:47:03 GMT
I think there might still be something messed up with the classpath. It complains in the logs about deprecated jars and deprecated configuration files.

> On 21 Sep 2016, at 22:21, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
> 
> Well, I am left to use Spark for importing data from an RDBMS table into Hadoop.
> 
> You may ask why, and it is because Spark does it in one process and with no errors.
> 
> With Sqoop I am getting the error message below, which leaves the RDBMS table data in an HDFS file but stops there.
> 
> 2016-09-21 21:00:15,084 [myid:] - INFO  [main:OraOopLog@103] - Data Connector for Oracle and Hadoop is disabled.
> 2016-09-21 21:00:15,095 [myid:] - INFO  [main:SqlManager@98] - Using default fetchSize of 1000
> 2016-09-21 21:00:15,095 [myid:] - INFO  [main:CodeGenTool@92] - Beginning code generation
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-hive.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-thin-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/home/hduser/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> 2016-09-21 21:00:15,681 [myid:] - INFO  [main:OracleManager@417] - Time zone has been set to GMT
> 2016-09-21 21:00:15,717 [myid:] - INFO  [main:SqlManager@757] - Executing SQL statement: select * from sh.sales where            (1 = 0)
> 2016-09-21 21:00:15,727 [myid:] - INFO  [main:SqlManager@757] - Executing SQL statement: select * from sh.sales where            (1 = 0)
> 2016-09-21 21:00:15,748 [myid:] - INFO  [main:CompilationManager@94] - HADOOP_MAPRED_HOME is /home/hduser/hadoop-2.7.3/share/hadoop/mapreduce
> Note: /tmp/sqoop-hduser/compile/82dcf5975118b5e271b442e547201fdf/QueryResult.java uses or overrides a deprecated API.
> Note: Recompile with -Xlint:deprecation for details.
> 2016-09-21 21:00:17,354 [myid:] - INFO  [main:CompilationManager@330] - Writing jar file: /tmp/sqoop-hduser/compile/82dcf5975118b5e271b442e547201fdf/QueryResult.jar
> 2016-09-21 21:00:17,366 [myid:] - INFO  [main:ImportJobBase@237] - Beginning query import.
> 2016-09-21 21:00:17,511 [myid:] - WARN  [main:NativeCodeLoader@62] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2016-09-21 21:00:17,516 [myid:] - INFO  [main:Configuration@840] - mapred.jar is deprecated. Instead, use mapreduce.job.jar
> 2016-09-21 21:00:17,993 [myid:] - INFO  [main:Configuration@840] - mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
> 2016-09-21 21:00:18,094 [myid:] - INFO  [main:RMProxy@56] - Connecting to ResourceManager at rhes564/50.140.197.217:8032
> 2016-09-21 21:00:23,441 [myid:] - INFO  [main:DBInputFormat@192] - Using read commited transaction isolation
> 2016-09-21 21:00:23,442 [myid:] - INFO  [main:DataDrivenDBInputFormat@147] - BoundingValsQuery: SELECT MIN(prod_id), MAX(prod_id) FROM (select * from sh.sales where            (1 = 1) ) t1
> 2016-09-21 21:00:23,540 [myid:] - INFO  [main:JobSubmitter@394] - number of splits:4
> 2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] - mapred.job.name is deprecated. Instead, use mapreduce.job.name
> 2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] - mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
> 2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] - mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
> 2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] - mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
> 2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] - mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] - mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] - mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] - mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] - mapred.job.classpath.files is deprecated. Instead, use mapreduce.job.classpath.files
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] - user.name is deprecated. Instead, use mapreduce.job.user.name
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
> 2016-09-21 21:00:23,549 [myid:] - INFO  [main:Configuration@840] - mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
> 2016-09-21 21:00:23,549 [myid:] - INFO  [main:Configuration@840] - mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
> 2016-09-21 21:00:23,656 [myid:] - INFO  [main:JobSubmitter@477] - Submitting tokens for job: job_1474455325627_0045
> 2016-09-21 21:00:23,955 [myid:] - INFO  [main:YarnClientImpl@174] - Submitted application application_1474455325627_0045 to ResourceManager at rhes564/50.140.197.217:8032
> 2016-09-21 21:00:23,980 [myid:] - INFO  [main:Job@1272] - The url to track the job: http://http://rhes564:8088/proxy/application_1474455325627_0045/
> 2016-09-21 21:00:23,981 [myid:] - INFO  [main:Job@1317] - Running job: job_1474455325627_0045
> 2016-09-21 21:00:31,180 [myid:] - INFO  [main:Job@1338] - Job job_1474455325627_0045 running in uber mode : false
> 2016-09-21 21:00:31,182 [myid:] - INFO  [main:Job@1345] -  map 0% reduce 0%
> 2016-09-21 21:00:40,260 [myid:] - INFO  [main:Job@1345] -  map 25% reduce 0%
> 2016-09-21 21:00:44,283 [myid:] - INFO  [main:Job@1345] -  map 50% reduce 0%
> 2016-09-21 21:00:48,308 [myid:] - INFO  [main:Job@1345] -  map 75% reduce 0%
> 2016-09-21 21:00:55,346 [myid:] - INFO  [main:Job@1345] -  map 100% reduce 0%
> 2016-09-21 21:00:56,359 [myid:] - INFO  [main:Job@1356] - Job job_1474455325627_0045 completed successfully
> 2016-09-21 21:00:56,501 [myid:] - ERROR [main:ImportTool@607] - Imported Failed: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
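
For comparison, a minimal sketch of the one-process Spark JDBC import described above, in Scala (the Oracle connection URL, SID, credentials and HDFS output path are hypothetical placeholders, not values taken from this thread):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RdbmsImport").getOrCreate()

// Read the table through JDBC in a single Spark job -- no generated
// QueryResult.java and no separate MapReduce submission step.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@rhes564:1521:mydb") // hypothetical SID
  .option("dbtable", "sh.sales")
  .option("user", "scott")     // hypothetical user
  .option("password", "tiger") // hypothetical password
  .load()

// Land the data on HDFS in the same process.
df.write.mode("overwrite").parquet("hdfs:///data/sh_sales") // hypothetical path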
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  
> 
>> On 21 September 2016 at 20:56, Michael Segel <michael_segel@hotmail.com> wrote:
>> Uhmmm… 
>> 
>> A bit of a longer-ish answer…
>> 
>> Spark may or may not be faster than Sqoop. The standard caveats apply… YMMV.
>> 
>> The reason I say this is that you have a couple of limiting factors. The main one is the number of connections allowed by the target RDBMS.
>> 
>> Then there’s the data distribution within the partitions / ranges in the database. By this, I mean that with any parallel solution you need to run copies of your query in parallel over different ranges within the database. Most of the time you may run the query over a database where there is an even distribution… if not, then one thread will run longer than the others. Note that this is a problem that both solutions face.
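
For what it's worth, Spark's JDBC source exposes exactly this kind of range splitting through partitionColumn / lowerBound / upperBound / numPartitions -- the same idea as the MIN(prod_id)/MAX(prod_id) BoundingValsQuery in the Sqoop log above. A minimal sketch; the bounds and credentials are hypothetical, and in practice you would derive the bounds from MIN/MAX of the split column:

import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PartitionedJdbcRead").getOrCreate()

val props = new Properties()
props.setProperty("user", "scott")     // hypothetical user
props.setProperty("password", "tiger") // hypothetical password

// Spark opens numPartitions connections, each running the query over one
// sub-range of the split column. If prod_id is skewed, some partitions
// finish later than others -- the straggler problem described above.
val df = spark.read.jdbc(
  "jdbc:oracle:thin:@rhes564:1521:mydb", // hypothetical connection URL
  "sh.sales",                            // table
  "prod_id",                             // partition (split) column
  1L,                                    // lowerBound -- hypothetical; use SELECT MIN(prod_id)
  100000L,                               // upperBound -- hypothetical; use SELECT MAX(prod_id)
  4,                                     // numPartitions, matching the 4 splits in the Sqoop run
  props
)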
>> 
>> Then there’s the cluster itself.
>> Again, YMMV on your Spark job vs a MapReduce job.
>> 
>> In terms of launching the job, setup, etc., the Spark job could take longer to set up. But on long-running queries, that becomes noise.
>> 
>> The real issue is what makes the most sense to you: where you have the most experience, and what you feel most comfortable using.
>> 
>> The other issue is what you do with the data (RDDs, Datasets, DataFrames, etc.) once you have read it.
>> 
>> 
>> HTH
>> 
>> -Mike
>> 
>> PS. I know that I’m responding to an earlier message in the thread, but this is something that I’ve heard lots of questions about… and it’s not a simple thing to answer. Since this is a batch process, the performance issues are moot.
>> 
>>> On Aug 24, 2016, at 5:07 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>> 
>>> Personally I prefer Spark JDBC.
>>> 
>>> Both Sqoop and Spark rely on the same JDBC drivers.
>>> 
>>> I think Spark is faster, and if you have many nodes you can partition your incoming data and take advantage of Spark's DAG and in-memory processing.
>>> 
>>> By default Sqoop will use MapReduce, which is pretty slow.
>>> 
>>> Remember that for Spark you will need sufficient memory.
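
As a sketch of that pattern (the degree of parallelism and the output path here are hypothetical): repartition the fetched DataFrame across the nodes and cache it, so the analytics run from executor memory instead of re-querying the RDBMS.

// Assumes `spark` and a DataFrame `df` read over JDBC as sketched earlier.
val cached = df.repartition(8).cache() // 8 partitions -- hypothetical parallelism
cached.count()                         // force materialisation into memory

cached.createOrReplaceTempView("sales")
spark.sql("SELECT prod_id, COUNT(*) AS cnt FROM sales GROUP BY prod_id").show()

// Persist a copy to HDFS for downstream jobs.
cached.write.mode("overwrite").parquet("hdfs:///data/sh_sales_cached") // hypothetical path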
>>> 
>>> HTH
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>  
>>> 
>>>> On 24 August 2016 at 22:39, Venkata Penikalapati <mail.venkatakarthik@gmail.com> wrote:
>>>> Team, 
>>>> Please help me choose between Sqoop and Spark JDBC for fetching data from an RDBMS. Sqoop has a lot of optimizations for fetching data; does Spark JDBC have those as well?
>>>> 
>>>> I'm performing a few analytics in Spark for which the data resides in an RDBMS.
>>>> 
>>>> Please guide me with this. 
>>>> 
>>>> 
>>>> Thanks
>>>> Venkata Karthik P 
>>>> 
>>> 
>> 
> 
