sqoop-user mailing list archives

From David Robson <David.Rob...@software.dell.com>
Subject RE: Using more than a single mapper per partition with OraOop
Date Sat, 08 Nov 2014 01:52:36 GMT
So it queries dba_extents to get the extent map. I am presuming your database has a lot of
extents, which is why the query takes so long. If you edit oraoop-site.xml and look at
“oraoop.oracle.session.initialization.statements” you can remove the statement that disables
parallel query. You could then add NOPARALLEL(t) to “oraoop.import.hint” to ensure the actual
mappers don’t run in parallel. There is no guarantee the dba_extents query will then run in
parallel, but you could give it a shot.
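
For reference, a rough sketch of that change applied as -D options on the import command
instead of editing oraoop-site.xml. The two statements kept here are assumed from the
defaults listed in the OraOop documentation (with the disable-parallel statement removed),
so verify them against your installed version; remaining options as in the original command:

# Keep the default session setup minus "alter session disable parallel query",
# and hint the mapper queries with NOPARALLEL instead, as suggested above.
sqoop import \
  -Doraoop.oracle.session.initialization.statements="alter session set \"_serial_direct_read\"=true;alter session set tracefile_identifier=oraoop" \
  -Doraoop.import.hint="NOPARALLEL(t)" \
  --connect connect_string --table "WAREHOUSE.BIG_TABLE" \
  --username user --password password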

As for the error: it looks related, in that you have so many blocks the count no longer fits
into an int. This should be a simple fix, changing the datatype to a long throughout the code.
I’ll log an issue next week and look into this.
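
As an aside, a quick way to check the extent-count presumption above. An untested sketch:
the credentials and connect alias are placeholders, and the owner/table come from the import
command further down the thread:

# Count the extents OraOop has to map for the table (requires access to the
# dba_extents view; for a partitioned table this counts all partitions).
sqlplus -s user/password@connect_string <<'SQL'
SELECT COUNT(*) AS extent_count
FROM   dba_extents
WHERE  owner = 'WAREHOUSE'
  AND  segment_name = 'BIG_TABLE';
SQL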

From: Joshua Baxter [mailto:joshuagbaxter@gmail.com]
Sent: Friday, 7 November 2014 6:24 AM
To: user@sqoop.apache.org
Subject: Re: Using more than a single mapper per partition with OraOop

Hi David,

Thanks for your reply. I think the incorrect chunking is exactly my problem. I've now changed
that; however, I have hit a couple more issues. Firstly, after launching the job I am now getting
the following error after the query to fetch the block information.

14/11/06 14:17:58 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException:
java.sql.SQLException: Numeric Overflow
        at com.quest.oraoop.OraOopDataDrivenDBInputFormat.getSplits(OraOopDataDrivenDBInputFormat.java:120)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:498)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:515)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:399)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
        at org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:198)
        at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:171)
        at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:268)
        at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:665)
        at com.quest.oraoop.OraOopConnManager.importTable(OraOopConnManager.java:260)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
Caused by: java.sql.SQLException: Numeric Overflow
        at oracle.jdbc.driver.NumberCommonAccessor.throwOverflow(NumberCommonAccessor.java:4170)
        at oracle.jdbc.driver.NumberCommonAccessor.getInt(NumberCommonAccessor.java:119)
        at oracle.jdbc.driver.GeneratedStatement.getInt(GeneratedStatement.java:217)
        at oracle.jdbc.driver.GeneratedScrollableResultSet.getInt(GeneratedScrollableResultSet.java:522)
        at oracle.jdbc.driver.GeneratedResultSet.getInt(GeneratedResultSet.java:1350)
        at com.quest.oraoop.OraOopOracleQueries.getOracleDataChunksExtent(OraOopOracleQueries.java:271)
        at com.quest.oraoop.OraOopDataDrivenDBInputFormat.getSplits(OraOopDataDrivenDBInputFormat.java:74)
        ... 23 more

Secondly, the query to pull out the block data is taking a really long time, about 10 minutes.
This query also runs with parallel query disabled. I was wondering if there is any way to
enable it for just this step to reduce the latency?



On Wed, Nov 5, 2014 at 2:00 AM, David Robson <David.Robson@software.dell.com> wrote:
For importing a single partition, you should be able to set the chunk method to ROWID,
then set oraoop.import.partitions to the partition you are importing. This will split that
one partition by ROWID across as many mappers as you like.

Also, you shouldn’t need any no-parallel hints; the direct connector disables parallel
query when it first connects, so that shouldn’t be a problem.

So in your command below, can you change oraoop.chunk.method to ROWID (or just leave it out,
as it is the default) and let me know if that works for you?
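
Something like this, adapting the command quoted below (a sketch; the connect string,
partition name, and paths are still your placeholders):

# One partition, split by ROWID across 20 mappers. The PARTITION chunk method
# and the no_parallel hint are dropped: ROWID is the default chunk method, and
# the connector already disables parallel query when it connects.
sqoop import \
  -Doraoop.import.partitions=partition_name \
  -Doraoop.timestamp.string=false \
  --connect connect_string \
  --table "WAREHOUSE.BIG_TABLE" \
  --fetch-size 100000 -m 20 \
  --target-dir /user/hive/warehouse/database/partition \
  --as-parquetfile \
  --username user --password password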

From: Joshua Baxter [mailto:joshuagbaxter@gmail.com]
Sent: Tuesday, 4 November 2014 8:53 AM
To: user@sqoop.apache.org
Subject: Re: Using more than a single mapper per partition with OraOop

We will mostly be wanting to bring in a single partition at a time, but there will also be
occasions where we would need to pull down the whole table.

sqoop import \
  -Doraoop.import.hint="no_parallel" \
  -Doraoop.chunk.method=PARTITION \
  -Doraoop.timestamp.string=false \
  -Doraoop.import.partitions=partition_name \
  --connect connect_string \
  --table "WAREHOUSE.BIG_TABLE" \
  --fetch-size 100000 -m 20 \
  --target-dir /user/hive/warehouse/database/partition \
  --as-parquetfile \
  --username user --password password

On Mon, Nov 3, 2014 at 9:40 PM, Gwen Shapira <gshapira@cloudera.com> wrote:
Do you need to get just one partition, or is the ultimate goal to use all partitions?

Also, can you share the exact Oraoop command you used?

On Mon, Nov 3, 2014 at 1:32 PM, Joshua Baxter <joshuagbaxter@gmail.com> wrote:
Apologies if this question has been asked before.

I have a very large table in Oracle with hundreds of partitions, and we want to be able to
import it into Parquet on HDFS a partition at a time as part of an ETL process. The table has
evolved over time and there is not a single column without significant skew, meaning that
mappers get very uneven row counts when using the standard Sqoop connector and split-by. Impala
is the target platform for the data, so we also want to keep the file sizes under the
cluster block size to prevent remote streaming when we use the data. I've just discovered
OraOop and it sounds like exactly the tool we need to import the data in an efficient and
predictable way.

Unfortunately, the problem I'm now having is that if I use the partition option to choose just
a single partition, this always equates to exactly one mapper. The sort of speed and output
file sizes we are looking for would call for something like 40.

Are there any options I can set to increase the number of mappers when pulling data from a
single table partition?
