sqoop-dev mailing list archives

From "Ruslan Dautkhanov (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SQOOP-2981) sqoop import from jdbc: JdbcWritableBridge.readBigDecimal() takes a ton of cpu
Date Wed, 06 Jul 2016 21:07:11 GMT

    [ https://issues.apache.org/jira/browse/SQOOP-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365060#comment-15365060 ]

Ruslan Dautkhanov edited comment on SQOOP-2981 at 7/6/16 9:06 PM:
------------------------------------------------------------------

Hi [~maugli],

Thanks a lot for the prompt response.

Sure, I will rerun Sqoop with Joeri's tool.
I'm not sure yet how to run it, so I created https://github.com/cerndb/Hadoop-Profiler/issues/1

{quote}It would require JDK8_u60 or above. Is it a doable scenario on your side?{quote}
It should be doable, as long as it's only the Sqoop client side that needs adjusting.
Is adding JAVA_HOME=/usr/java/JDK8_u60 to sqoop-conf/sqoop-env.sh enough?
We can't switch the whole cluster to Java 8 yet (it's still on jdk1.7.0_67-cloudera), but we can
temporarily switch the sqoop1 client to JDK 8 through sqoop-env.sh.
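
For example, something like this in sqoop-conf/sqoop-env.sh (using the path from my question above; the actual JDK 8u60 install location may differ):
{quote}
# sqoop-conf/sqoop-env.sh -- point only the Sqoop 1 client at JDK 8
export JAVA_HOME=/usr/java/JDK8_u60
{quote}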

{quote}The reason why I'm asking for that is the following:
JdbcWritableBridge#readBigDecimal is just a shorthand for java.sql.ResultSet#getBigDecimal(int),
so the implementation is within the Oracle JDBC driver, and thus Sqoop does not have much
control over that method.{quote}
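
If I read that right, the bridge is essentially a one-line delegation - roughly the sketch below (illustrative only, not the actual Sqoop source) - so the time in the profile is really spent inside the Oracle driver's getBigDecimal():
{quote}
// Rough sketch of the delegation described above; illustrative only,
// not the actual Sqoop source.
import java.math.BigDecimal;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ReadBigDecimalSketch {
  public static BigDecimal readBigDecimal(int colNum, ResultSet r)
      throws SQLException {
    // All of the work happens inside the JDBC driver, which is why
    // Sqoop has little control over the CPU cost of this call.
    return r.getBigDecimal(colNum);
  }
}
{quote}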
I can create an Oracle SR as well; we have database support from Oracle.

{quote}Meanwhile I'll try to investigate your scenario on my side, and check if I find any
clue as to what could cause your performance issue (or if we could find a workaround, e.g. mapping
to a different type instead of BigDecimal or so).{quote}
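
If it comes to a workaround, I could also try overriding the mapping per column with --map-column-java, so getBigDecimal() is never called. A rough, untested sketch (mapping the NUMBER columns to Double is just an assumption to try; possible precision loss would need checking):
{quote}
sqoop import 
        --connect "jdbc:oracle:thin:@jdbc_tns" 
        --username username --password password
        --as-parquetfile 
        --target-dir hdfs_dir
        --num-mappers num_mappers
        --map-column-java ID=Double,MERCHANTID=Double,RECENCY=Double,FREQUENCY=Double,SPEND=Double,SPENDING=Double,FREQ=Double
        --query "SELECT * FROM someschema.sometable WHERE \$CONDITIONS"
{quote}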

Thank you - strangely enough, Sqoop didn't import the data types correctly from Oracle:
{quote}
$ parquet-tools schema 154ff6db-0c3d-4424-928a-c5ceed95de70.parquet
message AutoGeneratedSchema \{
  optional binary ID (UTF8);
  optional binary PERSON_SEQ_NO (UTF8);
  optional binary MERCHANTID (UTF8);
  optional binary RECENCY (UTF8);
  optional binary FREQUENCY (UTF8);
  optional binary SPEND (UTF8);
  optional binary SPENDING (UTF8);
  optional binary FREQ (UTF8);
\}

{quote}

Notice they're all strings. Compare with the CREATE TABLE DDL on the Oracle side that I posted yesterday.

All columns are NUMBER on the Oracle side, except PERSON_SEQ_NO, which is VARCHAR2(50).
Sqoop created the Parquet schema with all columns as string; I had to fix this manually.
It might be worth creating a separate Sqoop JIRA for that - let me know and I'll post one.
I'm not sure whether this datatype mapping problem is related to the high CPU utilization in
JdbcWritableBridge#readBigDecimal?
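
For comparison, I would have expected the NUMBER columns to come out as Parquet decimals, roughly like this (a hypothetical sketch - the precision and scale are guesses, since the Oracle columns are declared as plain NUMBER):
{quote}
message AutoGeneratedSchema \{
  optional fixed_len_byte_array(16) ID (DECIMAL(38,0));
  optional binary PERSON_SEQ_NO (UTF8);
  optional fixed_len_byte_array(16) MERCHANTID (DECIMAL(38,0));
  optional fixed_len_byte_array(16) RECENCY (DECIMAL(38,0));
  optional fixed_len_byte_array(16) FREQUENCY (DECIMAL(38,0));
  optional fixed_len_byte_array(16) SPEND (DECIMAL(38,0));
  optional fixed_len_byte_array(16) SPENDING (DECIMAL(38,0));
  optional fixed_len_byte_array(16) FREQ (DECIMAL(38,0));
\}
{quote}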

Thanks,
Ruslan



> sqoop import from jdbc: JdbcWritableBridge.readBigDecimal() takes a ton of cpu
> ------------------------------------------------------------------------------
>
>                 Key: SQOOP-2981
>                 URL: https://issues.apache.org/jira/browse/SQOOP-2981
>             Project: Sqoop
>          Issue Type: Bug
>          Components: codegen, connectors/oracle, sqoop2-jdbc-connector
>    Affects Versions: 1.4.5, 1.4.6
>         Environment: sqoop import from Oracle; saves as parquet file
>            Reporter: Ruslan Dautkhanov
>              Labels: decimal, import, jdbc, oracle, parquet
>
> The majority of the time in a sqoop import from Oracle was spent converting Decimal values. It was
> 2.5x more than the next biggest CPU consumer (snappy compression).
> Sqoop was at 100% CPU overall; the Oracle side was pretty bored.
> {quote}
>  JvmTop 0.8.0 alpha - 11:56:45,  amd64, 48 cpus, Linux 2.6.32-57, load avg 0.92
>  http://code.google.com/p/jvmtop
>  Profiling PID 25489: org.apache.hadoop.mapred.YarnChild 10.20
>   38.78% (     7.68s) com.cloudera.sqoop.lib.JdbcWritableBridge.readBigDecimal()
>   14.27% (     2.82s) org.xerial.snappy.SnappyNative.rawCompress()
>   12.67% (     2.51s) parquet.io.api.Binary$FromStringBinary.encodeUTF8()
>   10.28% (     2.04s) ...quet.column.values.dictionary.DictionaryValuesWriter$()
>    4.80% (     0.95s) ...quet.column.values.fallback.FallbackValuesWriter.writ()
>    3.69% (     0.73s) com.cloudera.sqoop.lib.JdbcWritableBridge.readString()
>    2.51% (     0.50s) parquet.avro.AvroWriteSupport.writeRecordFields()
>    2.30% (     0.46s) parquet.column.impl.ColumnWriterV1.write()
>    1.90% (     0.38s) ...quet.it.unimi.dsi.fastutil.objects.Object2IntLinkedOp()
>    1.31% (     0.26s) ...quet.column.values.rle.RunLengthBitPackingHybridEncod()
>    1.27% (     0.25s) ...quet.column.values.dictionary.DictionaryValuesWriter$()
>    1.22% (     0.24s) parquet.hadoop.CodecFactory$BytesCompressor.compress()
>    0.65% (     0.13s) ...quet.column.values.dictionary.DictionaryValuesWriter$()
>    0.64% (     0.13s) ...quet.it.unimi.dsi.fastutil.objects.Object2IntLinkedOp()
>    0.64% (     0.13s) parquet.bytes.CapacityByteArrayOutputStream.addSlab()
>    0.63% (     0.12s) parquet.io.api.Binary$ByteArrayBackedBinary.getBytes()
>    0.62% (     0.12s) ...quet.column.values.dictionary.DictionaryValuesWriter.()
>    0.58% (     0.12s) parquet.bytes.CapacityByteArrayOutputStream.setByte()
>    0.49% (     0.10s) parquet.hadoop.codec.SnappyUtil.validateBuffer()
>    0.44% (     0.09s) parquet.hadoop.InternalParquetRecordWriter.write()
> {quote}
> DDL of the table on Oracle side:
> {quote}
> CREATE TABLE someschema.sometable
> (
>   ID             NUMBER                         NOT NULL,
>   psn  VARCHAR2(50 BYTE),
>   MERCHID     NUMBER,
>   RECENCY        NUMBER,
>   FREQUENCY      NUMBER,
>   SPEND          NUMBER,
>   SPENDING       NUMBER,
>   FREQ           NUMBER
> )
> {quote}
> Sqoop parameters:
> {quote}
> sqoop import 
>         -Dmapred.job.name="sqoop import into out_table" 
>         --connect "jdbc:oracle:thin:@jdbc_tns" 
>         --username username --password password
>         --direct 
>         --compress --compression-codec snappy 
>         --as-parquetfile 
>         --target-dir hdfs_dir
>         --num-mappers num_mappers
>         --query "SELECT * FROM someschema.sometable WHERE \$CONDITIONS"
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
