sqoop-dev mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SQOOP-2920) sqoop performance deteriorates significantly on wide datasets; sqoop 100% on cpu
Date Thu, 19 May 2016 14:31:13 GMT

    [ https://issues.apache.org/jira/browse/SQOOP-2920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291169#comment-15291169 ]

ASF subversion and git services commented on SQOOP-2920:
--------------------------------------------------------

Commit ac217a032c0755edac713ff53b3f55b7f2f46706 in sqoop's branch refs/heads/trunk from [~jarcec]
[ https://git-wip-us.apache.org/repos/asf?p=sqoop.git;h=ac217a0 ]

Revert "SQOOP-2920: sqoop performance deteriorates significantly on wide datasets; sqoop 100%
on cpu"

I've mistakenly committed SQOOP-2920 and SQOOP-2906 inside this commit,
so I'll revert it and commit them separately.
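
Undoing a commit that mixed two issues and re-applying them separately is typically done
along these lines. This is a sketch only: the short hash ac217a0 comes from the link above,
while the re-apply steps are an assumption about the workflow, not taken from the actual
commit history.

    git revert ac217a0   # creates a new commit on trunk that undoes the combined change
    # ...then apply and commit the SQOOP-2920 and SQOOP-2906 changes as two separate
    # commits, e.g. with git apply followed by git commit for each patch file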


> sqoop performance deteriorates significantly on wide datasets; sqoop 100% on cpu
> --------------------------------------------------------------------------------
>
>                 Key: SQOOP-2920
>                 URL: https://issues.apache.org/jira/browse/SQOOP-2920
>             Project: Sqoop
>          Issue Type: Bug
>          Components: connectors/oracle, hive-integration, metastore
>    Affects Versions: 1.4.5, 1.4.6
>         Environment: - sqoop export on a very wide dataset (over 700 columns)
> - sqoop export to oracle
> - subset of columns is exported (using --columns argument)
> - parquet files
> - --table --hcatalog-database --hcatalog-table options are used
>            Reporter: Ruslan Dautkhanov
>            Assignee: Attila Szabo
>            Priority: Critical
>              Labels: columns, hive, oracle, perfomance
>             Fix For: 1.4.7
>
>         Attachments: SQOOP-2920.patch, jstack.zip, top - sqoop mappers hog cpu.png
>
>
> We run sqoop export from our data lake to Oracle quite often.
> Whenever we sqoop "narrow" datasets, Oracle is the side with scalability issues: the 3-node
> all-flash Oracle RAC normally can't keep up with more than 45-55 sqoop mappers, while the
> map-reduce framework shows the sqoop mappers themselves are not heavily loaded.
> On wide datasets the picture is the opposite: Oracle shows 95% of sessions idle, waiting
> for new INSERTs, even when we go over a hundred mappers. Sqoop has serious scalability
> issues on very wide datasets. (Our company normally has very wide datasets.)
> For example, on the last sqoop export:
> it started ~2.5 hours ago, and the 95 mappers have already accumulated
> CPU time spent (ms): 1,065,858,760
> (looking at this metric through the map-reduce framework stats)
> That is about 1 million seconds of CPU time, or 11,219.57 seconds per mapper, which is
> roughly 3.11 hours of CPU time per mapper in a ~2.5 hour wall-clock run.
> So the mappers are pegged at 100% CPU.
> I will also attach the jstack files.
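
To make the environment above concrete, here is a minimal sketch of the kind of export
invocation being described. The connection string, credentials, table and database names,
column list, and mapper count are illustrative placeholders, not values from the report;
only the combination of options (--table, --columns, --hcatalog-database, --hcatalog-table)
is taken from the Environment field above.

    sqoop export \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username scott \
      --password-file /user/scott/.oracle.pw \
      --table TARGET_TABLE \
      --columns "COL1,COL2,COL3" \
      --hcatalog-database default \
      --hcatalog-table wide_parquet_table \
      --num-mappers 95

In the reported setup the underlying HCatalog table is a Parquet dataset with over 700
columns while --columns selects only a subset, and it is with this combination that the
mappers become CPU-bound.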



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
