mxnet-issues mailing list archives

From "Sheng Zha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MXNET-18) Should investigate why __del__ takes a long time
Date Sat, 10 Feb 2018 01:37:00 GMT

    [ https://issues.apache.org/jira/browse/MXNET-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16359184#comment-16359184 ]

Sheng Zha commented on MXNET-18:
--------------------------------

Recently szha@ attempted to switch the accuracy metric, which was previously based on numpy,
to ndarray. The change was later reverted because it caused a performance regression for
ResNet-50 on 8 Volta GPUs. Looking into the accuracy metric speed for this setup, I found
the following numbers:

Version                        Throughput   Time cost
previous numpy                    5753.84      94.509
ndarray, per-batch blocking       4456.85     120.803
latest ndarray                    4459.25     120.602
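
To put the gap in relative terms, here is a minimal sketch that computes the change of each
ndarray run against the numpy baseline. The figures are copied from the measurements above;
the names `runs` and `relative_change` are illustrative only. It works out to roughly a 22.5%
drop in throughput and roughly a 28% increase in time cost for both ndarray versions.

```python
# Figures copied from the measurements reported in this message.
runs = {
    "previous numpy": {"throughput": 5753.84, "time_cost": 94.509},
    "ndarray, per-batch blocking": {"throughput": 4456.85, "time_cost": 120.803},
    "latest ndarray": {"throughput": 4459.25, "time_cost": 120.602},
}


def relative_change(new, old):
    """Return the fractional change of `new` relative to `old`."""
    return (new - old) / old


baseline = runs["previous numpy"]
for name, run in runs.items():
    if run is baseline:
        continue  # skip comparing the baseline with itself
    throughput = relative_change(run["throughput"], baseline["throughput"])
    time_cost = relative_change(run["time_cost"], baseline["time_cost"])
    print(f"{name}: throughput {throughput:+.1%}, time cost {time_cost:+.1%}")
```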

 

To understand why, I did some simple profiling, and here’s what I found:

Cumulative time spent in the metric is now 42.96%, versus only 2.51% for the numpy version.

The majority of that time is spent in __del__, followed by astype.

The profiling results can be found at

numpy version: [http://vmprof.com/#/f0553a92-e576-410d-8784-71e185c8a39d?id=3,3,4&view=flames]

latest ndarray version: [http://vmprof.com/#/e35f2e5d-f8f2-4c7b-b955-8a58da5f8a88?id=3,2,4&view=flames]
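
For anyone who wants to check the share of time spent in __del__ and astype locally, the
following is a rough sketch of the measurement idea. It uses the standard-library cProfile
rather than vmprof, and `FakeArray` and `update_metric` are hypothetical stand-ins written
for this illustration, not MXNet classes or functions. The work done inside `__del__` is
simulated, so the percentages it prints say nothing about MXNet itself.

```python
import cProfile
import pstats


class FakeArray:
    """Hypothetical stand-in for an array handle; NOT an MXNet class."""

    def __init__(self, values):
        self.values = list(values)

    def astype(self, cast):
        # Returns a new object, so every call creates one more thing to delete.
        return FakeArray(cast(v) for v in self.values)

    def __del__(self):
        # Simulated cost of releasing a handle (assumption for illustration).
        sum(range(200))


def update_metric(num_batches=2000):
    """Toy accuracy update that creates and drops temporaries per batch."""
    correct = 0
    for _ in range(num_batches):
        pred = FakeArray([0, 1, 1, 0])
        label = FakeArray([0.0, 1.0, 0.0, 0.0]).astype(int)
        correct += sum(p == l for p, l in zip(pred.values, label.values))
    return correct


profiler = cProfile.Profile()
profiler.enable()
update_metric()
profiler.disable()

stats = pstats.Stats(profiler)
total = stats.total_tt  # total time recorded by the profiler

# Sum cumulative time per function name across all recorded entries.
per_function = {}
for (_file, _line, funcname), (_cc, _nc, _tt, ct, _callers) in stats.stats.items():
    per_function[funcname] = per_function.get(funcname, 0.0) + ct

for name in ("__del__", "astype"):
    print(f"{name}: {per_function[name] / total:.1%} of profiled time")
```

Pointing the same profiling harness at the real metric update instead of the stand-in would
show whether deletion dominates there as well.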

 

It’s surprising that __del__ takes up so much time. Would you mind looking into why that
is and how we can overcome this problem?

 

To reproduce the speed numbers, I used the existing example in mxnet. The following command
runs the script without depending on the data record files, even though the --data-train
and --data-val arguments might suggest otherwise:

python example/image-classification/train_imagenet.py \
    --gpu 0,1,2,3,4,5,6,7 --batch-size 1024 --num-epochs 1 \
    --data-train /data/imagenet/train-480-val-256-recordio/train.rec \
    --data-train-idx /data/imagenet/train-480-val-256-recordio/train.idx \
    --data-val /data/imagenet/train-480-val-256-recordio/val.rec \
    --disp-batches 100 --network resnet-v1 --num-layers 50 --data-nthreads 40 \
    --min-random-scale 0.533 --max-random-shear-ratio 0 --max-random-rotate-angle 0 \
    --max-random-h 0 --max-random-l 0 --max-random-s 0 \
    --dtype float16 --benchmark 1 --kv-store device

> Should investigate why __del__ takes a long time
> ------------------------------------------------
>
>                 Key: MXNET-18
>                 URL: https://issues.apache.org/jira/browse/MXNET-18
>             Project: Apache MXNet
>          Issue Type: Bug
>            Reporter: Sheng Zha
>            Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@mxnet.apache.org
For additional commands, e-mail: issues-help@mxnet.apache.org

