mxnet-issues mailing list archives

From "Lin Yuan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MXNET-1027) Horovod Random Segfault during Training
Date Wed, 10 Oct 2018 20:30:00 GMT

    [ https://issues.apache.org/jira/browse/MXNET-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645504#comment-16645504 ]

Lin Yuan commented on MXNET-1027:
---------------------------------

Hi Carl, 

Can you also post the command or script used to reproduce this error? Thanks.

Lin

> Horovod Random Segfault during Training
> ---------------------------------------
>
>                 Key: MXNET-1027
>                 URL: https://issues.apache.org/jira/browse/MXNET-1027
>             Project: Apache MXNet
>          Issue Type: Bug
>          Components: Horovod
>            Reporter: Carl Yang
>            Priority: Minor
>
> setup: 8 GPUs on p3.16xlarge
> commit: most-likely Horovod branch: (0a0240113fe5a24ec2c772fd7309840ba179562a)
> nohup: ignoring input and appending output to 'nohup.out'
> INFO:root:start with arguments Namespace(batch_size=128, benchmark=0, brightness=0.4, contrast=0.4, data_nthreads=4, data_train='/media/ramdisk/train-passthrough.rec', data_train_idx='/media/ramdisk/train-passthrough.idx', data_val='/media/ramdisk/val-passthrough.rec', data_val_idx='/media/ramdisk/val-passthrough.idx', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0', image_shape='3,224,224', initializer='default', kv_store='None', load_epoch=None, loss='', lr=0.8, lr_factor=0.1, lr_step_epochs='30,60,80', macrobatch_size=0, max_random_area=1, max_random_aspect_ratio=1.3333333333333333, max_random_h=0, max_random_l=0, max_random_rotate_angle=0, max_random_s=0, max_random_scale=1, max_random_shear_ratio=0, min_random_area=0.08, min_random_aspect_ratio=0.75, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet-v1', num_classes=1000, num_epochs=90, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, pca_noise=0.1, random_crop=0, random_mirror=0, random_resized_crop=1, rgb_mean='123.68,116.779,103.939', saturation=0.4, save_period=1, test_io=0, top_k=0, warmup_epochs=10, warmup_strategy='linear', wd=0.0001)
> …
> INFO:root:Epoch[67] Batch [1140-1160]   Speed: 334.12 samples/sec       accuracy=0.710156
> INFO:root:Epoch[67] Batch [1140-1160]   Speed: 335.77 samples/sec       accuracy=0.719922
> INFO:root:Epoch[67] Batch [1140-1160]   Speed: 334.73 samples/sec       accuracy=0.714063
> INFO:root:Epoch[67] Batch [1140-1160]   Speed: 334.85 samples/sec       accuracy=0.721875
> INFO:root:Epoch[67] Batch [1140-1160]   Speed: 334.34 samples/sec       accuracy=0.711719
> INFO:root:Epoch[67] Batch [1140-1160]   Speed: 333.82 samples/sec       accuracy=0.714844
> INFO:root:Epoch[67] Batch [1160-1180]   Speed: 329.31 samples/sec       accuracy=0.722656
> INFO:root:Epoch[67] Batch [1160-1180]   Speed: 329.31 samples/sec       accuracy=0.705859
> INFO:root:Epoch[67] Batch [1160-1180]   Speed: 329.25 samples/sec       accuracy=0.712891
> INFO:root:Epoch[67] Batch [1160-1180]   Speed: 329.31 samples/sec       accuracy=0.723828
> INFO:root:Epoch[67] Batch [1160-1180]   Speed: 329.26 samples/sec       accuracy=0.717969
> INFO:root:Epoch[67] Batch [1160-1180]   Speed: 329.71 samples/sec       accuracy=0.716016
> INFO:root:Epoch[67] Batch [1160-1180]   Speed: 329.03 samples/sec       accuracy=0.722656
> INFO:root:Epoch[67] Batch [1160-1180]   Speed: 329.27 samples/sec       accuracy=0.716797
> Segmentation fault: 11
> Stack trace returned 8 entries:
> [bt] (0) /home/ubuntu/master/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f7233aacaeb]
> [bt] (1) /home/ubuntu/master/lib/libmxnet.so(+0x3e4d74f) [0x7f7236b9a74f]
> [bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f724a0be4b0]
> [bt] (3) /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(horovod::MX::HandleManager::ExecuteCallback(int)+0x19) [0x7f7227ef7009]
> [bt] (4) /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(+0x24b2b) [0x7f7227edab2b]
> [bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f7180a6bc80]
> [bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f724a45a6ba]
> [bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f724a19041d]
> Segmentation fault: 11
> Stack trace returned 9 entries:
> [bt] (0) /home/ubuntu/master/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f7233aacaeb]
> [bt] (1) /home/ubuntu/master/lib/libmxnet.so(+0x3e4d74f) [0x7f7236b9a74f]
> [bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f724a0be4b0]
> [bt] (3) /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(horovod::MX::HandleManager::ExecuteCallback(int)+0x19) [0x7f7227ef7009]
> [bt] (4) /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(+0x259fc) [0x7f7227edb9fc]
> [bt] (5) /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(+0x30e6a) [0x7f7227ee6e6a]
> [bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f7180a6bc80]
> [bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f724a45a6ba]
> [bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f724a19041d]
> terminate called after throwing an instance of 'std::system_error'
>   what():  Resource deadlock avoided
> [ip-172-31-9-223:33837] *** Process received signal ***
> [ip-172-31-9-223:33837] Signal: Aborted (6)
> [ip-172-31-9-223:33837] Signal code:  (-6)
> [ip-172-31-9-223:33837] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f724a464390]
> [ip-172-31-9-223:33837] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7f724a0be428]
> [ip-172-31-9-223:33837] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f724a0c002a]
> [ip-172-31-9-223:33837] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x16d)[0x7f7180a4284d]
> [ip-172-31-9-223:33837] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8d6b6)[0x7f7180a406b6]
> [ip-172-31-9-223:33837] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c6a9)[0x7f7180a3f6a9]
> [ip-172-31-9-223:33837] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2e5)[0x7f7180a40005]
> [ip-172-31-9-223:33837] [ 7] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0xff83)[0x7f718058af83]
> [ip-172-31-9-223:33837] [ 8] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0xfb)[0x7f718058b2eb]
> [ip-172-31-9-223:33837] [ 9] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x5c)[0x7f7180a4090c]
> [ip-172-31-9-223:33837] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt20__throw_system_errori+0x8e)[0x7f7180a697fe]
> [ip-172-31-9-223:33837] [11] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSt6thread4joinEv+0x18)[0x7f7180a6bb88]
> [ip-172-31-9-223:33837] [12] /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(+0x243e3)[0x7f7227eda3e3]
> [ip-172-31-9-223:33837] [13] /lib/x86_64-linux-gnu/libc.so.6(+0x39ff8)[0x7f724a0c2ff8]
> [ip-172-31-9-223:33837] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x3a045)[0x7f724a0c3045]
> [ip-172-31-9-223:33837] [15] /home/ubuntu/master/lib/libmxnet.so(+0x3e4d786)[0x7f7236b9a786]
> [ip-172-31-9-223:33837] [16] /lib/x86_64-linux-gnu/libc.so.6(+0x354b0)[0x7f724a0be4b0]
> [ip-172-31-9-223:33837] [17] /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(_ZN7horovod2MX13HandleManager15ExecuteCallbackEi+0x19)[0x7f7227ef7009]
> [ip-172-31-9-223:33837] [18] /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(+0x259fc)[0x7f7227edb9fc]
> [ip-172-31-9-223:33837] [19] /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(+0x30e6a)[0x7f7227ee6e6a]
> [ip-172-31-9-223:33837] [20] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7f7180a6bc80]
> [ip-172-31-9-223:33837] [21] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f724a45a6ba]
> [ip-172-31-9-223:33837] [22] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f724a19041d]
> [ip-172-31-9-223:33837] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 5 with PID 0 on node ip-172-31-9-223 exited on signal 6 (Aborted).
> --------------------------------------------------------------------------



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


