InstanceNorm v0.5: cudaErrorMisalignedAddress #246

Closed
zgojcic opened this issue Oct 19, 2020 · 9 comments
Labels
bug Something isn't working v0.5

Comments

@zgojcic

zgojcic commented Oct 19, 2020

Hi Chris,

I have tried updating ME to 0.5 and I have some problems when using the InstanceNorm layer (see the error below). If I replace InstanceNorm with BatchNorm, it works without a problem. The same model with InstanceNorm works normally when using ME v0.4.3.

The error only occurs during training; if the model is put in eval mode with torch.no_grad(), it works without a problem and returns the same results as v0.4.3. Using v0.5 makes inference 2 times faster 👍

RuntimeError:  misaligned address at /home/zgojcic/Documents/holistic_scene_flow/class_net_me_05/MinkowskiEngine/src/pooling_avg_kernel.cu:299
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  CUDA free failed: cudaErrorMisalignedAddress: misaligned address

Thanks for your help.

Best
Zan

@zgojcic
Author

zgojcic commented Oct 19, 2020

To provide a bit more information:

Python: 3.7.9
Cuda: 10.1.243
torch: 1.6.0

The normalization unit test runs through normally (with different batch sizes and feature dimensions, and even when adding a conv layer before normalization).

It is actually not eval mode that matters but the batch size: with batch size 1 it runs fine, and with a batch size larger than 1 it crashes. Reducing the number of points to a very small number does not help; it still fails. On CPU it works with all batch sizes.

With different datasets it sometimes crashes at the first IN layer and sometimes at the second (the architecture is similar to the FCGF encoder). Changing the kernel size of the conv layer before normalization to 1 does not help.

I have tried both ME.utils.batched_coordinates() and ME.utils.sparse_collate() to generate the batch coordinates and features.

All features are float32 data type.
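
To make the batch-size dependence concrete, here is a minimal plain-Python sketch of what batching and instance normalization do conceptually. It is not MinkowskiEngine's implementation: `batch_coordinates` and `instance_norm` are hypothetical helper names, and the sketch assumes the convention in which the batch index is stored as the first coordinate column. With batch size 1 there is a single group to reduce over; with a larger batch there is one reduction per batch index.

```python
def batch_coordinates(coords_per_sample):
    """Prepend each sample's index to its coordinates (first-column batch index)."""
    return [
        (batch_idx, *coord)
        for batch_idx, coords in enumerate(coords_per_sample)
        for coord in coords
    ]


def instance_norm(batched_coords, feats, eps=1e-5):
    """Normalize every channel separately within each batch index."""
    # Group row indices by the batch index stored in the first column.
    rows_by_batch = {}
    for row, coord in enumerate(batched_coords):
        rows_by_batch.setdefault(coord[0], []).append(row)

    out = [list(f) for f in feats]
    for rows in rows_by_batch.values():
        for c in range(len(feats[0])):
            vals = [feats[r][c] for r in rows]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)  # biased variance
            for r in rows:
                out[r][c] = (feats[r][c] - mean) / (var + eps) ** 0.5
    return out


# Two samples with 2 and 3 points, one feature channel each.
coords = batch_coordinates(
    [[(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (0, 1, 0), (2, 2, 2)]]
)
feats = [[1.0], [3.0], [10.0], [20.0], [30.0]]
normed = instance_norm(coords, feats)
```

Each sample ends up with zero mean per channel regardless of the other samples in the batch, which is the per-instance reduction that only has more than one group when the batch size exceeds 1.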

Thanks again for your help

Best
Zan

@chrischoy chrischoy added bug Something isn't working v0.5 labels Oct 30, 2020
@chrischoy
Contributor

Hmm, I'm having difficulty replicating this error on CUDA 10.1.243, 10.2, and 11.0. Could you post a short script that replicates this error?

@chrischoy
Contributor

Temporarily disabled the global pooling kernel in 010a39c. To use the native CUDA calls, use MinkowskiGlobalAvgPooling(mode=ME.PoolingMode.GLOBAL_AVG_POOLING_KERNEL).

@lshiwjx

lshiwjx commented Dec 6, 2020

Thanks.

However, after switching to the native CUDA call, it throws a new error.

RuntimeError:  misaligned address at /tmp/pip-req-build-b47llihh/src/pooling_avg_kernel.cu:299
terminate called after throwing an instance of 'thrust::system::system_error'
 what():  CUDA free failed: cudaErrorMisalignedAddress: misaligned address

@chrischoy
Contributor

This is the same error. Could you post a minimal reproducible example?

@Andy97

Andy97 commented Jan 26, 2021

Replace all ME.MinkowskiBatchNorm with ME.MinkowskiInstanceNorm in examples\resnet.py
and change the function at line 86 from

    def weight_initialization(self):
        for m in self.modules():
            if isinstance(m, ME.MinkowskiConvolution):
                ME.utils.kaiming_normal_(m.kernel, mode="fan_out", nonlinearity="relu")

            if isinstance(m, ME.MinkowskiBatchNorm):
                nn.init.constant_(m.bn.weight, 1)
                nn.init.constant_(m.bn.bias, 0)

to

    def weight_initialization(self):
        for m in self.modules():
            if isinstance(m, ME.MinkowskiConvolution):
                ME.utils.kaiming_normal_(m.kernel, mode="fan_out", nonlinearity="relu")

            if isinstance(m, ME.MinkowskiInstanceNorm):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

This reproduces the error.
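
For anyone who wants to apply the two edits above mechanically, a minimal sketch follows. The helper name `swap_batchnorm_for_instancenorm` is hypothetical, and the sketch assumes plain text substitution is enough, i.e. that the identifiers appear in the file exactly as written in this comment.

```python
def swap_batchnorm_for_instancenorm(source: str) -> str:
    """Apply the two textual edits described above to Python source text."""
    # Edit 1: swap the normalization layer class everywhere it appears.
    source = source.replace("ME.MinkowskiBatchNorm", "ME.MinkowskiInstanceNorm")
    # Edit 2: the instance norm parameters are accessed directly on the module,
    # not through a wrapped `.bn` attribute.
    source = source.replace("m.bn.weight", "m.weight")
    source = source.replace("m.bn.bias", "m.bias")
    return source


snippet = (
    "if isinstance(m, ME.MinkowskiBatchNorm):\n"
    "    nn.init.constant_(m.bn.weight, 1)\n"
    "    nn.init.constant_(m.bn.bias, 0)\n"
)
patched = swap_batchnorm_for_instancenorm(snippet)
```

To patch the example file itself, read its text, pass it through the function, and write the result to a copy so that the original stays available for comparison.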

@chrischoy
Contributor

chrischoy commented Jan 31, 2021

No, it doesn't reproduce the error on the following setup:

==========System==========
Linux-5.4.0-65-generic-x86_64-with-glibc2.10
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS"
3.8.5 (default, Sep  4 2020, 07:30:14) 
[GCC 7.3.0]
==========Pytorch==========
1.7.0
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 450.102.04
CUDA Version 11.0
VBIOS Version 90.02.2E.00.0C
Image Version G001.0000.02.04
==========NVCC==========
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
==========CC==========
/usr/bin/c++
c++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

==========MinkowskiEngine==========
0.5.0
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11010
CUDART version MinkowskiEngine is compiled: 11010

@Andy97 could you post the output of

import MinkowskiEngine as ME
ME.print_diagnostics()

@chrockey
Contributor

chrockey commented Feb 1, 2021

I have the same error in the following environments in a Docker container.

==========System==========
Linux-5.4.0-53-generic-x86_64-with-debian-buster-sid
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.5 LTS"
3.7.9 (default, Aug 31 2020, 12:42:55) 
[GCC 7.3.0]
==========Pytorch==========
1.7.1
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 450.102.04
CUDA Version 11.0
VBIOS Version 90.02.42.00.0F
Image Version G001.0000.02.04
==========NVCC==========
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
==========CC==========
/usr/bin/c++
c++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

==========MinkowskiEngine==========
0.5.0
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 10020
CUDART version MinkowskiEngine is compiled: 10020

and

==========System==========
Linux-5.4.0-53-generic-x86_64-with-glibc2.10
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS"
3.8.5 (default, Sep  4 2020, 07:30:14)
[GCC 7.3.0]
==========Pytorch==========
1.7.1
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 450.102.04
CUDA Version 11.0
VBIOS Version 90.02.42.00.0F
Image Version G001.0000.02.04
==========NVCC==========
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
==========CC==========
/usr/bin/c++
c++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

==========MinkowskiEngine==========
0.5.0
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11010
CUDART version MinkowskiEngine is compiled: 11010

@zgojcic
Author

zgojcic commented Feb 1, 2021

Hi Chris, sorry, I had somewhat forgotten about this issue. Below is the info from one computer; I observe the same on at least one more PC (if needed, I can also get the diagnostics from that one).

==========System==========
Linux-4.15.0-132-generic-x86_64-with-debian-stretch-sid
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.6 LTS"
3.6.12 |Anaconda, Inc.| (default, Sep 8 2020, 23:10:56)
[GCC 7.3.0]
==========Pytorch==========
1.7.1
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 418.87.00
CUDA Version 10.1
VBIOS Version 86.04.17.00.01
Image Version G001.0000.01.03
==========NVCC==========
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
==========CC==========
CC=g++-7
/usr/bin/g++-7
g++-7 (Ubuntu 7.5.0-3ubuntu1~16.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

==========MinkowskiEngine==========
0.5.0
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 10010
CUDART version MinkowskiEngine is compiled: 10010
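
The "NVCC version" and "CUDART version" lines in the diagnostics above are integers in CUDA's usual encoding, 1000 * major + 10 * minor. A small sketch for reading them side by side is below; `decode_cuda_version` is a hypothetical helper, and the dictionary simply copies the values posted in this thread.

```python
def decode_cuda_version(encoded: int) -> str:
    """Decode CUDA's integer version encoding (1000 * major + 10 * minor)."""
    major, rest = divmod(encoded, 1000)
    minor = rest // 10
    return f"{major}.{minor}"


# Compiled-with versions as reported in the diagnostics posted above.
reported = {
    "chrischoy (Ubuntu 20.04)": 11010,
    "chrockey (Ubuntu 18.04)": 10020,
    "chrockey (Ubuntu 20.04)": 11010,
    "zgojcic (Ubuntu 16.04)": 10010,
}
decoded = {who: decode_cuda_version(v) for who, v in reported.items()}
```

Read this way, the builds span CUDA 10.1, 10.2, and 11.1. Since one 11.1 build does not reproduce the error while another one does, the compiled CUDA version alone does not seem to separate the failing setups from the working one.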

chrischoy added a commit that referenced this issue Feb 4, 2021
Tanazzah pushed a commit to Tanazzah/MinkowskiEngine that referenced this issue Feb 9, 2024
Tanazzah pushed a commit to Tanazzah/MinkowskiEngine that referenced this issue Feb 9, 2024

5 participants