Debugging tips

RuntimeError: CUDA error: device-side assert triggered

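CUDA kernels launch asynchronously, so the stack trace for a CUDA error often does not point to the line that actually failed. Set CUDA_LAUNCH_BLOCKING=1 to make kernel launches synchronous, so the error is reported where it occurs.

ex. CUDA_LAUNCH_BLOCKING=1 python train.py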
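If the training run is started from a Python launcher rather than a shell, the same variable can be passed through the child's environment. The following is a minimal sketch; run_with_blocking_launches is a hypothetical helper name, and the one-line child command is only a stand-in for train.py so that the example runs anywhere.

```python
import os
import subprocess
import sys


def run_with_blocking_launches(cmd):
    """Run cmd with CUDA_LAUNCH_BLOCKING=1 added to a copy of the current environment."""
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    # capture_output collects stdout/stderr instead of streaming them live.
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Stand-in for `python train.py`: a child process that only prints the variable.
result = run_with_blocking_launches(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"]
)
print(result.stdout.strip())
```

For a real run, the command would be something like [sys.executable, "train.py"]. Synchronous launches slow training down, so the flag is best left off once the failing line has been found.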

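Freeing up disk space before running a new model

1) Check the current disk usage with df -h.

2) Use du -h to find the files that take up the most space.

3) Delete the files found in step 2 with rm -rf.

4) Repeat until enough space has been freed.

* To find files above a certain size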

find /opt/ml -size +1G -exec ls -l {} \;
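The check-and-search steps above can also be sketched in Python using only the standard library. free_gib and files_larger_than are hypothetical helper names; the second one roughly corresponds to the find command above, and nothing here deletes anything.

```python
import os
import shutil


def free_gib(path="/"):
    """Free space on the filesystem containing path, in GiB (similar to df -h)."""
    return shutil.disk_usage(path).free / 2**30


def files_larger_than(root, min_bytes):
    """Return (size, path) pairs for regular files under root above min_bytes, largest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # skip symlinks so targets are not counted twice
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable
            if size > min_bytes:
                found.append((size, path))
    return sorted(found, reverse=True)


if __name__ == "__main__":
    print(f"free: {free_gib('/'):.1f} GiB")
    # /opt/ml is the directory used in the find example; 2**30 bytes is 1 GiB.
    for size, path in files_larger_than("/opt/ml", 2**30):
        print(f"{size / 2**30:.1f} GiB  {path}")
```

Listing candidates first and deleting by hand keeps the rm -rf step deliberate.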

CUDA OOM

RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 31.75 GiB total capacity; 29.47 GiB already allocated; 207.50 MiB free; 30.06 GiB reserved in total by PyTorch)

Lower the batch size as far as possible (down to 1), and try fp16 if the code supports it.
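The "keep lowering the batch size" part can be automated with a small retry loop. This is a sketch under assumptions: largest_batch_size_that_fits and fake_step are hypothetical names, and fake_step merely imitates a training step that raises the RuntimeError shown above, so the example runs without a GPU.

```python
def largest_batch_size_that_fits(run_step, start_batch_size):
    """Halve the batch size until run_step(batch_size) stops raising an OOM error."""
    batch_size = start_batch_size
    while True:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # PyTorch reports CUDA OOM as a RuntimeError whose message contains this text.
            if "out of memory" not in str(err):
                raise
            if batch_size == 1:
                raise  # even a single sample does not fit
            batch_size //= 2


def fake_step(batch_size):
    # Hypothetical stand-in for one forward/backward pass: pretend only 4 samples fit.
    if batch_size > 4:
        raise RuntimeError("CUDA out of memory. Tried to allocate 256.00 MiB")


print(largest_batch_size_that_fits(fake_step, 32))  # 32 -> 16 -> 8 -> 4, prints 4
```

In a real training script, run_step would execute one forward and backward pass on a batch of the given size.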

RuntimeError: "ms_deform_attn_forward_cuda" not implemented for 'Half'

'Half' is PyTorch's name for float16, so this message means the function has no float16 implementation.

Convert the input to float32.

ex. value = value.type(torch.float32)

Three ways to put a tensor on the GPU instead of the CPU, noted here so I don't forget

tensor.cuda()

torch.tensor(data, device='cuda')

tensor.to('cuda')

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method.

--> Add torch.multiprocessing.set_start_method('spawn') before any worker processes are created.
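Where the call goes matters: it should run once, early in the entry point. torch.multiprocessing is a wrapper around Python's standard multiprocessing module, so the placement can be sketched with the standard library alone. use_spawn is a hypothetical helper name, and force=True is an assumption made here so the call does not fail if a start method was already selected.

```python
import multiprocessing as mp


def use_spawn():
    """Select the 'spawn' start method and return the method now in effect."""
    # Without force=True, a second call raises RuntimeError because the context is already set.
    mp.set_start_method("spawn", force=True)
    return mp.get_start_method()


if __name__ == "__main__":
    # Call this before creating any worker processes.
    print(use_spawn())
```

With 'spawn', each child starts a fresh interpreter and re-imports the main module, which is why the call sits under the if __name__ == "__main__": guard.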

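Not a debugging tip, but for when you want a program to keep running even after you turn off your computer:

Use nohup, with a trailing & so the job runs in the background.

ex. nohup sh train.sh &

To stop it partway through, find the PID of the command with ps -ef, then run

kill -9 PID.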
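A rough Python analogue of the same start-then-kill workflow is sketched below, for POSIX systems only. start_detached and stop are hypothetical helper names. Instead of nohup, the child is started in its own session so it is detached from the terminal, and a sleeping child process stands in for train.sh.

```python
import os
import signal
import subprocess
import sys


def start_detached(cmd, log_path):
    """Start cmd in its own session, with output appended to log_path."""
    with open(log_path, "ab") as log:
        # The child keeps its own copy of the log file descriptor after we close ours.
        return subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,
        )


def stop(pid, force=False):
    """Send SIGTERM, or SIGKILL (the signal behind kill -9) when force is True."""
    os.kill(pid, signal.SIGKILL if force else signal.SIGTERM)


# Stand-in for train.sh: a child process that just sleeps.
proc = start_detached([sys.executable, "-c", "import time; time.sleep(60)"], os.devnull)
stop(proc.pid, force=True)
exit_code = proc.wait(timeout=10)
print(exit_code)  # a negative value is the number of the signal that ended the process
```

Trying a plain SIGTERM first gives the program a chance to shut down cleanly; SIGKILL cannot be caught and ends it immediately.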


한별쓰로그
