Case Report: cudaGetDeviceCount()

Report on CUDA Driver Error and Resolution

Overview

While using the research lab’s server for model training, I ran into an issue that other users of the machine may also encounter. This report documents the error and its resolution to help anyone who faces the same problem.

Error Details

During model training, the following error message was encountered, preventing the use of the GPU:

RuntimeError: Unexpected error from cudaGetDeviceCount().
Did you run some cuda functions before calling NumCudaDevices() that might have already set an error?
Error 804: forward compatibility was attempted on non supported HW
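
In my case the error was raised by PyTorch while it was counting CUDA devices. A minimal way to re-run that check outside the training script, assuming PyTorch is installed in the environment that python3 points to, is:

     python3 -c "import torch; print(torch.cuda.device_count())"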

Additionally, when attempting to check the GPU status using the nvidia-smi command, the following error was displayed:

Failed to initialize NVML: Driver/library version mismatch
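
This message usually indicates that the NVIDIA kernel module loaded in memory no longer matches the driver files installed on disk. A quick, rough check, assuming the module is still loaded, is to compare the two versions:

     cat /proc/driver/nvidia/version     # version of the kernel module currently loaded
     modinfo nvidia | grep ^version      # version of the module file installed on disk

If they differ, the driver was updated on disk while the old module stayed loaded.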

Cause

The symptoms pointed to a conflict between the CUDA environment, the kernel, and the NVIDIA driver. However, since the GPU had been working normally only moments before the error occurred, a genuine version incompatibility between the installed driver and CUDA could be ruled out.

Further research showed that such errors can occur even when the installed versions are compatible: if the kernel or the NVIDIA driver is updated while the old kernel module remains loaded, the loaded module and the user-space libraries drift out of sync, and CUDA calls start failing in exactly this way.
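
One way to look for evidence of such a conflict is the kernel log, where the NVIDIA driver reports problems (API mismatches, Xid errors, and so on) with an NVRM prefix:

     sudo dmesg | grep -iE 'nvrm|xid'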

Solution

The easiest solution in such cases is a system reboot. However, since the server is remotely located in Gunsan, a reboot posed a risk of the system not restarting correctly. Instead, I resolved the issue by unloading and reloading the NVIDIA driver modules.

Steps to Resolve

  1. Terminate Processes Using NVIDIA Driver Modules
    Use the following command to identify processes using the NVIDIA driver:
     sudo lsof /dev/nvidia*

    In my case, the display server was using the driver. To stop the display server, I executed:

     sudo systemctl stop gdm
     sudo systemctl stop lightdm

    Adapt this step based on the processes identified on your system; a set of quick checks to verify each stage is given after these steps.

  2. Unload NVIDIA Driver Modules
    Run the following commands in sequence to unload the NVIDIA driver modules. The order matters: nvidia_uvm, nvidia_drm, and nvidia_modeset depend on the core nvidia module, so they must be removed before it:
     sudo rmmod nvidia_uvm
     sudo rmmod nvidia_drm
     sudo rmmod nvidia_modeset
     sudo rmmod nvidia
  3. Reload NVIDIA Driver Modules
    Reload the modules in the reverse order using the following commands:
     sudo modprobe nvidia
     sudo modprobe nvidia_modeset
     sudo modprobe nvidia_drm
     sudo modprobe nvidia_uvm
  4. Restart the Display Server
    Finally, restart the display server using:
     sudo systemctl start gdm
     sudo systemctl start lightdm
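
As noted in step 1, a few generic checks (not part of the original procedure) can confirm that each stage worked before moving on:

     # Before step 2: confirm nothing is still holding the GPU device nodes
     sudo lsof /dev/nvidia*      # should produce no output

     # After step 2: confirm the modules are unloaded
     lsmod | grep nvidia         # should produce no output

     # After step 3: confirm the modules are back and the driver responds
     lsmod | grep nvidia         # should list nvidia, nvidia_modeset, nvidia_drm, nvidia_uvm
     nvidia-smi                  # should show the GPUs without the NVML error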

Outcome

Following these steps, the error was resolved, and I was able to successfully run the training code again.

Conclusion

This report is intended to help anyone encountering similar GPU-related errors. Should you face this issue, following the steps outlined above should restore GPU functionality without the need for a full reboot or a driver reinstallation.