NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver


I'm running an AWS EC2 g2.2xlarge instance with Ubuntu 14.04 LTS. I'd like to observe the GPU utilization while training my TensorFlow models, but I get an error when trying to run nvidia-smi.
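
(For reference, once nvidia-smi itself works, utilization during training can be observed by simply polling it:)

# refresh the nvidia-smi output every second (Ctrl-C to stop)
$ watch -n 1 nvidia-smi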

ubuntu@ip-10-0-1-213:/etc/alternatives$ cd /usr/lib/nvidia-375/bin
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ls
nvidia-bug-report.sh     nvidia-debugdump     nvidia-xconfig
nvidia-cuda-mps-control  nvidia-persistenced
nvidia-cuda-mps-server   nvidia-smi
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.


ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ dpkg -l | grep nvidia 
ii  nvidia-346                                            352.63-0ubuntu0.14.04.1                             amd64        Transitional package for nvidia-346
ii  nvidia-346-dev                                        346.46-0ubuntu1                                     amd64        NVIDIA binary Xorg driver development files
ii  nvidia-346-uvm                                        346.96-0ubuntu0.0.1                                 amd64        Transitional package for nvidia-346
ii  nvidia-352                                            375.26-0ubuntu1                                     amd64        Transitional package for nvidia-375
ii  nvidia-375                                            375.39-0ubuntu0.14.04.1                             amd64        NVIDIA binary driver - version 375.39
ii  nvidia-375-dev                                        375.39-0ubuntu0.14.04.1                             amd64        NVIDIA binary Xorg driver development files
ii  nvidia-modprobe                                       375.26-0ubuntu1                                     amd64        Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-icd-346                                 352.63-0ubuntu0.14.04.1                             amd64        Transitional package for nvidia-opencl-icd-352
ii  nvidia-opencl-icd-352                                 375.26-0ubuntu1                                     amd64        Transitional package for nvidia-opencl-icd-375
ii  nvidia-opencl-icd-375                                 375.39-0ubuntu0.14.04.1                             amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                                          0.6.2.1                                             amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                                       375.26-0ubuntu1                                     amd64        Tool for configuring the NVIDIA graphics driver
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ 

$ inxi -G
Graphics:  Card-1: Cirrus Logic GD 5446 
           Card-2: NVIDIA GK104GL [GRID K520] 
           X.org: 1.15.1 driver: N/A tty size: 80x24 Advanced Data: N/A out of X

$ lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
    Subsystem: XenSource, Inc. Device 0001
    Kernel driver in use: cirrus
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
    Subsystem: NVIDIA Corporation Device 1014
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)

I followed these instructions to install CUDA 7 and cuDNN:

$ sudo apt-get -q2 update
$ sudo apt-get upgrade
$ sudo reboot

=======================================================================

After the reboot, update the initramfs by running '$ sudo update-initramfs -u'.

Now, please edit the /etc/modprobe.d/blacklist.conf file to blacklist nouveau. Open the file in an editor and insert the following lines at the end of the file.

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

Save and exit from the file.
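
(If you prefer a non-interactive approach, the same lines can be appended without opening an editor; a sketch using tee:)

$ sudo tee -a /etc/modprobe.d/blacklist.conf > /dev/null <<'EOF'
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
EOF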

Now install the kernel headers and build-essential tools, update the initramfs, and reboot again:

$ sudo apt-get install linux-{headers,image,image-extra}-$(uname -r) build-essential
$ sudo update-initramfs -u
$ sudo reboot

========================================================================

After the reboot, run the following commands to install the NVIDIA driver and CUDA.

$ sudo wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
$ sudo chmod 700 ./cuda_7.0.28_linux.run
$ sudo ./cuda_7.0.28_linux.run
$ sudo update-initramfs -u
$ sudo reboot

========================================================================

Now that the system has come up, verify the installation by running the following.

$ sudo modprobe nvidia
$ sudo nvidia-smi -q | head

You should see output similar to the 'nvidia.png' screenshot.
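
(If nvidia-smi still fails at this point, it is worth confirming that the kernel module actually loaded; the /proc file below only exists once it has:)

# check whether the nvidia kernel module is loaded, and which driver version
$ lsmod | grep nvidia
$ cat /proc/driver/nvidia/version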

Now run the following commands:

$ cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery

However, nvidia-smi still doesn't show GPU activity while TensorFlow is training models:

ubuntu@ip-10-0-1-48:~$ ipython
Python 2.7.11 |Anaconda custom (64-bit)| (default, Dec  6 2015, 18:08:32) 
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import tensorflow as tf 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally



ubuntu@ip-10-0-1-48:~$ nvidia-smi
Thu Mar 30 05:45:26 2017       
+------------------------------------------------------+                       
| NVIDIA-SMI 346.46     Driver Version: 346.46         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   35C    P0    38W / 125W |     10MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
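
(For reference, one way to check whether TensorFlow itself can see the GPU; a sketch assuming a TF 1.x-era install, where tensorflow.python.client.device_lib is available:)

# list the devices TensorFlow has registered; a working GPU shows up as /gpu:0
$ python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())'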

I solved "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" on my ASUS laptop with GTX 950m and Ubuntu 18.04 by disabling Secure Boot Control from BIOS.

"NVIDIA-SMI has failed because it couldn't communicate with the , When running nvidia-smi, I get nvidia-smi NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the  nvidia-smi NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running. It looks like the nvidia 375 version is instaled but I can't make it works. whereis nvidia nvidia: /usr/lib/nvidia /usr/share/nvidia /usr/src/nvidia-375-375.66/nvidia

I was getting the same error on my Ubuntu 16.04 (Linux 4.14 kernel) instance in Google Compute Engine with a K80 GPU. I upgraded the kernel from 4.14 to 4.15, and boom, the problem was solved. Here is how I upgraded my Linux kernel from 4.14 to 4.15:

Step 1:
Check the existing kernel of your Ubuntu Linux:

uname -a

Step 2:

Ubuntu maintains a website for all the kernel versions that have been released. At the time of this writing, the latest stable release of the Ubuntu kernel is 4.15. If you go to this link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will see several links for download.

Step 3:

Download the appropriate files based on the type of OS you have. For 64 bit, I would download the following deb files:

wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb

Step 4:

Install all the downloaded deb files:

sudo dpkg -i *.deb

Step 5:

Reboot your machine and check if the kernel has been updated:

uname -a

You should see that your kernel has been upgraded, and hopefully nvidia-smi will work.
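
(Since the NVIDIA module is built per-kernel, it is also worth checking that it was rebuilt for the new kernel; a sketch assuming the driver was installed through DKMS-based packages:)

# lists each DKMS module and the kernel versions it has been built for
dkms status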

If your nvidia-smi fails to communicate but you've installed the driver many times, check prime-select: run prime-select query to see which GPU profile is selected. Alternatively, this issue is often fixed by reinstalling the drivers and rebooting:

sudo apt-get update
sudo apt-get install --no-install-recommends nvidia-384 libcuda1-384 nvidia-opencl-icd-384
sudo reboot
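
(If prime-select query reports 'intel', switching back to the NVIDIA profile may be all that's needed; a sketch, assuming Ubuntu's nvidia-prime tools are installed:)

# select the NVIDIA GPU profile, then reboot for it to take effect
sudo prime-select nvidia
sudo reboot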

I am working with an AWS Deep Learning AMI P2 instance, and suddenly I found that the nvidia driver commands stopped working and that no GPU was found by the torch or tensorflow libraries. I resolved the problem in the following way:

Run nvcc --version. If it doesn't work, then run the following:

sudo apt install nvidia-cuda-toolkit

Hopefully that will solve the problem.
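
(To verify, check that nvcc now runs and that the driver has created its device nodes; the /dev/nvidia* nodes only appear once the kernel module is loaded:)

# both should succeed on a working setup
nvcc --version
ls /dev/nvidia*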


Run the following to find the right NVIDIA driver:

sudo ubuntu-drivers devices

Then pick the right driver from the list and install it (the package name below is a placeholder for whichever one is reported as recommended):

sudo apt install <recommended-driver-package>
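
(If you'd rather not pick a package manually, ubuntu-drivers can install the recommended one for you; a sketch, assuming Ubuntu 18.04 or newer:)

# installs whichever driver 'ubuntu-drivers devices' marks as recommended
sudo ubuntu-drivers autoinstall
sudo reboot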


I just want to thank @Heapify for providing a practical answer, and to update it because the attached links are no longer up to date.

Step 1: Check the existing kernel of your Ubuntu Linux:

uname -a

Step 2:

Ubuntu maintains a website for all the versions of kernel that have been released. At the time of this writing, the latest stable release of Ubuntu kernel is 4.15. If you go to this link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will see several links for download.

Step 3:

Download the appropriate files based on the type of OS you have. For 64 bit, I would download the following deb files:

// UP-TO-DATE 2019-03-18
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb

Step 4:

Install all the downloaded deb files:

sudo dpkg -i *.deb

Step 5:

Reboot your machine and check if the kernel has been updated by:

uname -a
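
(If nvidia-smi still fails after the reboot, the module may not have been rebuilt for the new kernel. With a DKMS-packaged driver, a sketch for rebuilding and loading it is:)

# rebuild all DKMS modules for the running kernel, then load the driver
sudo dkms autoinstall
sudo modprobe nvidia
nvidia-smi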



I just recently had the same issue. To make it work, I deleted all nvidia packages via Synaptic (it may be enough to sudo apt purge them) and then reinstalled the driver.


Comments
  • Works for me. Now I can use CUDA again.
  • Worked on Dell Inspiron 7460 with 940MX. Thanks a lot!
  • This worked for me. FYI ASUS laptop with 940m
  • Disabling Secure Boot worked for Acer Aspire VN7 with Geforce-950M.
  • I don't think it's a good idea to disable Secure Boot. You can enroll a MOK (Machine Owner Key) instead (see the sketch after these comments); then you do not need to disable Secure Boot.
  • This worked for me and the nvidia-390 driver, it modprobes now, I updated from kernel 4.4.0 to 4.15.0.
  • After wasting 4+ hours, this one solved my problem. Thank you.
  • Did you upgrade to 4.14 (as stated in the text) or to 4.15 (as shown in the code)? I'm running 4.15 and have the same issue
  • Sorry for the confusion, looks like there was a typo. I just edited it.
  • This works for me. In my case, a reboot is needed in order for nvidia-smi works again.
  • How could this possibly help?
  • How do you do this on an AWS instance ?
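
(As a follow-up to the MOK comment above: on Ubuntu, instead of disabling Secure Boot you can enroll the key used to sign DKMS-built modules. A sketch, assuming Ubuntu's shim-signed layout; the .der path may differ on other setups:)

# register the machine owner key with the firmware (path is Ubuntu's default)
sudo mokutil --import /var/lib/shim-signed/mok/MOK.der
sudo reboot
# then enroll the key in the MOK Manager screen that appears during boot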