Recently, I moved my GPU to my home office, where the machine is suspended overnight rather
than being left on continuously. That exposed a problem with the Nvidia card :
it sometimes became unusable after the machine had been suspended.
Solution : Kill running GPU processes
Finally, I noticed that the problem didn’t occur if there was nothing at all running
on the card during the suspend/resume cycle.
Here’s the output of nvidia-smi if a process is running :
Here’s the output of nvidia-smi if no process is running :
The card isn’t doing any active computation. However, simply running
a Jupyter notebook that imports TensorFlow or PyTorch is enough
to create a process on the card, which causes the GPU to ‘lose connection’ after a
suspend/resume cycle.
So : Before suspending, stop not only the active GPU machine learning jobs, but
also anything else (like Jupyter) that may be keeping the card occupied.
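
The check described above can be scripted. What follows is only a rough sketch, not something from my original setup: it assumes nvidia-smi is on the PATH and that its --query-compute-apps option is available, and the function names (parse_compute_apps, list_gpu_processes, stop_gpu_processes) are made up for illustration. It only sees compute processes, so anything else holding the card would need a separate check.

```python
import os
import signal
import subprocess

# Assumption: nvidia-smi is on PATH and supports --query-compute-apps.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(csv_text):
    """Turn nvidia-smi CSV lines into (pid, name, memory) tuples."""
    procs = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        pid, memory = int(parts[0]), parts[-1]
        name = ",".join(parts[1:-1])  # tolerate commas in the process name
        procs.append((pid, name, memory))
    return procs


def list_gpu_processes():
    """Ask nvidia-smi which compute processes currently hold the card."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(out.stdout)


def stop_gpu_processes(procs):
    """Politely ask each listed process to exit (SIGTERM, not SIGKILL)."""
    for pid, _name, _memory in procs:
        os.kill(pid, signal.SIGTERM)
```

Calling list_gpu_processes() just before suspending should return an empty list if the card is really free. If it doesn't, the names show what to shut down (a Jupyter kernel, for instance), and stop_gpu_processes() can be passed the same list, although closing the notebook cleanly is probably the gentler option.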
Here’s the output of nvidia-smi if no process is running on a GTX 760 (look at the temperature, and the memory usage) :