Nvidia CUDA (10.0) for TensorFlow & PyTorch on Fedora 30

Fedora 30 now has Python 3.7

So let's rebuild our virtualenv :

sudo dnf install python3-virtualenv

virtualenv --system-site-packages -p python3.7 env37

. env37/bin/activate
pip3 install https://download.pytorch.org/whl/cu100/torch-1.1.0-cp37-cp37m-linux_x86_64.whl
pip3 install https://download.pytorch.org/whl/cu100/torchvision-0.3.0-cp37-cp37m-linux_x86_64.whl

pip install tf-nightly-gpu-2.0-preview  # 8hrs old...
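The PyTorch wheel URLs above are pinned to one interpreter version through the `cp37` tag in the filename, so they only install cleanly into a Python 3.7 virtualenv. Here is a minimal sketch of checking that before running pip. The helper names `wheel_python_tag` and `matches_interpreter` are made up for illustration, and the check only looks at the CPython tag, not the ABI or platform tags.

```python
import sys


def wheel_python_tag(wheel_url):
    """Return the Python tag (e.g. 'cp37') from a wheel filename or URL."""
    filename = wheel_url.rsplit("/", 1)[-1]
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel: " + filename)
    # Wheel names end with: ...-{python tag}-{abi tag}-{platform tag}.whl
    parts = filename[:-len(".whl")].split("-")
    return parts[-3]


def matches_interpreter(wheel_url, version_info=sys.version_info):
    """True if the wheel's CPython tag matches the given interpreter version."""
    expected = "cp{}{}".format(version_info[0], version_info[1])
    return wheel_python_tag(wheel_url) == expected


url = "https://download.pytorch.org/whl/cu100/torch-1.1.0-cp37-cp37m-linux_x86_64.whl"
print(wheel_python_tag(url))      # the tag embedded in the filename
print(matches_interpreter(url))   # True only when run under Python 3.7
```

Run it with the virtualenv's own python, so that `sys.version_info` describes the interpreter pip will install into.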

BUT ... TF hasn't got to cuda 10.1 yet

Unfortunately, once again, TensorFlow goes against the grain: it is still built against cuda 10.0, whereas the negativo repo wants to keep me up-to-date (latest cuda is v10.1).

So: Reinstall cuda and cudnn (but leave negativo in charge of the Nvidia driver, since that process, including kernel recompilation, etc, works well).

First, get rid of the existing cuda files (all the below should be done as root) :

dnf remove cuda

# Which should remove the following :

#/usr/lib64/libcuda.so.1
#/usr/lib64/libcuda.so.390.12
#/usr/lib64/pkgconfig/cuda.pc
#/usr/lib64/pkgconfig/cudart.pc
#/usr/lib64/libcudart.so.10.1.168
#/usr/lib64/libicudata.so.63  # (ICU's data library, not cuda : it only matches the substring 'cuda', and should stay installed)

#/usr/include/cuda/cuda.h
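To confirm the old files really are gone, something like the following could be used. It is only a sketch: the function name `find_cuda_leftovers` and the filename patterns are my own choices, picked so that lookalikes such as libicudata.so.63 (which merely contains the letters 'cuda') are not reported.

```python
import fnmatch
import os


def find_cuda_leftovers(root, patterns=("libcuda*", "libcudnn*", "cuda*.pc")):
    """Walk `root` and return files whose names look like cuda/cudnn artefacts.

    The patterns are anchored at the start of the filename, so a plain
    substring match on 'cuda' (which would hit libicudata) is avoided.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```

Calling `find_cuda_leftovers('/usr/lib64')` after the `dnf remove` should ideally come back nearly empty. Note that libcuda.so.* normally ships with the driver rather than the toolkit, so a hit for it is not necessarily a leftover.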

Download the local installer, and use it:

wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux

sh cuda_10.0.130_410.48_linux --override  # Ignore gcc version complaints

Which (following the prompts) gives:

Do you accept the previously read EULA?
accept/decline/quit: accept

You are attempting to install on an unsupported configuration. Do you wish to continue?
(y)es/(n)o [ default is no ]: yes

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?
(y)es/(n)o/(q)uit: no

Install the CUDA 10.0 Toolkit?
(y)es/(n)o/(q)uit: yes

Enter Toolkit Location
 [ default is /usr/local/cuda-10.0 ]:

Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: yes

Install the CUDA 10.0 Samples?
(y)es/(n)o/(q)uit: no

Installing the CUDA Toolkit in /usr/local/cuda-10.0 ...

===========
= Summary =
===========

Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-10.0
Samples:  Not Selected

Please make sure that
 -   PATH includes /usr/local/cuda-10.0/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.0/doc/pdf for detailed information on setting up CUDA.

***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 384.00 is required for CUDA 10.0 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
    sudo <CudaInstaller>.run -silent -driver

Logfile is /tmp/cuda_install_6715.log

#/usr/local/cuda-10.0/lib64/libcudart.so
#/usr/local/cuda-10.0/lib64/libcudart.so.10.0
#/usr/local/cuda-10.0/lib64/libcudart_static.a
#/usr/local/cuda-10.0/lib64/libcudart.so.10.0.130
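Since the driver stays under negativo's control, it is worth checking that the installed driver is recent enough for this toolkit. Here is a rough sketch, assuming the driver reports itself in /proc/driver/nvidia/version with a 'Kernel Module  <version>' field. The default minimum of 410.48 is an assumption taken from the driver version bundled with this installer; the installer's own warning quotes 384.00, so pass whichever threshold you trust. The function names are hypothetical.

```python
import re


def parse_driver_version(text):
    """Extract the driver version as a tuple of ints, or None if not found."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def driver_is_new_enough(text, minimum=(410, 48)):
    """Compare the parsed version against `minimum` (an assumed threshold)."""
    version = parse_driver_version(text)
    return version is not None and version >= minimum


def read_driver_version_file(path="/proc/driver/nvidia/version"):
    """Return the file's contents, or an empty string if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""
```

Usage would be along the lines of `driver_is_new_enough(read_driver_version_file())`, which returns False both when the driver is too old and when no driver information can be found.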

Also obtain cudnn (which doesn't work via direct download, since Nvidia requires a login...) :

# After downloading in a browser as a regular user :
mv /home/username/Downloads/cudnn-10.0-linux-x64-v7.6.1.34.tgz .
tar -xzf cudnn-10.0-linux-x64-v7.6.1.34.tgz

# And move over the files parallel to the libcudart.so ones above :
cp cuda/include/cudnn.h /usr/local/cuda-10.0/include/

cp cuda/lib64/libcudnn.so.7.6.1 /usr/local/cuda-10.0/lib64/
# Relative link targets, so that the links match the listing below :
ln -s libcudnn.so.7.6.1 /usr/local/cuda-10.0/lib64/libcudnn.so.7
ln -s libcudnn.so.7 /usr/local/cuda-10.0/lib64/libcudnn.so

cp cuda/lib64/libcudnn_static.a /usr/local/cuda-10.0/lib64/

# So that :
ls -l /usr/local/cuda-10.0/lib64/ | grep cudnn
# Looks like :
#lrwxrwxrwx. 1 root root        13 Jun 20 06:59 libcudnn.so -> libcudnn.so.7
#lrwxrwxrwx. 1 root root        17 Jun 20 06:59 libcudnn.so.7 -> libcudnn.so.7.6.1
#-rwxrwxr-x. 1 root root 390137496 Jun 20 06:04 libcudnn.so.7.6.1
#-rw-rw-r--. 1 root root 389213742 Jun 20 06:04 libcudnn_static.a
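The same check can be done programmatically, which is handy if a link was mistyped and dangles. This is a small sketch (the names `symlink_chain` and `chain_is_intact` are mine) that follows the links one hop at a time and confirms that the chain ends at a real file.

```python
import os


def symlink_chain(path, max_hops=10):
    """Follow symlinks one hop at a time and return every path visited."""
    chain = [path]
    while os.path.islink(path) and len(chain) <= max_hops:
        target = os.readlink(path)
        # Relative targets are resolved against the link's own directory;
        # os.path.join discards the directory if the target is absolute.
        path = os.path.normpath(os.path.join(os.path.dirname(path), target))
        chain.append(path)
    return chain


def chain_is_intact(path):
    """True if following the links from `path` ends at a regular file."""
    return os.path.isfile(symlink_chain(path)[-1])
```

For the layout above, `symlink_chain('/usr/local/cuda-10.0/lib64/libcudnn.so')` would be expected to list libcudnn.so, then libcudnn.so.7, then libcudnn.so.7.6.1.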

Finally, let the machine know that there's a new cuda in town :

echo '/usr/local/cuda-10.0/lib64' >> /etc/ld.so.conf
ldconfig
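To see whether the loader cache actually picked up the new libraries, the output of `ldconfig -p` can be filtered. Here is a sketch that parses that output; the function names are made up, and the parsing assumes the usual 'name (flags) => path' line format.

```python
import subprocess


def parse_ldconfig(output):
    """Map library names to paths from `ldconfig -p` style output."""
    libs = {}
    for line in output.splitlines():
        if "=>" not in line:
            continue  # skips the 'N libs found in cache' header line
        left, path = line.split("=>", 1)
        fields = left.split()
        if not fields:
            continue
        # Keep the first entry for each name, which is the one the loader prefers.
        libs.setdefault(fields[0], path.strip())
    return libs


def cached_cuda_libs(output, prefixes=("libcudart", "libcudnn")):
    """Return only the cuda runtime / cudnn entries from the cache listing."""
    return {name: path for name, path in parse_ldconfig(output).items()
            if name.startswith(prefixes)}


def read_ldconfig_cache():
    """Run `ldconfig -p` and return its text output."""
    return subprocess.run(["ldconfig", "-p"], stdout=subprocess.PIPE,
                          universal_newlines=True, check=True).stdout
```

After the steps above, `cached_cuda_libs(read_ldconfig_cache())` should contain entries pointing into /usr/local/cuda-10.0/lib64; an empty result suggests the ld.so.conf line or the `ldconfig` run did not take effect.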

Test that it works

Start python inside the v3.7 virtualenv set up earlier, and :

import torch

#dtype = torch.FloatTensor  # Use this to run on CPU
dtype = torch.cuda.FloatTensor # Use this to run on GPU

a = torch.Tensor( [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]).type(dtype)
b = torch.Tensor( [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]).type(dtype)

c = a.mm(b)

# As usual ...
print(c)  # matrix-multiply (should state : device='cuda:0')

print(c.device)  # Hope for : 'cuda:0'


# And now tensorflow (has to be second, because it's "greedy" : by default it grabs nearly all the GPU memory)

# See : https://www.tensorflow.org/beta/guide/using_gpu
import tensorflow as tf

tf.debugging.set_log_device_placement(True)

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], shape=[2, 3], name='a')
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], shape=[3, 2], name='b')
c = tf.matmul(a, b)

# Eager mode FTW!
print(c)

print(c.device)  # Hope for : /job:localhost/replica:0/task:0/device:GPU:0
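For a quicker smoke test that does not fall over when one of the frameworks is missing, a guarded check along these lines might be useful. It is a sketch, not a replacement for the matrix-multiply test above: `gpu_report` is a made-up name, and for TensorFlow it only reports whether the build has CUDA support compiled in, not whether a GPU is actually visible.

```python
import importlib.util


def gpu_report():
    """Report GPU support per framework.

    Values: True/False from the framework itself, or None when the
    framework is not installed or fails to import.
    """
    report = {"torch": None, "tensorflow": None}

    if importlib.util.find_spec("torch") is not None:
        try:
            import torch
            report["torch"] = torch.cuda.is_available()
        except ImportError:
            pass  # e.g. a shared library could not be loaded

    if importlib.util.find_spec("tensorflow") is not None:
        try:
            import tensorflow as tf
            # Only says the wheel was built with CUDA, not that a GPU was found.
            report["tensorflow"] = tf.test.is_built_with_cuda()
        except ImportError:
            pass

    return report


print(gpu_report())
```

A `None` for either entry points at the virtualenv (wrong interpreter, or the pip install failed) rather than at the cuda install.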

Add a few more libraries

(According to taste, etc) :

pip install jupyter